[00:01:27] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [00:19:09] PROBLEM - Check systemd state on an-web1001 is CRITICAL: CRITICAL - degraded: The following units failed: hardsync-published.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:24:53] PROBLEM - BGP status on cr1-drmrs is CRITICAL: BGP CRITICAL - AS13030/IPv6: Idle - Init7, AS13030/IPv4: Idle - Init7, AS6939/IPv6: Idle - HE, AS6939/IPv4: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [00:26:55] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:27:27] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:30:25] RECOVERY - Check systemd state on an-web1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:31:51] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [00:32:37] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:39:05] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/970830 [00:39:11] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest 
[core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/970830 (owner: 10TrainBranchBot) [00:55:21] (03CR) 10CI reject: [V: 04-1] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/970830 (owner: 10TrainBranchBot) [00:57:26] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/970830 (owner: 10TrainBranchBot) [00:58:25] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:58:33] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [01:02:41] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:02:49] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [01:15:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:26:14] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [01:27:59] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [01:28:07] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 
(clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [02:25:25] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:25:31] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [02:28:13] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:28:19] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [02:30:59] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:31:21] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:42:17] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [02:42:25] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [02:57:43] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:57:49] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [03:03:21] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:03:27] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [03:04:33] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:28:39] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:28:45] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [03:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [04:01:27] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - 
https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [05:10:15] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 143, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:11:03] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 225, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:16:23] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:16:45] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:17:39] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.277 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:18:01] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [05:26:14] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [05:36:01] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1003), Fresh: 136 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:00:05] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231102T0600) 
[06:00:05] kormat, marostegui, and Amir1: (Dis)respected human, time to deploy Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231102T0600). Please do the needful. [06:01:53] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:01:57] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [06:28:47] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:30:17] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [06:33:07] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: clamav-daemon.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:36:21] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 137 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring [06:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [06:55:32] (03PS1) 10Marostegui: install_server: Do not reimage db1232 [puppet] - 10https://gerrit.wikimedia.org/r/970909 [06:58:17] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:58:19] RECOVERY - clamd running on 
vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [07:01:47] !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host vrts1001.eqiad.wmnet [07:03:58] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1232 [puppet] - 10https://gerrit.wikimedia.org/r/970909 (owner: 10Marostegui) [07:05:42] !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host vrts1001.eqiad.wmnet [07:32:57] PROBLEM - Check systemd state on prometheus2005 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:33:13] PROBLEM - Check systemd state on prometheus1006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:33:35] PROBLEM - Check systemd state on prometheus1005 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:33:49] ^ probably related to me setting up the new parsercache section [07:33:53] PROBLEM - Check systemd state on prometheus2006 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:44:33] (JobUnavailable) firing: (6) Reduced availability for job mysql-core in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:45:35] RECOVERY - Check systemd state on prometheus2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:46:09] RECOVERY - 
Check systemd state on prometheus1005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:46:31] RECOVERY - Check systemd state on prometheus2006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:47:13] RECOVERY - Check systemd state on prometheus1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:48:45] (JobUnavailable) firing: (6) Reduced availability for job mysql-core in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [07:51:56] (03CR) 10Muehlenhoff: [C: 03+2] RT: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/970724 (owner: 10Muehlenhoff) [07:53:24] (03PS1) 10Muehlenhoff: Switch RT to nftables [puppet] - 10https://gerrit.wikimedia.org/r/970912 [08:00:05] Amir1, apergos, and jnuche: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport and config training. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231102T0800). [08:00:19] morning folks! this is your friendly neighborhood deployer, back from adventures in hardware self-destruction land, ready to scap away. however, there are no trainees signed up to learn how to deploy, and no patch owners signed up to deploy anything. so, have a peaceful day and see you next time! 
[08:01:28] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [08:12:10] !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Downtime as we setup the new WMDE Airflow instance [08:12:24] !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on an-airflow1007.eqiad.wmnet with reason: Downtime as we setup the new WMDE Airflow instance [08:33:39] (03PS1) 10Arnaudb: prompt: discard color on lines [puppet] - 10https://gerrit.wikimedia.org/r/970832 (https://phabricator.wikimedia.org/T344036) [08:38:51] (03CR) 10Arnaudb: "please help my prompt be better 😄" [puppet] - 10https://gerrit.wikimedia.org/r/970832 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [08:47:06] (03CR) 10Marostegui: [C: 03+1] prompt: discard color on lines [puppet] - 10https://gerrit.wikimedia.org/r/970832 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [08:48:15] !log jmm@cumin2002 START - Cookbook sre.debmonitor.remove-hosts for 1 hosts: cloudcontrol2006-dev.codfw.wmnet [08:48:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.debmonitor.remove-hosts (exit_code=0) for 1 hosts: cloudcontrol2006-dev.codfw.wmnet [08:55:11] (03PS1) 10Zabe: Update Netskope IP ranges [extensions/TrustedXFF] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/970762 (https://phabricator.wikimedia.org/T350199) [08:55:16] jouncebot: nowandnext [08:55:16] For the next 0 hour(s) and 4 minute(s): UTC morning backport and config training (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231102T0800) [08:55:16] In 1 hour(s) and 4 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231102T1000) [08:55:17] In 1 
hour(s) and 4 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231102T1000) [08:55:34] (03CR) 10Zabe: [C: 03+2] Update Netskope IP ranges [extensions/TrustedXFF] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/970762 (https://phabricator.wikimedia.org/T350199) (owner: 10Zabe) [08:56:36] (03PS1) 10JMeybohm: Rebuild php icu67 images to include libxml2 sec updates [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/971057 (https://phabricator.wikimedia.org/T345561) [08:56:38] (03PS1) 10JMeybohm: php7.4-fpm-multiversion-base: Switch to icu67 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/971058 (https://phabricator.wikimedia.org/T345561) [08:57:01] PROBLEM - haproxy process on dbproxy1017 is CRITICAL: PROCS CRITICAL: 0 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [08:57:10] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on dbproxy1017.eqiad.wmnet with reason: decomissionning via T348956 [08:57:14] T348956: decommission dbproxy1017.eqiad.wmnet - https://phabricator.wikimedia.org/T348956 [08:57:25] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on dbproxy1017.eqiad.wmnet with reason: decomissionning via T348956 [08:57:54] (03Merged) 10jenkins-bot: Update Netskope IP ranges [extensions/TrustedXFF] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/970762 (https://phabricator.wikimedia.org/T350199) (owner: 10Zabe) [08:58:20] (03CR) 10Arnaudb: [C: 03+2] prompt: discard color on lines [puppet] - 10https://gerrit.wikimedia.org/r/970832 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [08:59:11] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Rebuild php icu67 images to include libxml2 sec updates [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/971057 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) 
[08:59:13] !log zabe@deploy2002 Started scap: Backport for [[gerrit:970762|Update Netskope IP ranges (T350199)]] [08:59:19] T350199: Update Netskope TrustedXFF IP ranges - https://phabricator.wikimedia.org/T350199 [08:59:49] arnaudb: check the alert above for dbproxy1017 [08:59:52] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/971057 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [09:00:20] arnaudb: Ah I see you downtimed it later, cool [09:00:39] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I would suggest that once migration is done we rename the images back to the name without -icu67 though." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/971058 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [09:00:41] !log zabe@deploy2002 zabe: Backport for [[gerrit:970762|Update Netskope IP ranges (T350199)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:01:21] !log zabe@deploy2002 zabe: Continuing with sync [09:02:48] (03PS1) 10Cathal Mooney: Export direct routes to Switches as well as OSPF & server BGP [homer/public] - 10https://gerrit.wikimedia.org/r/971109 (https://phabricator.wikimedia.org/T344547) [09:03:01] (03CR) 10JMeybohm: [C: 03+2] php7.4-fpm-multiversion-base: Switch to icu67 (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/971058 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [09:03:13] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Rebuild php icu67 images to include libxml2 sec updates [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/971057 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [09:03:17] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] php7.4-fpm-multiversion-base: Switch to icu67 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/971058 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [09:03:36] (03CR) 10Cathal 
Mooney: [C: 03+2] Export direct routes to Switches as well as OSPF & server BGP [homer/public] - 10https://gerrit.wikimedia.org/r/971109 (https://phabricator.wikimedia.org/T344547) (owner: 10Cathal Mooney) [09:03:45] (JobUnavailable) resolved: (5) Reduced availability for job mysql-core in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:04:29] (03PS1) 10Arnaudb: haproxy: disabling notifications on dbproxy1017 [puppet] - 10https://gerrit.wikimedia.org/r/970833 (https://phabricator.wikimedia.org/T350141) [09:04:39] (03Merged) 10jenkins-bot: Export direct routes to Switches as well as OSPF & server BGP [homer/public] - 10https://gerrit.wikimedia.org/r/971109 (https://phabricator.wikimedia.org/T344547) (owner: 10Cathal Mooney) [09:04:43] !log installing krb5 security updates on bullseye [09:04:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:01] (03CR) 10Marostegui: [C: 03+1] haproxy: disabling notifications on dbproxy1017 [puppet] - 10https://gerrit.wikimedia.org/r/970833 (https://phabricator.wikimedia.org/T350141) (owner: 10Arnaudb) [09:05:45] !log installing krb5 security updates on buster/bullseye/bookworm [09:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:06:00] (03CR) 10Marostegui: [C: 03+1] "Btw the bug points to db1131's one." 
[puppet] - 10https://gerrit.wikimedia.org/r/970833 (https://phabricator.wikimedia.org/T350141) (owner: 10Arnaudb) [09:06:21] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 144, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:06:39] !log zabe@deploy2002 Finished scap: Backport for [[gerrit:970762|Update Netskope IP ranges (T350199)]] (duration: 07m 25s) [09:06:45] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:06:49] T350199: Update Netskope TrustedXFF IP ranges - https://phabricator.wikimedia.org/T350199 [09:09:12] (03PS2) 10Giuseppe Lavagetto: docker::builder: add system to properly perform a weekly update [puppet] - 10https://gerrit.wikimedia.org/r/970391 (https://phabricator.wikimedia.org/T344478) [09:09:14] (03PS2) 10Giuseppe Lavagetto: docker::builder: switch systemd timer to our new script [puppet] - 10https://gerrit.wikimedia.org/r/970392 (https://phabricator.wikimedia.org/T344478) [09:10:08] (03PS2) 10Arnaudb: haproxy: disabling notifications on dbproxy1017 [puppet] - 10https://gerrit.wikimedia.org/r/970833 (https://phabricator.wikimedia.org/T348956) [09:11:07] (03CR) 10Arnaudb: [C: 03+2] haproxy: disabling notifications on dbproxy1017 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/970833 (https://phabricator.wikimedia.org/T348956) (owner: 10Arnaudb) [09:12:37] 10SRE, 10Infrastructure-Foundations, 10netops: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (10cmooney) 05Resolved→03Open p:05Low→03Medium [09:13:21] !log published image php7.4-fpm-multiversion-base:7.4.33-6 now based on icu67 php packages - T345561 [09:13:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:25] 
T345561: Upgrade the MediaWiki servers to ICU 67 - https://phabricator.wikimedia.org/T345561 [09:17:13] 10SRE, 10Infrastructure-Foundations, 10netops: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (10cmooney) After deployment in Codfw I noticed an issue which is affecting our EVPN switches. The problem isn't anything to do with EVPN, but more the fact that o... [09:21:34] (03CR) 10Giuseppe Lavagetto: [C: 03+2] docker::builder: add system to properly perform a weekly update [puppet] - 10https://gerrit.wikimedia.org/r/970391 (https://phabricator.wikimedia.org/T344478) (owner: 10Giuseppe Lavagetto) [09:22:27] RECOVERY - haproxy process on dbproxy1017 is OK: PROCS OK: 2 processes with command name haproxy https://wikitech.wikimedia.org/wiki/HAProxy [09:25:24] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on ldap-rw[1001,2001].wikimedia.org with reason: setup in progress [09:25:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on ldap-rw[1001,2001].wikimedia.org with reason: setup in progress [09:26:14] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [09:31:08] (03CR) 10Volans: [C: 03+1] "LGTM, optional nit inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [09:31:37] 10SRE, 10RESTBase: RESTBase page summary is not functional on test.wikipedia.org - https://phabricator.wikimedia.org/T350349 (10Urbanecm_WMF) p:05Triage→03High Tagging with #SRE, since the issue seems to be in the way testwiki is setup, rather than in RESTBase itself. 
[09:31:46] 10SRE, 10RESTBase: RESTBase page summary is not functional on test.wikipedia.org - https://phabricator.wikimedia.org/T350349 (10Urbanecm_WMF) [09:32:33] !log installing openssl bugfix updates from Bullseye point release (update to 1.1.1w) [09:32:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:26] (03PS1) 10Cathal Mooney: Fix syntax error in policy term [homer/public] - 10https://gerrit.wikimedia.org/r/971111 (https://phabricator.wikimedia.org/T344547) [09:39:22] (03PS1) 10Giuseppe Lavagetto: docker::builder: add ssh script [puppet] - 10https://gerrit.wikimedia.org/r/971112 [09:39:36] (03PS1) 10Elukey: changeprop: allow to define Kafka settings for Job Queues [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) [09:39:41] (03PS2) 10Cathal Mooney: Correctly set MED for OSPF routes [homer/public] - 10https://gerrit.wikimedia.org/r/971111 (https://phabricator.wikimedia.org/T344547) [09:40:30] (03CR) 10Cathal Mooney: [C: 03+2] Correctly set MED for OSPF routes [homer/public] - 10https://gerrit.wikimedia.org/r/971111 (https://phabricator.wikimedia.org/T344547) (owner: 10Cathal Mooney) [09:40:37] (03PS2) 10Giuseppe Lavagetto: docker::builder: add ssh script [puppet] - 10https://gerrit.wikimedia.org/r/971112 [09:41:05] (03Merged) 10jenkins-bot: Correctly set MED for OSPF routes [homer/public] - 10https://gerrit.wikimedia.org/r/971111 (https://phabricator.wikimedia.org/T344547) (owner: 10Cathal Mooney) [09:41:22] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/970720 (owner: 10Muehlenhoff) [09:41:59] (03CR) 10Muehlenhoff: [C: 03+2] Switch cuminunpriv to nftables [puppet] - 10https://gerrit.wikimedia.org/r/970720 (owner: 10Muehlenhoff) [09:44:03] (03PS1) 10Elukey: services: upgrade changeprop jobqueue eqiad's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/971114 (https://phabricator.wikimedia.org/T348950) [09:44:48] (03CR) 
10Giuseppe Lavagetto: [C: 03+2] docker::builder: add ssh script [puppet] - 10https://gerrit.wikimedia.org/r/971112 (owner: 10Giuseppe Lavagetto) [09:46:07] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/970722 (owner: 10Muehlenhoff) [09:48:56] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (10cmooney) Routing now looks ok, for instance in esams to the loopbacks of each CR: ` cmooney@asw1-bw27-esams> show route 185.15.59.128/32... [09:50:04] (03PS1) 10Giuseppe Lavagetto: docker::builder: fix image location [puppet] - 10https://gerrit.wikimedia.org/r/971115 [09:52:42] (03CR) 10Giuseppe Lavagetto: [C: 03+2] docker::builder: fix image location [puppet] - 10https://gerrit.wikimedia.org/r/971115 (owner: 10Giuseppe Lavagetto) [09:57:35] (03PS1) 10Muehlenhoff: Setup rsync between apt1001/apt1002 [puppet] - 10https://gerrit.wikimedia.org/r/971117 (https://phabricator.wikimedia.org/T331613) [09:57:37] (03PS1) 10Muehlenhoff: pki: Switch to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/971118 [10:00:04] mvolz: OwO what's this, a deployment window?? Services – Citoid / Zotero. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231102T1000). 
nyaa~ [10:00:04] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231102T1000) [10:00:44] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971117 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff) [10:02:28] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Bump Bitu version to 0.0.2 [software/bitu] - 10https://gerrit.wikimedia.org/r/970281 (owner: 10Slyngshede) [10:02:34] (03CR) 10Muehlenhoff: [C: 03+2] netbox: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/970722 (owner: 10Muehlenhoff) [10:03:25] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (10cmooney) 05Open→03Resolved [10:05:44] I'd like to deploy a backport in two hours, one hour before the regular backport window. Any objections? The patch in question is https://gerrit.wikimedia.org/r/c/mediawiki/core/+/970770 [10:07:21] (03CR) 10EoghanGaffney: [C: 03+1] sre.gitlab.upgrade: unpause runners during downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/970768 (owner: 10Jelto) [10:08:16] (03PS22) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [10:08:50] (03CR) 10Hnowlan: "Internal test via the gateway: `curl -H "Host: wikimedia.org" https://rest-gateway.discovery.wmnet:4113/wikimedia.org/v1/metrics/edits/agg" [puppet] - 10https://gerrit.wikimedia.org/r/970367 (https://phabricator.wikimedia.org/T336385) (owner: 10Hnowlan) [10:09:15] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc101[56] - https://phabricator.wikimedia.org/T342164 (10Marostegui) [10:09:18] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install pc201[56] - https://phabricator.wikimedia.org/T342163 (10Marostegui) [10:11:44] (03PS1) 10Muehlenhoff: 
Switch netbox::standalone to nftables [puppet] - 10https://gerrit.wikimedia.org/r/971119 [10:13:34] (03CR) 10Muehlenhoff: Switch netboxdb to nftables (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/969331 (owner: 10Muehlenhoff) [10:15:54] (03CR) 10Filippo Giunchedi: "Followup from a chat with Simon, I'm holding my +1 until we're not exporting metrics for nics without link" [puppet] - 10https://gerrit.wikimedia.org/r/970329 (https://phabricator.wikimedia.org/T347312) (owner: 10Slyngshede) [10:17:20] (03PS1) 10Arnaudb: mariadb: clone db1136 to db1236 [puppet] - 10https://gerrit.wikimedia.org/r/970834 (https://phabricator.wikimedia.org/T344036) [10:17:50] 10SRE, 10Infrastructure-Foundations, 10netops: Do we need to generate aggregates for LVS service IP ranges - https://phabricator.wikimedia.org/T350354 (10cmooney) p:05Triage→03Low [10:18:15] 10SRE, 10Infrastructure-Foundations, 10netops: Do we need to generate aggregates for LVS service IP ranges? - https://phabricator.wikimedia.org/T350354 (10cmooney) [10:18:21] (03PS2) 10Arnaudb: mariadb: clone db1136 to db1236 [puppet] - 10https://gerrit.wikimedia.org/r/970834 (https://phabricator.wikimedia.org/T344036) [10:20:49] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/970655 (owner: 10Muehlenhoff) [10:21:15] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/970716 (owner: 10Muehlenhoff) [10:22:31] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/970335 (owner: 10Majavah) [10:23:35] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host ganeti2014.codfw.wmnet [10:24:44] (03CR) 10Hnowlan: [C: 03+1] services: upgrade changeprop jobqueue eqiad's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/971114 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [10:25:08] (03CR) 10Majavah: [C: 03+2] kubeadm: only install containerd.io with docker [puppet] - 10https://gerrit.wikimedia.org/r/968634 
(https://phabricator.wikimedia.org/T284656) (owner: 10Majavah) [10:25:21] (03CR) 10Majavah: [C: 03+2] kubeadm: containerd: add kernel modules and config [puppet] - 10https://gerrit.wikimedia.org/r/968635 (https://phabricator.wikimedia.org/T284656) (owner: 10Majavah) [10:25:37] (03PS6) 10Majavah: kubeadm: add required config for containerd [puppet] - 10https://gerrit.wikimedia.org/r/968647 (https://phabricator.wikimedia.org/T284656) [10:26:49] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1006.eqiad.wmnet with OS bookworm [10:27:42] (03CR) 10Jbond: [C: 03+1] "lgtm, minor nits inline" [puppet] - 10https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653) (owner: 10Majavah) [10:27:54] (03CR) 10Majavah: [C: 03+2] kubeadm: add required config for containerd [puppet] - 10https://gerrit.wikimedia.org/r/968647 (https://phabricator.wikimedia.org/T284656) (owner: 10Majavah) [10:28:56] (03CR) 10Jbond: [C: 03+1] P:diffscan: add support for configuring multiple instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653) (owner: 10Majavah) [10:31:28] (03CR) 10Hnowlan: [C: 04-1] "Should have said this for changeprop also, but could you please update the configs for values-beta* to use these options also please?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [10:32:52] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/970396 (https://phabricator.wikimedia.org/T206653) (owner: 10Majavah) [10:33:30] (03PS1) 10Muehlenhoff: Switch ganeti2014 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/971123 (https://phabricator.wikimedia.org/T349619) [10:35:06] (03PS6) 10Majavah: diffscan: add support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/970335 [10:35:08] (03PS7) 10Majavah: P:diffscan: add support for configuring multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653) [10:35:10] (03PS3) 10Majavah: P:diffscan: add scan for WMCS infrastructure addresses [puppet] - 10https://gerrit.wikimedia.org/r/970396 (https://phabricator.wikimedia.org/T206653) [10:36:06] (03CR) 10Muehlenhoff: [C: 03+2] Switch ganeti2014 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/971123 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [10:38:53] (03CR) 10Majavah: [C: 03+2] diffscan: add support for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/970335 (owner: 10Majavah) [10:39:32] (03CR) 10Jbond: [C: 04-1] "lgtm but can we add additional checks" [puppet] - 10https://gerrit.wikimedia.org/r/970727 (owner: 10Majavah) [10:39:34] (03CR) 10Majavah: [C: 03+2] P:diffscan: add support for configuring multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653) (owner: 10Majavah) [10:39:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host ganeti2014.codfw.wmnet [10:40:04] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/970728 (https://phabricator.wikimedia.org/T349687) (owner: 10Majavah) [10:40:29] (03PS6) 10EoghanGaffney: [apt-staging] Add apt-staging host for CI 
pipeline [puppet] - 10https://gerrit.wikimedia.org/r/968288 [10:40:49] (03PS8) 10Majavah: P:diffscan: add support for configuring multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/970336 (https://phabricator.wikimedia.org/T206653) [10:40:51] (03PS4) 10Majavah: P:diffscan: add scan for WMCS infrastructure addresses [puppet] - 10https://gerrit.wikimedia.org/r/970396 (https://phabricator.wikimedia.org/T206653) [10:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [10:43:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/952459 (owner: 10Muehlenhoff) [10:45:57] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/286/con" [puppet] - 10https://gerrit.wikimedia.org/r/970396 (https://phabricator.wikimedia.org/T206653) (owner: 10Majavah) [10:46:17] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:diffscan: add scan for WMCS infrastructure addresses (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/970396 (https://phabricator.wikimedia.org/T206653) (owner: 10Majavah) [10:52:03] (03CR) 10Hnowlan: "lgtm - this user will need to be added to the users list on the corresponding cluster just fyi" [puppet] - 10https://gerrit.wikimedia.org/r/970848 (https://phabricator.wikimedia.org/T348993) (owner: 10Eevans) [10:53:27] (03CR) 10Marostegui: mariadb: clone db1136 to db1236 (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/970834 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [10:53:41] PROBLEM - Host lsw1-b2-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:54:23] (03PS1) 10Majavah: diffscan: fix state for 
multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/971126 [10:54:27] PROBLEM - Host lsw1-a5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:54:27] PROBLEM - Host lsw1-a8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:54:31] PROBLEM - Host lsw1-b7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:54:31] PROBLEM - Host lsw1-b8-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:54:46] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/971119 (owner: 10Muehlenhoff) [10:54:51] PROBLEM - Host lsw1-a4-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:54:51] PROBLEM - Host lsw1-a1-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:54:51] PROBLEM - Host lsw1-b3-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:54:51] PROBLEM - Host lsw1-b6-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:55:03] PROBLEM - Host lsw1-b7-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:55:03] PROBLEM - Host lsw1-a2-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:55:11] is there some ongoing maintenance? 
[10:55:13] PROBLEM - Host lsw1-a7-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:55:13] PROBLEM - Host lsw1-b4-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:55:13] PROBLEM - Host lsw1-b8-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:55:17] PROBLEM - Host lsw1-b4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:55:17] PROBLEM - Host lsw1-a5-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:55:35] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/969331 (owner: 10Muehlenhoff) [10:55:45] PROBLEM - Host lsw1-a2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:55:52] 10SRE, 10Infrastructure-Foundations, 10Traffic-Icebox, 10netops, 10Patch-For-Review: Create Generalised blocking strategy - https://phabricator.wikimedia.org/T270618 (10jbond) > think it would be better if we close this and create smaller tickets with more focused scope. i don't think we need to close t... [10:56:00] topranks: ^^^ related to some ongoing work? 
[10:56:25] PROBLEM - Host lsw1-a6-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:56:25] PROBLEM - Host lsw1-b5-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:56:25] PROBLEM - Host lsw1-a4-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:56:25] PROBLEM - Host lsw1-a7-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:56:25] PROBLEM - Host lsw1-b5-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:56:26] PROBLEM - Host lsw1-b3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:56:33] PROBLEM - Host lsw1-b2-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:56:33] PROBLEM - Host lsw1-a6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:56:35] PROBLEM - Host lsw1-a3-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:56:40] volans: thanks, yep, not in service but the result of a test I did no doubt [10:56:43] PROBLEM - Host lsw1-a3-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:56:43] PROBLEM - Host lsw1-b6-codfw is DOWN: PING CRITICAL - Packet loss = 100% [10:56:49] ok [10:56:54] let me downtime them - sry this wasn't expected I'm scratching my head [10:57:03] PROBLEM - Host lsw1-a8-codfw IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [10:57:50] (03PS12) 10Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) [10:57:54] reverted for now sry [10:57:55] RECOVERY - Host lsw1-a8-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.26 ms [10:57:55] (03CR) 10Jbond: [C: 03+2] sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [10:58:10] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/288/con" [puppet] - 10https://gerrit.wikimedia.org/r/971126 (owner: 
10Majavah) [10:59:37] RECOVERY - Host lsw1-b2-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.45 ms [10:59:41] RECOVERY - Host lsw1-a5-codfw is UP: PING OK - Packet loss = 0%, RTA = 35.85 ms [10:59:45] RECOVERY - Host lsw1-b7-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.50 ms [10:59:45] RECOVERY - Host lsw1-b8-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.24 ms [10:59:57] PROBLEM - Host sretest2004 is DOWN: PING CRITICAL - Packet loss = 100% [11:00:07] RECOVERY - Host lsw1-a1-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.43 ms [11:00:07] RECOVERY - Host lsw1-a4-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.62 ms [11:00:07] RECOVERY - Host lsw1-b6-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 31.87 ms [11:00:07] RECOVERY - Host lsw1-b3-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.47 ms [11:00:17] RECOVERY - Host lsw1-a2-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.51 ms [11:00:18] RECOVERY - Host lsw1-b7-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.46 ms [11:00:18] RECOVERY - Host lsw1-a7-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.51 ms [11:00:18] RECOVERY - Host lsw1-b4-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.43 ms [11:00:18] RECOVERY - Host lsw1-b8-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.46 ms [11:00:19] RECOVERY - Host lsw1-a5-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.48 ms [11:00:19] RECOVERY - Host lsw1-b4-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.80 ms [11:00:26] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 34 hosts with reason: testing new bgp policy [11:00:47] RECOVERY - Host lsw1-a2-codfw is UP: PING OK - Packet loss = 0%, RTA = 35.19 ms [11:00:54] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 34 hosts with reason: testing new bgp policy [11:01:00] 10SRE, 10Infrastructure-Foundations, 10netops: Announce internal/core routes from CRs to L3 switches - 
https://phabricator.wikimedia.org/T344547 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=c5da2d0a-c4af-4f96-b651-e1b326898629) set by cmooney@cumin1001 for 2:00:00 on 34 host(s) and the... [11:01:27] RECOVERY - Host lsw1-a4-codfw is UP: PING OK - Packet loss = 0%, RTA = 33.97 ms [11:01:27] RECOVERY - Host lsw1-a6-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.50 ms [11:01:29] RECOVERY - Host lsw1-b5-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.66 ms [11:01:29] RECOVERY - Host lsw1-a7-codfw is UP: PING OK - Packet loss = 0%, RTA = 35.12 ms [11:01:29] RECOVERY - Host lsw1-b3-codfw is UP: PING OK - Packet loss = 0%, RTA = 34.21 ms [11:01:29] RECOVERY - Host lsw1-b5-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms [11:01:37] RECOVERY - Host lsw1-a6-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.53 ms [11:01:37] RECOVERY - Host lsw1-b2-codfw is UP: PING OK - Packet loss = 0%, RTA = 32.62 ms [11:01:37] RECOVERY - Host lsw1-a3-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.44 ms [11:01:47] RECOVERY - Host lsw1-a3-codfw is UP: PING OK - Packet loss = 0%, RTA = 31.10 ms [11:01:47] RECOVERY - Host lsw1-b6-codfw is UP: PING OK - Packet loss = 0%, RTA = 51.74 ms [11:02:05] RECOVERY - Host lsw1-a8-codfw IPv6 is UP: PING OK - Packet loss = 0%, RTA = 30.47 ms [11:02:13] (03CR) 10Vgutierrez: [C: 04-1] haproxy: enable healthcheck-dedicated backend (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [11:02:16] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/970912 (owner: 10Muehlenhoff) [11:02:18] (03Merged) 10jenkins-bot: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [11:02:42] (03PS2) 10Majavah: systemd: allow passing source to a unit [puppet] - 10https://gerrit.wikimedia.org/r/970727 
[11:02:44] (03PS2) 10Majavah: ldap: client: auto-restart sssd-nss on failure [puppet] - 10https://gerrit.wikimedia.org/r/970728 (https://phabricator.wikimedia.org/T349687) [11:02:49] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/971117 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff) [11:02:51] (03PS1) 10Hnowlan: page-analytics: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/971127 (https://phabricator.wikimedia.org/T348879) [11:02:55] (03CR) 10Majavah: systemd: allow passing source to a unit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/970727 (owner: 10Majavah) [11:03:04] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/971118 (owner: 10Muehlenhoff) [11:03:39] (03CR) 10Jbond: [C: 03+1] "lgtm" [cookbooks] - 10https://gerrit.wikimedia.org/r/970768 (owner: 10Jelto) [11:08:38] (03CR) 10Santiago Faci: [C: 03+1] "It looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/971127 (https://phabricator.wikimedia.org/T348879) (owner: 10Hnowlan) [11:09:15] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp4037.ulsfo.wmnet} and A:cp [11:09:25] (03CR) 10Muehlenhoff: [C: 03+2] bastion: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/952459 (owner: 10Muehlenhoff) [11:10:39] !log rolling upgrade of HAProxy to version 2.6.15-1~bpo11+1 in ulsfo [11:10:41] (03PS23) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [11:10:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:47] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp4037.ulsfo.wmnet} and A:cp [11:11:43] (03CR) 10Jbond: "See inline i think we need to do a bit more to make tis define solid" [puppet] - 
10https://gerrit.wikimedia.org/r/971126 (owner: 10Majavah) [11:11:51] (03CR) 10Hnowlan: [C: 03+2] page-analytics: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/971127 (https://phabricator.wikimedia.org/T348879) (owner: 10Hnowlan) [11:12:40] (03Merged) 10jenkins-bot: page-analytics: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/971127 (https://phabricator.wikimedia.org/T348879) (owner: 10Hnowlan) [11:12:56] !log fnegri@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcontrol1006.eqiad.wmnet with OS bookworm [11:12:58] (03PS2) 10Majavah: diffscan: fix state for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/971126 [11:13:51] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/289/con" [puppet] - 10https://gerrit.wikimedia.org/r/971126 (owner: 10Majavah) [11:13:56] (03CR) 10Fabfur: haproxy: enable healthcheck-dedicated backend (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [11:14:40] (03CR) 10Majavah: [V: 03+1] diffscan: fix state for multiple instances (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971126 (owner: 10Majavah) [11:15:39] (03CR) 10Jbond: [C: 03+1] "Lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/970727 (owner: 10Majavah) [11:15:57] !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-ulsfo and not P{cp4037.ulsfo.wmnet} and A:cp [11:16:19] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/971126 (owner: 10Majavah) [11:16:26] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/page-analytics: apply [11:16:28] (03CR) 10Majavah: [V: 03+1 C: 03+2] diffscan: fix state for multiple instances [puppet] - 10https://gerrit.wikimedia.org/r/971126 (owner: 10Majavah) 
[11:16:54] (03CR) 10Majavah: [C: 03+2] systemd: allow passing source to a unit [puppet] - 10https://gerrit.wikimedia.org/r/970727 (owner: 10Majavah) [11:16:55] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [11:17:32] (03CR) 10Muehlenhoff: [C: 03+2] Switch RT to nftables [puppet] - 10https://gerrit.wikimedia.org/r/970912 (owner: 10Muehlenhoff) [11:18:27] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/page-analytics: apply [11:18:53] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/page-analytics: apply [11:19:04] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/page-analytics: apply [11:19:09] (03PS1) 10Giuseppe Lavagetto: Add proper build dependencies on golang [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/971128 (https://phabricator.wikimedia.org/T350366) [11:19:33] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/page-analytics: apply [11:20:04] (03CR) 10Elukey: [C: 03+1] Add proper build dependencies on golang [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/971128 (https://phabricator.wikimedia.org/T350366) (owner: 10Giuseppe Lavagetto) [11:21:30] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] Add proper build dependencies on golang [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/971128 (https://phabricator.wikimedia.org/T350366) (owner: 10Giuseppe Lavagetto) [11:22:49] (03CR) 10Muehlenhoff: [C: 03+2] Switch builder role to nftables [puppet] - 10https://gerrit.wikimedia.org/r/970655 (owner: 10Muehlenhoff) [11:23:07] 10SRE, 10Infrastructure-Foundations, 10netops: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (10cmooney) One other observation is that the MED setting does not optimize the outbound path where we are using EVPN. One might hope that a LEAF switch, learning... 
[11:23:56] (03PS3) 10Giuseppe Lavagetto: docker::builder: switch systemd timer to our new script [puppet] - 10https://gerrit.wikimedia.org/r/970392 (https://phabricator.wikimedia.org/T344478) [11:23:58] (03PS1) 10Hnowlan: rest-gateway: route pageviews api spec [deployment-charts] - 10https://gerrit.wikimedia.org/r/971129 (https://phabricator.wikimedia.org/T348879) [11:26:50] (03CR) 10Cathal Mooney: Adjust reimage cookbook config for DHCP binding clear workaround (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [11:26:59] (03PS1) 10Muehlenhoff: Revert "Switch builder role to nftables" [puppet] - 10https://gerrit.wikimedia.org/r/971130 [11:27:01] (03CR) 10Giuseppe Lavagetto: [C: 03+2] docker::builder: switch systemd timer to our new script [puppet] - 10https://gerrit.wikimedia.org/r/970392 (https://phabricator.wikimedia.org/T344478) (owner: 10Giuseppe Lavagetto) [11:28:19] (03CR) 10Muehlenhoff: [C: 03+2] Revert "Switch builder role to nftables" [puppet] - 10https://gerrit.wikimedia.org/r/971130 (owner: 10Muehlenhoff) [11:28:42] _joe_: I'll merge your patch along [11:28:52] <_joe_> moritzm: oh thanks [11:28:55] <_joe_> I was about to [11:29:21] and done :-) [11:31:08] (03CR) 10Volans: [C: 03+1] "LGTM but make sure to test it with a reimage in a normal rack without EVPN and another one in row E/F" [cookbooks] - 10https://gerrit.wikimedia.org/r/969175 (https://phabricator.wikimedia.org/T306421) (owner: 10Cathal Mooney) [11:31:48] (03CR) 10Muehlenhoff: [C: 03+2] Switch netbox::standalone to nftables [puppet] - 10https://gerrit.wikimedia.org/r/971119 (owner: 10Muehlenhoff) [11:32:01] (03CR) 10EoghanGaffney: [C: 03+2] [apt-staging] Add apt-staging host for CI pipeline [puppet] - 10https://gerrit.wikimedia.org/r/968288 (owner: 10EoghanGaffney) [11:32:35] (03CR) 10EoghanGaffney: [C: 03+2] [apt-staging] Add apt-staging host for CI pipeline (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/968288 (owner: 10EoghanGaffney) [11:33:21] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1006.eqiad.wmnet with OS bookworm [11:34:55] (03CR) 10Arnaudb: "patch on bashrc is a piggy back from the same phabricator task id as https://gerrit.wikimedia.org/r/c/operations/puppet/+/970832/, will di" [puppet] - 10https://gerrit.wikimedia.org/r/970834 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [11:37:21] (03CR) 10Marostegui: [C: 03+1] mariadb: clone db1136 to db1236 [puppet] - 10https://gerrit.wikimedia.org/r/970834 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [11:37:53] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: route pageviews api spec [deployment-charts] - 10https://gerrit.wikimedia.org/r/971129 (https://phabricator.wikimedia.org/T348879) (owner: 10Hnowlan) [11:38:05] (03CR) 10Arnaudb: [C: 03+2] mariadb: clone db1136 to db1236 [puppet] - 10https://gerrit.wikimedia.org/r/970834 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [11:38:12] (03PS1) 10Jbond: puppet: try to deal with existing puppet runs [software/spicerack] - 10https://gerrit.wikimedia.org/r/971133 [11:38:42] (03Merged) 10jenkins-bot: rest-gateway: route pageviews api spec [deployment-charts] - 10https://gerrit.wikimedia.org/r/971129 (https://phabricator.wikimedia.org/T348879) (owner: 10Hnowlan) [11:44:47] (03PS2) 10Elukey: changeprop: allow to define Kafka settings for Job Queues [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) [11:44:49] (03PS2) 10Elukey: services: upgrade changeprop jobqueue eqiad's docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/971114 (https://phabricator.wikimedia.org/T348950) [11:45:17] (03CR) 10CI reject: [V: 04-1] puppet: try to deal with existing puppet runs [software/spicerack] - 10https://gerrit.wikimedia.org/r/971133 (owner: 10Jbond) [11:45:44] !log hnowlan@deploy2002 helmfile [staging] START 
helmfile.d/services/rest-gateway: apply [11:45:56] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/rest-gateway: apply [11:46:15] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1006.eqiad.wmnet with reason: host reimage [11:48:15] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host netflow6001.drmrs.wmnet [11:49:21] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1006.eqiad.wmnet with reason: host reimage [11:49:35] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:49:41] (03PS1) 10Jbond: netflow6001: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971135 (https://phabricator.wikimedia.org/T349619) [11:49:46] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:50:00] (03CR) 10Jbond: [C: 03+2] netflow6001: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971135 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [11:50:37] !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-ulsfo and not P{cp4037.ulsfo.wmnet} and A:cp [11:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:51:30] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:51:45] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:53:38] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/rest-gateway: apply [11:53:51] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/rest-gateway: apply [11:53:55] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/rest-gateway: apply [11:54:05] !log 
hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/rest-gateway: apply [11:54:25] (03PS2) 10Muehlenhoff: Setup rsync between apt1001/apt1002 [puppet] - 10https://gerrit.wikimedia.org/r/971117 (https://phabricator.wikimedia.org/T331613) [11:55:32] (03PS1) 10Hnowlan: rest-gateway: correct page-analytics spec path [deployment-charts] - 10https://gerrit.wikimedia.org/r/971136 (https://phabricator.wikimedia.org/T348879) [11:55:44] (03CR) 10Muehlenhoff: [C: 03+2] sretest: Enable nftables on the role level [puppet] - 10https://gerrit.wikimedia.org/r/970716 (owner: 10Muehlenhoff) [11:55:58] (03CR) 10Jelto: [C: 03+2] sre.gitlab.upgrade: unpause runners during downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/970768 (owner: 10Jelto) [11:58:50] (03PS1) 10Daniel Kinzler: ParsoidHandler: emit relative URLs in redirects [core] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/970764 (https://phabricator.wikimedia.org/T350219) [11:58:54] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host netflow6001.drmrs.wmnet [11:59:06] (03CR) 10Muehlenhoff: [C: 03+2] Setup rsync between apt1001/apt1002 [puppet] - 10https://gerrit.wikimedia.org/r/971117 (https://phabricator.wikimedia.org/T331613) (owner: 10Muehlenhoff) [11:59:10] (03CR) 10Daniel Kinzler: [C: 03+2] "merging for backport deploy" [core] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/970764 (https://phabricator.wikimedia.org/T350219) (owner: 10Daniel Kinzler) [12:00:06] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231102T1200) [12:00:42] (03Merged) 10jenkins-bot: sre.gitlab.upgrade: unpause runners during downtime [cookbooks] - 10https://gerrit.wikimedia.org/r/970768 (owner: 10Jelto) [12:01:28] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - 
https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [12:01:51] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: correct page-analytics spec path [deployment-charts] - 10https://gerrit.wikimedia.org/r/971136 (https://phabricator.wikimedia.org/T348879) (owner: 10Hnowlan) [12:02:02] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (10cmooney) @papaul hoping to tackle these in this order, want to do the asw-a ones first, then the asw-b ones. |Order|ASW... [12:02:37] (03Merged) 10jenkins-bot: rest-gateway: correct page-analytics spec path [deployment-charts] - 10https://gerrit.wikimedia.org/r/971136 (https://phabricator.wikimedia.org/T348879) (owner: 10Hnowlan) [12:04:04] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: ganeti [12:05:14] (03CR) 10Hashar: [C: 03+1] "That can be deployed at anytime. I am not sure whether Gerrit has to be restarted for the change to be reflected. 
If it does require a res" [puppet] - 10https://gerrit.wikimedia.org/r/970283 (https://phabricator.wikimedia.org/T350124) (owner: 10Aklapper) [12:05:17] (03CR) 10Hnowlan: [V: 03+1 C: 03+2] service_proxy: add rest-gateway to listeners [puppet] - 10https://gerrit.wikimedia.org/r/968617 (https://phabricator.wikimedia.org/T348731) (owner: 10Hnowlan) [12:07:10] (03PS1) 10Gerrit maintenance bot: Add bbc to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/970835 (https://phabricator.wikimedia.org/T350320) [12:08:59] (03PS1) 10Muehlenhoff: Switch Ganeti to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971137 (https://phabricator.wikimedia.org/T349619) [12:10:49] (03CR) 10Muehlenhoff: [C: 03+2] Switch Ganeti to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971137 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:14:01] (03PS1) 10Majavah: prometheus: ipmi_exporter: add explicit dependency for sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/971141 [12:14:25] (03CR) 10Majavah: "This should prevent spam like cloudcontrol1006 just sent." [puppet] - 10https://gerrit.wikimedia.org/r/971141 (owner: 10Majavah) [12:15:54] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Add bbc to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/970835 (https://phabricator.wikimedia.org/T350320) (owner: 10Gerrit maintenance bot) [12:16:28] (03Merged) 10jenkins-bot: ParsoidHandler: emit relative URLs in redirects [core] (wmf/1.42.0-wmf.3) - 10https://gerrit.wikimedia.org/r/970764 (https://phabricator.wikimedia.org/T350219) (owner: 10Daniel Kinzler) [12:17:19] (03PS1) 10Majavah: base: add explicit dependency on rasdaemon [puppet] - 10https://gerrit.wikimedia.org/r/971142 [12:18:51] (03PS1) 10Slyngshede: Packaging: Remove CAS plugin for social-auth. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/971144 [12:19:42] (03PS8) 10JMeybohm: Enable icu67 component on mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/954659 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [12:19:59] (03CR) 10Slyngshede: [V: 03+2 C: 03+2] Packaging: Remove CAS plugin for social-auth. [software/bitu] - 10https://gerrit.wikimedia.org/r/971144 (owner: 10Slyngshede) [12:20:01] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1006.eqiad.wmnet with OS bookworm [12:20:07] (03PS8) 10JMeybohm: Enable icu67 component on canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/954657 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [12:20:22] (03PS8) 10JMeybohm: Enable icu67 fleet wide [puppet] - 10https://gerrit.wikimedia.org/r/954661 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [12:21:37] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/954657 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [12:21:51] !log daniel@deploy2002 Started scap: Backport for [[gerrit:970764|ParsoidHandler: emit relative URLs in redirects (T350219 T349001)]] [12:21:59] (03CR) 10JMeybohm: [C: 03+2] Enable icu67 component on mwmaint hosts [puppet] - 10https://gerrit.wikimedia.org/r/954659 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [12:22:01] (03CR) 10JMeybohm: [C: 03+2] Enable icu67 component on canary hosts [puppet] - 10https://gerrit.wikimedia.org/r/954657 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [12:22:01] T350219: Page html output on testwiki returns 404 - https://phabricator.wikimedia.org/T350219 [12:22:02] T349001: Use relative URLs in redirects emitted by rest.php - https://phabricator.wikimedia.org/T349001 [12:26:32] (03CR) 10JMeybohm: [C: 03+2] Enable icu67 fleet wide [puppet] - 
10https://gerrit.wikimedia.org/r/954661 (https://phabricator.wikimedia.org/T345561) (owner: 10Alexandros Kosiaris) [12:29:43] (03CR) 10Jelto: [C: 03+2] Correct Gerrit Privacy Policy [puppet] - 10https://gerrit.wikimedia.org/r/970283 (https://phabricator.wikimedia.org/T350124) (owner: 10Aklapper) [12:31:42] !log upgrading cloudweb to ICU67 T345561 [12:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:31:55] T345561: Upgrade the MediaWiki servers to ICU 67 - https://phabricator.wikimedia.org/T345561 [12:33:35] (03PS1) 10Majavah: openstack: add wrapper for the common patch pattern [puppet] - 10https://gerrit.wikimedia.org/r/971158 [12:34:03] (03CR) 10CI reject: [V: 04-1] openstack: add wrapper for the common patch pattern [puppet] - 10https://gerrit.wikimedia.org/r/971158 (owner: 10Majavah) [12:34:24] (03CR) 10Jelto: [C: 03+2] "merged and deployed to Gerrit. The links on the footer are updated without a restart." [puppet] - 10https://gerrit.wikimedia.org/r/970283 (https://phabricator.wikimedia.org/T350124) (owner: 10Aklapper) [12:36:06] !log daniel@deploy2002 daniel: Backport for [[gerrit:970764|ParsoidHandler: emit relative URLs in redirects (T350219 T349001)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:36:10] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/291/con" [puppet] - 10https://gerrit.wikimedia.org/r/971158 (owner: 10Majavah) [12:36:11] T350219: Page html output on testwiki returns 404 - https://phabricator.wikimedia.org/T350219 [12:36:12] T349001: Use relative URLs in redirects emitted by rest.php - https://phabricator.wikimedia.org/T349001 [12:36:34] (03PS2) 10Majavah: openstack: add wrapper for the common patch pattern [puppet] - 10https://gerrit.wikimedia.org/r/971158 [12:37:49] (03PS3) 10Majavah: openstack: add wrapper for the common patch pattern [puppet] - 
10https://gerrit.wikimedia.org/r/971158 [12:37:58] !log daniel@deploy2002 daniel: Continuing with sync [12:38:06] !log upgrading snapshot* to ICU67 T345561 [12:38:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:38:09] T345561: Upgrade the MediaWiki servers to ICU 67 - https://phabricator.wikimedia.org/T345561 [12:39:46] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/292/con" [puppet] - 10https://gerrit.wikimedia.org/r/971158 (owner: 10Majavah) [12:43:28] !log daniel@deploy2002 Finished scap: Backport for [[gerrit:970764|ParsoidHandler: emit relative URLs in redirects (T350219 T349001)]] (duration: 21m 37s) [12:43:34] T350219: Page html output on testwiki returns 404 - https://phabricator.wikimedia.org/T350219 [12:43:34] T349001: Use relative URLs in redirects emitted by rest.php - https://phabricator.wikimedia.org/T349001 [12:43:58] (03PS1) 10Majavah: P:openstack: galera: fix ordering issue [puppet] - 10https://gerrit.wikimedia.org/r/971159 [12:45:20] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/293/con" [puppet] - 10https://gerrit.wikimedia.org/r/971159 (owner: 10Majavah) [12:46:28] !log jmm@cumin2002 END (FAIL) - Cookbook sre.puppet.migrate-role (exit_code=99) for role: ganeti [12:46:36] !log running fleet wide php upgrades - T345561 [12:46:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:46:40] T345561: Upgrade the MediaWiki servers to ICU 67 - https://phabricator.wikimedia.org/T345561 [12:46:54] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/971142 (owner: 10Majavah) [12:47:14] ^ jynus | hnowlan | urandom [12:47:24] ack, thanks [12:47:28] gotcha [12:47:47] (03PS1) 10Majavah: rsync: add explicit dependency on the package [puppet] - 
10https://gerrit.wikimedia.org/r/971160 [12:47:54] (03CR) 10Majavah: [C: 03+2] base: add explicit dependency on rasdaemon [puppet] - 10https://gerrit.wikimedia.org/r/971142 (owner: 10Majavah) [12:55:26] (03PS1) 10Majavah: P:openstack::base: fix project_grants ordering [puppet] - 10https://gerrit.wikimedia.org/r/971162 [12:58:14] (03PS1) 10Jbond: sre.puppet.migrate: disable the puppet timer for the run [cookbooks] - 10https://gerrit.wikimedia.org/r/971163 [12:58:25] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 15): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/294/console" [puppet] - 10https://gerrit.wikimedia.org/r/971162 (owner: 10Majavah) [12:59:34] !log upgrading deployment servers to ICU67 T345561 [12:59:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:59:38] T345561: Upgrade the MediaWiki servers to ICU 67 - https://phabricator.wikimedia.org/T345561 [13:00:06] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231102T1300). [13:00:06] No Gerrit patches in the queue for this window AFAICS. 
[13:00:26] (03PS2) 10Majavah: prometheus: ipmi_exporter: add explicit dependency for sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/971141 [13:00:28] (03PS1) 10Majavah: prometheus: ipmi_exporter: remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/971164 [13:00:42] indeed, nothing to do [13:01:22] o/ indeed [13:02:07] (03PS2) 10Jbond: sre.puppet.migrate: disable the puppet timer for the run [cookbooks] - 10https://gerrit.wikimedia.org/r/971163 [13:02:46] (03CR) 10Jbond: "ready for review" [cookbooks] - 10https://gerrit.wikimedia.org/r/971163 (owner: 10Jbond) [13:04:33] yay [13:07:32] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/971163 (owner: 10Jbond) [13:14:44] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: netinsights [13:16:08] (03PS1) 10Jbond: netinsights: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971188 [13:16:29] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/971160 (owner: 10Majavah) [13:17:40] (03CR) 10CI reject: [V: 04-1] netinsights: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971188 (owner: 10Jbond) [13:18:06] (03CR) 10Kamila Součková: "This is verbatim from upstream, changes will be in a separate commit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/970425 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [13:18:18] (03CR) 10Majavah: [C: 03+2] rsync: add explicit dependency on the package [puppet] - 10https://gerrit.wikimedia.org/r/971160 (owner: 10Majavah) [13:19:11] (03PS2) 10Jbond: netinsights: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971188 [13:20:18] (03CR) 10Jbond: [C: 03+2] netinsights: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971188 (owner: 10Jbond) [13:26:14] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: The RDF Streaming Updater is using more than 50GiB of storage - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [13:27:38] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: netinsights [13:27:58] !log jayme@deploy2002 Started scap: upgrading ICU67 [13:28:02] (03CR) 10Ssingh: [C: 03+1] druid: Add druid druid10[09-11] to druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/962250 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [13:28:05] (03CR) 10Ssingh: [C: 03+2] druid: Add druid druid10[09-11] to druid_public_broker VIP [puppet] - 10https://gerrit.wikimedia.org/r/962250 (https://phabricator.wikimedia.org/T336042) (owner: 10Stevemunene) [13:28:49] (03PS1) 10Muehlenhoff: Also enable icu67 on cloudweb/codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/971189 (https://phabricator.wikimedia.org/T345561) [13:29:46] (03PS2) 10Brouberol: Hide skein private key diff in puppet logs [puppet] - 10https://gerrit.wikimedia.org/r/970408 (https://phabricator.wikimedia.org/T329398) [13:29:48] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [13:29:58] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-host for host install6002.wikimedia.org [13:30:27] (03PS5) 10Fabfur: Basic retry mechanism for specific kafka errors [software/purged] - 10https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) [13:30:40] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/971189 (https://phabricator.wikimedia.org/T345561) (owner: 10Muehlenhoff) [13:31:36] (03PS1) 10Jbond: install6002: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971190 (https://phabricator.wikimedia.org/T349619) [13:32:49] (03CR) 10Vgutierrez: [C: 03+1] 
haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [13:34:27] (03CR) 10Muehlenhoff: [C: 03+2] Also enable icu67 on cloudweb/codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/971189 (https://phabricator.wikimedia.org/T345561) (owner: 10Muehlenhoff) [13:34:40] !log restart pybal on lvs1020 [13:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:43] (03CR) 10Jbond: [C: 03+2] install6002: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971190 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [13:38:23] (03PS3) 10Jbond: sre.puppet.migrate: disable the puppet timer for the run [cookbooks] - 10https://gerrit.wikimedia.org/r/971163 [13:41:32] (03PS1) 10Muehlenhoff: Enable icu67 for parsoid::testing [puppet] - 10https://gerrit.wikimedia.org/r/971191 (https://phabricator.wikimedia.org/T345561) [13:42:29] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host install6002.wikimedia.org [13:42:44] (03CR) 10CI reject: [V: 04-1] sre.puppet.migrate: disable the puppet timer for the run [cookbooks] - 10https://gerrit.wikimedia.org/r/971163 (owner: 10Jbond) [13:43:09] !log jayme@deploy2002 Finished scap: upgrading ICU67 (duration: 15m 10s) [13:43:49] (03CR) 10Muehlenhoff: [C: 03+2] Enable icu67 for parsoid::testing [puppet] - 10https://gerrit.wikimedia.org/r/971191 (https://phabricator.wikimedia.org/T345561) (owner: 10Muehlenhoff) [13:45:28] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [13:46:56] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/971163 (owner: 10Jbond) [13:47:56] (03CR) 10Volans: [C: 03+1] sre.puppet.migrate: disable the puppet timer for the run (032 comments) [cookbooks] - 
10https://gerrit.wikimedia.org/r/971163 (owner: 10Jbond) [13:48:43] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Andrew) [13:50:03] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:50:32] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: installserver [13:51:26] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I think it's ok as a stopgap measure, but I think we should take the time to invest a bit into purged." [software/purged] - 10https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) (owner: 10Fabfur) [13:52:00] (03PS1) 10Jbond: installserver: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971194 (https://phabricator.wikimedia.org/T349619) [13:53:57] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:54:08] (03PS1) 10Muehlenhoff: profile::ci::php Also add the icu67 component following what was done for prod [puppet] - 10https://gerrit.wikimedia.org/r/971195 (https://phabricator.wikimedia.org/T345561) [13:54:09] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:54:43] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:54:50] (03PS2) 10Eevans: cassandra: add grants for new mobileapps tables [puppet] - 10https://gerrit.wikimedia.org/r/970848 (https://phabricator.wikimedia.org/T348993) [13:55:53] (03CR) 10Eevans: cassandra: add grants for new mobileapps tables (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/970848 (https://phabricator.wikimedia.org/T348993) (owner: 10Eevans) [13:56:51] (03PS4) 10Jbond: sre.puppet.migrate: disable the puppet timer for the run [cookbooks] - 10https://gerrit.wikimedia.org/r/971163 [13:56:57] (03PS1) 10Cathal Mooney: Move mr1-codfw OSPF interface to et-1/0/0 on CRs after migration [homer/public] - 10https://gerrit.wikimedia.org/r/971197 (https://phabricator.wikimedia.org/T347191) [13:57:10] (03CR) 10Jbond: sre.puppet.migrate: disable the puppet timer for the run (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/971163 (owner: 10Jbond) [13:57:29] (03CR) 10Jbond: sre.puppet.migrate: disable the puppet timer for the run (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/971163 (owner: 10Jbond) [13:58:28] (03CR) 10Jbond: [C: 03+2] installserver: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971194 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [13:58:47] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:59:01] (03CR) 10FNegri: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/971159 (owner: 10Majavah) [13:59:25] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.879 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [13:59:35] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50861 bytes in 0.114 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:00:43] (03PS1) 10Majavah: prometheus: mysqld_exporter: fix puppet ordering issue [puppet] - 10https://gerrit.wikimedia.org/r/971198 [14:01:13] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:04:08] jouncebot: now [14:04:08] No deployments scheduled for the next 1 hour(s) and 55 minute(s) [14:07:35] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [14:07:41] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [cookbooks] - 10https://gerrit.wikimedia.org/r/971163 (owner: 10Jbond) [14:08:55] (03CR) 10FNegri: [C: 03+1] "Nice abstraction, I like this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/971158 (owner: 10Majavah) [14:14:00] (03PS23) 10Bking: rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [14:14:23] PROBLEM - SSH on stat1005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:14:53] (03PS1) 10Jbond: sre.puppet.migrate-role: also exclude bullseye hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/971204 (https://phabricator.wikimedia.org/T349619) [14:15:16] (03PS4) 10Majavah: openstack: add wrapper for the common patch pattern [puppet] - 10https://gerrit.wikimedia.org/r/971158 [14:15:30] (03CR) 10Majavah: openstack: add wrapper for the common patch pattern (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/971158 (owner: 10Majavah) [14:15:37] !log Restarting CI Jenkins for plugins adjustments [14:15:37] RECOVERY - SSH on stat1005 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [14:15:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:42] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:openstack: galera: fix ordering issue [puppet] - 10https://gerrit.wikimedia.org/r/971159 (owner: 10Majavah) [14:22:48] (03Abandoned) 10Ssingh: dnsdist: update configuration file for version comment [puppet] - 10https://gerrit.wikimedia.org/r/956466 (owner: 10Ssingh) [14:24:05] (03PS1) 10Dzahn: Revert "admin: remove old ssh key from user dzahn" [puppet] - 10https://gerrit.wikimedia.org/r/971167 [14:25:06] (03PS2) 10Cathal Mooney: Move mr1-codfw OSPF interface to et-1/1/5 on CRs after migration [homer/public] - 10https://gerrit.wikimedia.org/r/971197 (https://phabricator.wikimedia.org/T347191) [14:26:04] (03CR) 10FNegri: [C: 03+1] openstack: add wrapper for the common patch pattern (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971158 (owner: 10Majavah) 
[14:26:14] (03CR) 10FNegri: [C: 03+1] openstack: add wrapper for the common patch pattern (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971158 (owner: 10Majavah) [14:27:32] (03PS6) 10Fabfur: Basic retry mechanism for specific kafka errors [software/purged] - 10https://gerrit.wikimedia.org/r/970332 (https://phabricator.wikimedia.org/T334078) [14:31:38] 10SRE, 10Traffic: Allow purged to specify buffer length - https://phabricator.wikimedia.org/T346874 (10Fabfur) 05Stalled→03Resolved [14:32:25] !log Restarting CI Jenkins again for plugins removal [14:32:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:32:39] (03CR) 10Majavah: [C: 03+2] openstack: add wrapper for the common patch pattern [puppet] - 10https://gerrit.wikimedia.org/r/971158 (owner: 10Majavah) [14:32:50] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: installserver [14:35:17] !log jbond@cumin1001 START - Cookbook sre.puppet.migrate-role for role: netbox::standalone [14:37:06] (03PS1) 10Jbond: netbox::standalone: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971209 (https://phabricator.wikimedia.org/T349619) [14:38:18] (03CR) 10Jbond: [C: 03+2] netbox::standalone: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/971209 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [14:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:52] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/971163 (owner: 10Jbond) [14:41:28] (03PS1) 10Ssingh: dnsdist: update dnsdist.conf.erb for 1.8.x [puppet] - 10https://gerrit.wikimedia.org/r/971210 [14:42:03] (03PS2) 10Ssingh: dnsdist: update dnsdist.conf.erb for 1.8.x [puppet] - 
10https://gerrit.wikimedia.org/r/971210 [14:42:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [14:43:11] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/297/con" [puppet] - 10https://gerrit.wikimedia.org/r/971210 (owner: 10Ssingh) [14:44:25] 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10Volans) My understanding is that all those hosts have been already reimaged into their related `insetup::*` role. I'm wondering why you need to re-image them again instead of just swit... [14:45:42] !log jbond@cumin1001 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: netbox::standalone [14:45:55] (03PS2) 10Hnowlan: rest-gateway: add device-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/970823 [14:47:18] (03PS1) 10Majavah: P:openstack: drop firewall rules made obsolete by cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/971211 [14:47:21] (03CR) 10Volans: sre.puppet.migrate-role: also exclude bullseye hosts (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/971204 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [14:48:49] 10ops-eqiad, 10DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179 (10ssingh) >>! In T350179#9302032, @Volans wrote: > My understanding is that all those hosts have been already reimaged into their related `insetup::*` role. I'm wondering why you need to... 
[14:49:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:50:25] (03CR) 10Ssingh: [V: 03+1 C: 03+2] dnsdist: update dnsdist.conf.erb for 1.8.x [puppet] - 10https://gerrit.wikimedia.org/r/971210 (owner: 10Ssingh) [14:51:12] !log force agent run on A:wikidough [14:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:15] (03PS24) 10Bking: rdf-streaming-updater: update staging values [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [14:55:55] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/298/con" [puppet] - 10https://gerrit.wikimedia.org/r/971211 (owner: 10Majavah) [14:56:17] !log logstash1025 systemctl restart apache2.service T350402 [14:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:20] T350402: logstash::collector apache high cpu usage - https://phabricator.wikimedia.org/T350402 [14:56:21] (03PS2) 10Jbond: sre.puppet.migrate-role: also exclude buster hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/971204 (https://phabricator.wikimedia.org/T349619) [14:57:43] (03CR) 10Jbond: sre.puppet.migrate-role: also exclude buster hosts (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/971204 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [14:57:50] (03PS3) 10Jbond: sre.puppet.migrate-role: also exclude buster hosts [cookbooks] - 
10https://gerrit.wikimedia.org/r/971204 (https://phabricator.wikimedia.org/T349619) [14:58:14] (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/970367 (https://phabricator.wikimedia.org/T336385) (owner: 10Hnowlan) [14:59:22] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 2 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [15:00:49] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:02:56] !log sukhe@cumin2002 START - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns rolling restart_daemons on A:wikidough and A:wikidough [15:03:45] (JobUnavailable) firing: (2) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:04:16] 10SRE, 10ops-eqiad, 10Cloud-VPS, 10DC-Ops, 10cloud-services-team (FY2023/2024-Q1): cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643 (10Jclark-ctr) [15:04:33] (JobUnavailable) firing: (3) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:04:44] (03CR) 10Hnowlan: [C: 03+2] trafficserver: route edit-analytics service via rest-gateway [puppet] - 10https://gerrit.wikimedia.org/r/970367 (https://phabricator.wikimedia.org/T336385) (owner: 10Hnowlan) [15:07:14] (03CR) 10Hashar: [C: 03+1] profile::ci::php Also add the icu67 component following what was done for prod [puppet] - 10https://gerrit.wikimedia.org/r/971195 (https://phabricator.wikimedia.org/T345561) (owner: 10Muehlenhoff) [15:09:33] 
(JobUnavailable) firing: (4) Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:12:21] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:12:36] BGP alerts expected, keeping an eye on them [15:16:31] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:17:35] !log cp4037 depooled to be used as canary for https://gerrit.wikimedia.org/r/c/operations/puppet/+/966221/ [15:17:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:02] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (10Jclark-ctr) Servers have been boxed up and shipped out [15:18:45] (JobUnavailable) firing: (4) Reduced availability for job wikidough in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:18:50] (03CR) 10Fabfur: [C: 03+2] haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [15:18:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:39] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/971204 
(https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:19:44] (03CR) 10Fabfur: [C: 03+2] haproxy: remove multiple backends choice [puppet] - 10https://gerrit.wikimedia.org/r/967173 (https://phabricator.wikimedia.org/T349287) (owner: 10Fabfur) [15:22:45] (03CR) 10FNegri: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/971141 (owner: 10Majavah) [15:22:55] (03PS24) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [15:23:28] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10Jclark-ctr) [15:23:36] (03CR) 10Majavah: [C: 03+2] prometheus: ipmi_exporter: add explicit dependency for sudo rule [puppet] - 10https://gerrit.wikimedia.org/r/971141 (owner: 10Majavah) [15:23:45] (JobUnavailable) firing: (5) Reduced availability for job wikidough in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:31] (03CR) 10Fabfur: [V: 03+2] haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [15:24:33] (JobUnavailable) firing: (6) Reduced availability for job wikidough in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:24:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10cloud-services-team (Hardware): Q1:rack/setup/install cloudcontrol100[8-10]-dev cloudnet100[7-8]-dev - https://phabricator.wikimedia.org/T342455 (10Jclark-ctr) @Andrew @cmooney dc ops is finished with our side [15:26:48] !log otto@deploy2002 helmfile 
[staging] START helmfile.d/services/eventgate-analytics: apply [15:26:49] 10SRE, 10Traffic, 10Patch-For-Review: HAProxy should use a single backend for Vanish - https://phabricator.wikimedia.org/T349287 (10Fabfur) The change has been deployed on cp4037.ulsfo.wmnet as test host [15:27:02] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [15:27:49] 10SRE, 10RESTBase: RESTBase page summary is not functional on test.wikipedia.org - https://phabricator.wikimedia.org/T350349 (10JMeybohm) [15:30:15] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:51] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:33:39] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:33:51] ^ expected, intermittent [15:34:22] 10SRE, 10RESTBase: RESTBase page summary is not functional on test.wikipedia.org - https://phabricator.wikimedia.org/T350349 (10Urbanecm_WMF) [15:34:29] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: ipmi_exporter: remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/971164 (owner: 10Majavah) [15:34:38] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [15:34:50] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [15:34:58] (03CR) 10Jbond: [C: 03+2] sre.puppet.migrate: disable the puppet timer for the run [cookbooks] - 10https://gerrit.wikimedia.org/r/971163 (owner: 10Jbond) [15:35:05] (03CR) 10Jbond: [C: 03+2] sre.puppet.migrate-role: also exclude buster hosts [cookbooks] - 
10https://gerrit.wikimedia.org/r/971204 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:35:16] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: mysqld_exporter: fix puppet ordering issue [puppet] - 10https://gerrit.wikimedia.org/r/971198 (owner: 10Majavah) [15:37:06] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:39:04] (03Merged) 10jenkins-bot: sre.puppet.migrate: disable the puppet timer for the run [cookbooks] - 10https://gerrit.wikimedia.org/r/971163 (owner: 10Jbond) [15:39:15] (03PS1) 10Bking: admin_ng: Activate flink-operator for rdf-streaming-updater [deployment-charts] - 10https://gerrit.wikimedia.org/r/971221 (https://phabricator.wikimedia.org/T349095) [15:39:42] (03Merged) 10jenkins-bot: sre.puppet.migrate-role: also exclude buster hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/971204 (https://phabricator.wikimedia.org/T349619) (owner: 10Jbond) [15:40:10] !log cp4037 repooling with changes for dedicated healthcheck backend (haproxy): https://gerrit.wikimedia.org/r/c/operations/puppet/+/966221/ (T348851) [15:40:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:14] T348851: Add custom HAProxy backend only for healthchecks - https://phabricator.wikimedia.org/T348851 [15:41:06] (03PS25) 10Bking: rdf-streaming-updater: update values for application mode [deployment-charts] - 10https://gerrit.wikimedia.org/r/967229 (https://phabricator.wikimedia.org/T349095) [15:45:03] !log sukhe@cumin2002 END (PASS) - Cookbook sre.dns.roll-restart-reboot-wikimedia-dns (exit_code=0) rolling restart_daemons on A:wikidough and A:wikidough [15:46:19] 10SRE, 10Traffic: Refactoring and some other work on purged - 
https://phabricator.wikimedia.org/T350396 (10Fabfur) p:05Triage→03Low [15:48:44] !log sudo cumin 'O:prometheus' 'run-puppet-agent' [15:48:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:31] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [15:50:56] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [15:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [15:51:39] !log eventgate-analytics in eqiad: setting service-runner num_workers: 0 to run with one process and reduce # of threads used by container processes. Should reduce throttling and perhaps help with latency. If works, will make this the default in the chart. - T347477 [15:51:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:43] T347477: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 [15:53:05] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (10Papaul) @cmooney the order works for me [15:55:03] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (10cmooney) Row A Steps Detail: P53131 Row B Steps Detail: P53132 [15:57:28] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [15:57:31] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [15:59:15] 10SRE, 10Traffic, 10GitLab (Project Migration): Move purged repository from Gerrit to GitLab - 
https://phabricator.wikimedia.org/T346305 (10Fabfur) 05Open→03Invalid Not needed (please refer to T347623 for a list of traffic repositories that need to be migrated to GitLab) [16:00:05] jbond and rzl: It is that lovely time of the day again! You are hereby commanded to deploy Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231102T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:01:28] (SwiftObjectCountSiteDisparity) firing: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [16:02:18] (03PS1) 10Ottomata: eventgate-analytics - update precache schema URI versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/971222 [16:05:57] (03PS1) 10JMeybohm: Build php7.4 images with icu67 packages [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/971223 (https://phabricator.wikimedia.org/T345561) [16:06:41] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [16:07:10] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [16:07:28] (03CR) 10BPirkle: "This change is ready for review."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/968384 (https://phabricator.wikimedia.org/T348731) (owner: 10BPirkle) [16:07:36] (03CR) 10BPirkle: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968384 (https://phabricator.wikimedia.org/T348731) (owner: 10BPirkle) [16:07:44] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics - update precache schema URI versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/971222 (owner: 10Ottomata) [16:08:26] (03PS5) 10JMeybohm: Revert "Enable icu67 component on mwdebug1001" [puppet] - 10https://gerrit.wikimedia.org/r/968660 (https://phabricator.wikimedia.org/T345561) [16:08:32] (03Merged) 10jenkins-bot: eventgate-analytics - update precache schema URI versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/971222 (owner: 10Ottomata) [16:10:08] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/299/console" [puppet] - 10https://gerrit.wikimedia.org/r/968660 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [16:10:20] (03CR) 10JMeybohm: [V: 03+1 C: 03+2] Revert "Enable icu67 component on mwdebug1001" [puppet] - 10https://gerrit.wikimedia.org/r/968660 (https://phabricator.wikimedia.org/T345561) (owner: 10JMeybohm) [16:12:55] (03PS1) 10Elukey: changeprop: set num_workers to zero [deployment-charts] - 10https://gerrit.wikimedia.org/r/971225 (https://phabricator.wikimedia.org/T348950) [16:16:35] (03PS1) 10Hnowlan: trafficserver: correct pathing for some AQS routes [puppet] - 10https://gerrit.wikimedia.org/r/971226 (https://phabricator.wikimedia.org/T336385) [16:19:22] (03CR) 10Elukey: Initial commit of kube-state-metrics chart from prometheus-community (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/970425 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [16:19:24] (03CR) 10Hnowlan: "One caveat to this change - the pageviews endpoint on 
the rest gateway requires that the Host header be set to wikimedia.org. Would that b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968384 (https://phabricator.wikimedia.org/T348731) (owner: 10BPirkle) [16:20:17] (03PS1) 10Fabfur: haproxy: enable healthcheck-dedicated backend in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/971228 (https://phabricator.wikimedia.org/T348851) [16:21:11] (03CR) 10Vgutierrez: [C: 03+1] haproxy: enable healthcheck-dedicated backend in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/971228 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [16:21:12] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: Migrate atlas-codfw from asw-a1-codfw to lsw1-a1-codfw - https://phabricator.wikimedia.org/T348159 (10Papaul) @cmooney cable is in place connected to lasw1-a2-codfw ge-0/0/46 ID 00756 [16:22:35] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (NOOP 2 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/971228 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [16:25:26] (03CR) 10Fabfur: [V: 03+1 C: 03+2] haproxy: enable healthcheck-dedicated backend in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/971228 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [16:26:14] !log haproxy: this change https://gerrit.wikimedia.org/r/c/operations/puppet/+/971228 will be propagated soon to all cp-ulsfo hosts (T348851) [16:26:14] (03CR) 10JMeybohm: Initial commit of kube-state-metrics chart from prometheus-community (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/970425 (https://phabricator.wikimedia.org/T264625) (owner: 10Kamila Součková) [16:26:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:18] T348851: Add custom HAProxy backend only for healthchecks - https://phabricator.wikimedia.org/T348851 [16:27:14] (03CR) 10Fabfur: [C: 03+1] "ok 
for me!" [puppet] - 10https://gerrit.wikimedia.org/r/971226 (https://phabricator.wikimedia.org/T336385) (owner: 10Hnowlan) [16:27:54] (03CR) 10Hnowlan: [C: 03+2] trafficserver: correct pathing for some AQS routes [puppet] - 10https://gerrit.wikimedia.org/r/971226 (https://phabricator.wikimedia.org/T336385) (owner: 10Hnowlan) [16:29:33] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [16:29:47] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [16:30:29] !log eventgate-analytics in codfw: setting service-runner num_workers: 0 to run with one process and reduce # of threads used by container processes. Should reduce throttling and perhaps help with latency. If works, will make this the default in the chart. - T347477 [16:30:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:35] !log eventgate-analytics-external: setting service-runner num_workers: 0 to run with one process and reduce # of threads used by container processes. Should reduce throttling and perhaps help with latency. If works, will make this the default in the chart. 
- T347477 [16:30:38] T347477: eventgate: eventstreams: update nodejs and OS - https://phabricator.wikimedia.org/T347477 [16:30:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:30:48] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [16:31:07] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [16:35:25] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [16:35:43] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [16:38:45] (Primary outbound port utilisation over 80% #page) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:38:45] (Primary outbound port utilisation over 80% #page) firing: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:38:45] (Primary inbound port utilisation over 80% #page) firing: Alert for device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:38:50] (Primary inbound port utilisation over 80% #page) firing: Alert for device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:39:13] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [16:39:17] andrewbogott: file storage rebalancing again?
[16:39:31] acking for now [16:39:34] yeah. Apparently if I drain one node while filling another it really freaks out [16:39:36] ccccccktekkjdlifhgflnctfjcibentgnnjchchekneg [16:39:41] (sorry) [16:39:50] cat on keyboard? [16:39:56] yubikey .) [16:40:01] !log depool cp4051 [16:40:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:40:09] it's time for "cat or yubikey" quiz! [16:40:13] sorry jynus, and thanks [16:40:14] cat on yubikey (a me sized cat) [16:40:16] anyone looking at the broken icinga config? [16:40:25] volans: just checked [16:40:28] I can look at Icinga if it's helpful [16:40:35] Error: 'deployment-sessionstore04' is not a valid parent for host 'cloudvirt1046' (file '/etc/icinga/objects/puppet_hosts.cfg', line 6545)! [16:40:37] andrewbogott: no worries- there is no better new than no impact when attending a page :-D [16:40:41] *news [16:41:23] although I am guessing some throttling there could be useful to not saturate the link (?), long term [16:42:57] uhh. 
I think the icinga parent comes from lldpd [16:43:34] Nov 02 16:42:34 cloudvirt1046 lldpd[827743]: unable to send packet on real device for tapc00b99b9-d0: No buffer space available [16:43:45] (Primary outbound port utilisation over 80% #page) resolved: Device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:43:45] (Primary outbound port utilisation over 80% #page) resolved: Device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:43:45] (Primary inbound port utilisation over 80% #page) resolved: Device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:43:50] (Primary inbound port utilisation over 80% #page) resolved: Device cloudsw1-e4-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:45:22] I wonder if andrewbogott's ceph rebalancing is DOSing the switch's ability to respond to lldp? [16:45:54] I don't know, but certainly those things tend to happen when things are saturated or close to [16:46:11] connections fail, very high packet errors, etc [16:47:13] unfortunately I can't stop the current commotion but I will give it a nice long break before I do anything else [16:47:38] restarting lldpd and running puppet seems to have been enough to fix the icinga config issue at least for now [16:47:41] I wonder why the message is duplicate- if we have 2 instances of the same irc bot running or something? 
[16:48:23] as there was, that I can see, only 2 alerts [16:48:45] (JobUnavailable) firing: (6) Reduced availability for job wikidough in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:49:09] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:49:15] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [16:51:29] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:52:42] (03PS1) 10BryanDavis: developer-portal: Bump container to 2023-11-02-122447-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/971236 [16:53:46] (JobUnavailable) firing: (6) Reduced availability for job wikidough in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:54:50] (03PS1) 10BryanDavis: toolhub: Bump container to 2023-11-02-122223-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/971238 [16:55:14] (03CR) 10DLynch: [C: 03+1] Turn off DiscussionTools A/B test, and enable features on those wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/954920 (https://phabricator.wikimedia.org/T341491) (owner: 10Esanders) [16:55:50] (03CR) 10BryanDavis: [C: 03+2] developer-portal: Bump container to 2023-11-02-122447-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/971236 (owner: 10BryanDavis) [16:56:37] (03Merged) 
10jenkins-bot: developer-portal: Bump container to 2023-11-02-122447-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/971236 (owner: 10BryanDavis) [16:56:41] (03CR) 10BryanDavis: [C: 03+2] toolhub: Bump container to 2023-11-02-122223-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/971238 (owner: 10BryanDavis) [16:57:27] (03PS1) 10Majavah: P:openstack: fix fernet key rotation in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/971240 [16:57:29] (03PS1) 10Majavah: openstack: replace openstack_controllers variable [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) [16:57:33] (03Merged) 10jenkins-bot: toolhub: Bump container to 2023-11-02-122223-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/971238 (owner: 10BryanDavis) [16:57:36] !log fnegri@cumin1001 START - Cookbook sre.hosts.reimage for host cloudcontrol1005.eqiad.wmnet with OS bookworm [16:57:57] (03CR) 10CI reject: [V: 04-1] P:openstack: fix fernet key rotation in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/971240 (owner: 10Majavah) [16:58:12] (03CR) 10CI reject: [V: 04-1] openstack: replace openstack_controllers variable [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [16:59:33] (JobUnavailable) firing: (6) Reduced availability for job wikidough in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:59:51] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231102T1700) [17:00:10] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:00:51] !log bd808@deploy2002 helmfile
[codfw] START helmfile.d/services/developer-portal: apply [17:00:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:01:18] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [17:01:51] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:02:18] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:03:45] (JobUnavailable) firing: (6) Reduced availability for job wikidough in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:04:15] ^ this should be resolving, if not, I will look at it after lunch [17:05:02] !log cmooney@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 13 hosts with reason: Move row A/B CR uplinks to SPINE switches [17:05:26] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 13 hosts with reason: Move row A/B CR uplinks to SPINE switches [17:05:35] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0a8384b5-aa0d-44df-bf5c-aa9e191ed... 
[17:06:35] (03PS2) 10Majavah: P:openstack: fix fernet key rotation in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/971240 [17:06:37] (03PS2) 10Majavah: openstack: replace openstack_controllers variable [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) [17:06:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:06:44] !log shutting down uplink from asw-a-codfw et-7/0/52 to cr2-codfw et-1/0/0 (T347191) [17:06:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:06:59] T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 [17:07:03] (03PS1) 10Ottomata: eventgate chart - set default service-runner num_workers to 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/971243 (https://phabricator.wikimedia.org/T347477) [17:07:22] (03CR) 10CI reject: [V: 04-1] openstack: replace openstack_controllers variable [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [17:08:45] (JobUnavailable) firing: (5) Reduced availability for job wikidough in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:09:33] (JobUnavailable) resolved: (5) Reduced availability for job wikidough in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:10:02] !log bd808@deploy2002 helmfile [staging] START helmfile.d/services/toolhub: apply [17:11:25] PROBLEM - Host mwdebug2001 is DOWN: PING CRITICAL - 
Packet loss = 100% [17:11:27] PROBLEM - Host netboxdb2002 is DOWN: PING CRITICAL - Packet loss = 100% [17:11:37] RECOVERY - Host netboxdb2002 is UP: PING OK - Packet loss = 0%, RTA = 33.44 ms [17:11:37] RECOVERY - Host mwdebug2001 is UP: PING OK - Packet loss = 0%, RTA = 35.41 ms [17:11:42] !log bd808@deploy2002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [17:11:49] !log fnegri@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcontrol1005.eqiad.wmnet with reason: host reimage [17:12:14] (03PS1) 10Physikerwelt: Enable native math rendering mode on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971244 [17:12:15] PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: session-c107960.scope,session-c107961.scope,session-c107962.scope,session-c107963.scope,session-c107964.scope https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:12:29] !log bd808@deploy2002 helmfile [codfw] START helmfile.d/services/toolhub: apply [17:12:40] (03PS3) 10Majavah: openstack: replace openstack_controllers variable [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) [17:13:20] !log bd808@deploy2002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [17:13:35] !log bd808@deploy2002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [17:14:01] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:31] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcontrol1005.eqiad.wmnet with reason: host reimage [17:14:40] !log bd808@deploy2002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [17:15:41] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:33] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:16:34] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/304/con" [puppet] - 10https://gerrit.wikimedia.org/r/971240 (owner: 10Majavah) [17:16:38] (SwiftObjectCountSiteDisparity) resolved: MediaWiki swift object counts site diffs - https://wikitech.wikimedia.org/wiki/Swift/How_To - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift - https://alerts.wikimedia.org/?q=alertname%3DSwiftObjectCountSiteDisparity [17:16:41] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/305/con" [puppet] - 10https://gerrit.wikimedia.org/r/971211 (owner: 10Majavah) [17:17:09] PROBLEM - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [17:17:41] PROBLEM - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [250.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring 
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50 [17:19:03] !log restart haproxy on cp4051 [17:19:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:23] PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [17:21:13] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:21:19] RECOVERY - CirrusSearch full_text codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [17:21:25] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [17:21:29] (03CR) 10Majavah: [C: 03+2] prometheus: mysqld_exporter: fix puppet ordering issue [puppet] - 10https://gerrit.wikimedia.org/r/971198 (owner: 10Majavah) [17:21:29] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2039:9105 is missing kubernetes logs - 
https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2039 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:21:41] (03PS2) 10Majavah: prometheus: ipmi_exporter: remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/971164 [17:21:57] (03CR) 10Majavah: [C: 03+2] prometheus: ipmi_exporter: remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/971164 (owner: 10Majavah) [17:22:00] (03CR) 10Majavah: [V: 03+2 C: 03+2] prometheus: ipmi_exporter: remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/971164 (owner: 10Majavah) [17:23:19] RECOVERY - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [100.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50 [17:23:29] !log depool cp5030 [17:23:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:39] RECOVERY - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39 [17:24:45] (03CR) 10FNegri: [C: 03+1] "Looks good, would be nice to test it in codfw first, but I guess in the worst case if something breaks we can revert it :)" [puppet] - 10https://gerrit.wikimedia.org/r/971211 (owner: 10Majavah) [17:25:09] (03PS4) 10Majavah: openstack: replace openstack_controllers variable [puppet] - 10https://gerrit.wikimedia.org/r/971241 
(https://phabricator.wikimedia.org/T347554) [17:25:45] (03CR) 10CI reject: [V: 04-1] openstack: replace openstack_controllers variable [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [17:26:36] (03CR) 10Majavah: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [17:27:43] (03PS1) 10Ebernhardson: cirrus updater: Update container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/971252 [17:27:55] (03CR) 10Cathal Mooney: [C: 03+2] Move mr1-codfw OSPF interface to et-1/1/5 on CRs after migration [homer/public] - 10https://gerrit.wikimedia.org/r/971197 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [17:28:06] (03PS1) 10Vgutierrez: hiera: Disable haproxy limit-by-path experiments on cp4041 and cp5030 [puppet] - 10https://gerrit.wikimedia.org/r/971253 (https://phabricator.wikimedia.org/T317799) [17:29:53] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 21 CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [17:30:00] (03PS2) 10Physikerwelt: Enable native math rendering mode on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971244 (https://phabricator.wikimedia.org/T311620) [17:30:34] (03CR) 10FNegri: "the fix for the FQDN looks fine, but why is "keystone_sync_keys_from_" always added, whereas before it was only added in the "else" branch" [puppet] - 10https://gerrit.wikimedia.org/r/971240 (owner: 10Majavah) [17:30:49] (03Merged) 10jenkins-bot: Move mr1-codfw OSPF interface to et-1/1/5 on CRs after migration [homer/public] - 10https://gerrit.wikimedia.org/r/971197 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [17:31:06] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/307/con" [puppet] - 10https://gerrit.wikimedia.org/r/971253 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [17:31:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:31:16] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2039:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2039 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [17:31:31] (03CR) 10Majavah: [V: 03+1] P:openstack: fix fernet key rotation in codfw1dev (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971240 (owner: 10Majavah) [17:31:36] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:openstack: drop firewall rules made obsolete by cloudlb [puppet] - 10https://gerrit.wikimedia.org/r/971211 (owner: 10Majavah) [17:31:51] (03CR) 10Ottomata: [C: 03+2] eventgate chart - set default service-runner num_workers to 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/971243 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [17:33:08] (03Merged) 10jenkins-bot: eventgate chart - set default service-runner num_workers to 0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/971243 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [17:33:45] (Device rebooted) firing: Alert for device ps1-a3-codfw.mgmt.codfw.wmnet - Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [17:36:57] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Update container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/971252 (owner: 10Ebernhardson) [17:37:45] (03Merged) 10jenkins-bot: cirrus updater: Update container version [deployment-charts] - 10https://gerrit.wikimedia.org/r/971252 
(owner: 10Ebernhardson) [17:38:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:19] (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/971253 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [17:38:25] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Disable haproxy limit-by-path experiments on cp4041 and cp5030 [puppet] - 10https://gerrit.wikimedia.org/r/971253 (https://phabricator.wikimedia.org/T317799) (owner: 10Vgutierrez) [17:38:57] 10SRE, 10SRE-Access-Requests: Requesting access to `discovery.processed_external_sparql_query` for AndrewTavis_WMDE - https://phabricator.wikimedia.org/T350426 (10AndrewTavis_WMDE) [17:39:01] (03PS1) 10Cathal Mooney: Fix error in interface for MR1 uplinks [homer/public] - 10https://gerrit.wikimedia.org/r/971257 (https://phabricator.wikimedia.org/T347191) [17:39:46] (03CR) 10Cathal Mooney: [C: 03+2] Fix error in interface for MR1 uplinks [homer/public] - 10https://gerrit.wikimedia.org/r/971257 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [17:40:03] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [17:40:29] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:41:10] (03Merged) 10jenkins-bot: Fix error in interface for MR1 uplinks [homer/public] - 10https://gerrit.wikimedia.org/r/971257 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [17:41:17] (03CR) 10FNegri: [C: 03+1] P:openstack: fix fernet key rotation in codfw1dev (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/971240 (owner: 10Majavah) [17:41:31] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:openstack: fix fernet key rotation in codfw1dev [puppet] - 10https://gerrit.wikimedia.org/r/971240 (owner: 10Majavah) 
[17:42:47] (03PS5) 10Majavah: openstack: replace openstack_controllers variable [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) [17:42:58] !log repool cp4051 and cp5030 [17:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:45] (Device rebooted) resolved: Device ps1-a3-codfw.mgmt.codfw.wmnet recovered from Device rebooted - https://alerts.wikimedia.org/?q=alertname%3DDevice+rebooted [17:45:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:45:15] !log fnegri@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcontrol1005.eqiad.wmnet with OS bookworm [17:45:33] !log Moving row A outbound traffic from direct CR link to routing via Spinie (T347191) [17:45:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:37] T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 [17:47:30] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 23 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [17:47:47] (03PS6) 10Majavah: openstack: replace openstack_controllers variable [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) [17:50:37] !log Shutting asw-a-codfw uplink to cr1-codfw down in advance of cable move (T347191) [17:50:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:41] T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 [17:50:41] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL 
- degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:52:09] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 22 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [17:52:23] (03PS7) 10Majavah: openstack: replace openstack_controllers variable [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) [17:56:55] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 21 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node" [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [18:00:06] dduvall and dancy: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for MediaWiki train - Utc-7 Version . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231102T1800). 
[18:00:25] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:07:18] (03PS1) 10TrainBranchBot: group2 wikis to 1.42.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971260 (https://phabricator.wikimedia.org/T348356) [18:07:20] (03CR) 10TrainBranchBot: [C: 03+2] group2 wikis to 1.42.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971260 (https://phabricator.wikimedia.org/T348356) (owner: 10TrainBranchBot) [18:07:46] !log Making cr1-codfw VRRP Master for row A traffic again on ssw1-a1-codfw interface (T347191) [18:07:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:04] T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 [18:08:17] (03Merged) 10jenkins-bot: group2 wikis to 1.42.0-wmf.3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971260 (https://phabricator.wikimedia.org/T348356) (owner: 10TrainBranchBot) [18:09:04] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [18:09:11] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:11:39] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:12:25] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:12:53] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.262 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:13:39] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.073 second response time 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:15:01] (03PS1) 10Cathal Mooney: Change definitions for OSPF stub interfaces codfw CRs [homer/public] - 10https://gerrit.wikimedia.org/r/971263 (https://phabricator.wikimedia.org/T347191) [18:16:31] (03CR) 10Cathal Mooney: [C: 03+2] Change definitions for OSPF stub interfaces codfw CRs [homer/public] - 10https://gerrit.wikimedia.org/r/971263 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [18:17:08] (03Merged) 10jenkins-bot: Change definitions for OSPF stub interfaces codfw CRs [homer/public] - 10https://gerrit.wikimedia.org/r/971263 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [18:18:42] (03PS1) 10Cathal Mooney: Enable DHCP relay on et-1/1/5 subints following switch move [homer/public] - 10https://gerrit.wikimedia.org/r/971264 (https://phabricator.wikimedia.org/T347191) [18:21:03] !log Shutting asw-b-codfw uplink to cr2-codfw down in advance of cable move (T347191) [18:21:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:16] T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 [18:22:04] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host doh1001.wikimedia.org with OS bookworm [18:22:22] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group2 wikis to 1.42.0-wmf.3 refs T348356 [18:22:25] T348356: 1.42.0-wmf.3 deployment blockers - https://phabricator.wikimedia.org/T348356 [18:25:47] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:25:51] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:28:45] (JobUnavailable) firing: Reduced availability for job wikidough in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:31:14] ^ expected, reimaging [18:32:00] !log sukhe@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on doh1001.wikimedia.org with reason: host reimage [18:34:45] Hi, when is next run of updating special pages for Serbian Wikipedia? [18:35:05] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on doh1001.wikimedia.org with reason: host reimage [18:38:00] I've created https://phabricator.wikimedia.org/T350431, please check. :) [18:43:45] (JobUnavailable) resolved: Reduced availability for job wikidough in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:44:29] !log Making cr2-codfw VRRP Master for row B traffic over new link from ssw1-a8-codfw (T347191) [18:44:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:33] T347191: Bring codfw row A-B EVPN switches live and make them gateway for existing Vlans - https://phabricator.wikimedia.org/T347191 [18:46:08] !log shutting down uplink from asw-b-codfw et-2/0/51 to cr1-codfw in advance of cable move (T347191) [18:46:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:35] (03CR) 10Stegmujo: Enable native math rendering mode on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971244 (https://phabricator.wikimedia.org/T311620) (owner: 10Physikerwelt) [18:49:53] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:01] RECOVERY - BFD status on cr2-eqiad is OK: UP: 19 AdminDown: 0 Down: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:51:09] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:52:37] !log sukhe@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host doh1001.wikimedia.org with OS bookworm [18:52:47] (03CR) 10Eevans: [C: 03+2] cassandra: add grants for new mobileapps tables [puppet] - 10https://gerrit.wikimedia.org/r/970848 (https://phabricator.wikimedia.org/T348993) (owner: 10Eevans) [18:56:37] (03CR) 10Physikerwelt: Enable native math rendering mode on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971244 (https://phabricator.wikimedia.org/T311620) (owner: 10Physikerwelt) [18:56:41] (03PS1) 10Eevans: Revert "cassandra: add grants for new mobileapps tables" [puppet] - 10https://gerrit.wikimedia.org/r/971286 [18:57:34] (03CR) 10Eevans: [C: 03+2] Revert "cassandra: add grants for new mobileapps tables" [puppet] - 10https://gerrit.wikimedia.org/r/971286 (owner: 10Eevans) [18:58:17] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:58:25] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [18:58:47] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:00:09] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:00:50] (03PS1) 10Eevans: cassandra: add grants for new mobileapps tables (redux) [puppet] - 10https://gerrit.wikimedia.org/r/971269 (https://phabricator.wikimedia.org/T348993) [19:02:25] RECOVERY - mailman list info ssl expiry on lists1001 
is OK: OK - Certificate lists.wikimedia.org will expire on Sun 17 Dec 2023 03:07:37 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:02:35] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 8.810 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:02:47] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50860 bytes in 0.095 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:03:18] (03PS1) 10Cathal Mooney: Change OSPF stub ints and dhcp relay on CRs for codfw row B [homer/public] - 10https://gerrit.wikimedia.org/r/971271 (https://phabricator.wikimedia.org/T347191) [19:05:13] PROBLEM - VRRP status on cr1-codfw is CRITICAL: VRRP CRITICAL - 4 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [19:05:59] (03PS2) 10Cathal Mooney: Change OSPF stub ints and dhcp relay on CRs for codfw row B [homer/public] - 10https://gerrit.wikimedia.org/r/971271 (https://phabricator.wikimedia.org/T347191) [19:06:37] RECOVERY - VRRP status on cr1-codfw is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [19:07:07] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:07:18] (03CR) 10Cathal Mooney: [C: 03+2] Change OSPF stub ints and dhcp relay on CRs for codfw row B [homer/public] - 10https://gerrit.wikimedia.org/r/971271 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [19:15:00] (03CR) 10Stegmujo: [C: 03+1] Enable native math rendering mode on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971244 (https://phabricator.wikimedia.org/T311620) (owner: 10Physikerwelt) [19:15:27] RECOVERY - Check systemd 
state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:16:51] (03CR) 10Cathal Mooney: [C: 03+2] Change OSPF stub ints and dhcp relay on CRs for codfw row B (031 comment) [homer/public] - 10https://gerrit.wikimedia.org/r/971271 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [19:17:20] (03CR) 10Daniel Kinzler: Enable native math rendering mode on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971244 (https://phabricator.wikimedia.org/T311620) (owner: 10Physikerwelt) [19:17:57] I've asked dduvall about https://phabricator.wikimedia.org/T350431, but he told me that it's better to move the discussion here to the channel. [19:18:07] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [19:18:19] (03CR) 10BPirkle: [C: 04-1] "Per Hugh's comment regarding Host header, this change is likely insufficient. Giving it a -1 for now until we sort that out." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968384 (https://phabricator.wikimedia.org/T348731) (owner: 10BPirkle) [19:18:52] (03PS1) 10Cathal Mooney: Make cr1-codfw VRRP primary for rows A and B after link move [homer/public] - 10https://gerrit.wikimedia.org/r/971273 (https://phabricator.wikimedia.org/T347191) [19:18:59] Now that https://gerrit.wikimedia.org/r/c/mediawiki/core/+/969428 has reached Serbian Wikipedia, conversion in the Serbian language is working much better, but things aren't fully updated; some maintenance scripts probably still need to be run. [19:20:33] And Libera.chat has kicked me again... [19:20:34] For example: Special:LintErrors has a bogus images section, and it states that some parameters are invalid, but now they are valid. 
MediaWiki can now recognize Serbian (Latin script) parameters for images, so I'm unsure whether I should run Pywikibot's touch.py script to refresh links and such things, or whether it is better to run refreshLinks.php [19:20:35] MediaWiki's maintenance script with the --dfn-only option. [19:22:47] thanks, Kizule. i'm asking around in #mediawiki_security as well. if i can get a go ahead from just one other knowledgeable person, i will run them. i just need confirmation first as i'm not that experienced with the mentioned scripts and what kind of load will be incurred [19:26:46] dduvall: Thanks, I don't actually know where I would start. :D [19:26:58] (03CR) 10Cathal Mooney: [C: 03+2] Enable DHCP relay on et-1/1/5 subints following switch move [homer/public] - 10https://gerrit.wikimedia.org/r/971264 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [19:27:17] (03CR) 10Cathal Mooney: [C: 03+2] Make cr1-codfw VRRP primary for rows A and B after link move [homer/public] - 10https://gerrit.wikimedia.org/r/971273 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [19:27:39] (03Merged) 10jenkins-bot: Enable DHCP relay on et-1/1/5 subints following switch move [homer/public] - 10https://gerrit.wikimedia.org/r/971264 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [19:27:41] (03Merged) 10jenkins-bot: Change OSPF stub ints and dhcp relay on CRs for codfw row B [homer/public] - 10https://gerrit.wikimedia.org/r/971271 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [19:27:53] (03Merged) 10jenkins-bot: Make cr1-codfw VRRP primary for rows A and B after link move [homer/public] - 10https://gerrit.wikimedia.org/r/971273 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [19:28:00] (03PS3) 10Physikerwelt: Enable native math rendering mode on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971244 (https://phabricator.wikimedia.org/T311620) [19:37:36] (03PS1) 10Cathal Mooney: 
Remove temp filter stopping codfw ssw's announcing row a/b networks [homer/public] - 10https://gerrit.wikimedia.org/r/971276 (https://phabricator.wikimedia.org/T347191) [19:39:12] (03CR) 10Cathal Mooney: [C: 03+2] Remove temp filter stopping codfw ssw's announcing row a/b networks [homer/public] - 10https://gerrit.wikimedia.org/r/971276 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [19:39:33] (03CR) 10Eevans: [C: 03+2] cassandra: add grants for new mobileapps tables (redux) [puppet] - 10https://gerrit.wikimedia.org/r/971269 (https://phabricator.wikimedia.org/T348993) (owner: 10Eevans) [19:40:11] (03Merged) 10jenkins-bot: Remove temp filter stopping codfw ssw's announcing row a/b networks [homer/public] - 10https://gerrit.wikimedia.org/r/971276 (https://phabricator.wikimedia.org/T347191) (owner: 10Cathal Mooney) [19:44:23] PROBLEM - Check systemd state on aqs1010 is CRITICAL: CRITICAL - degraded: The following units failed: aqs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:44:45] (03PS1) 10Eevans: cassandra_dev: add mediawiki_services_mobileapps role & grants [puppet] - 10https://gerrit.wikimedia.org/r/971278 (https://phabricator.wikimedia.org/T348993) [19:49:26] 10SRE, 10SRE-Access-Requests: Requesting access to root for dzahn - https://phabricator.wikimedia.org/T350435 (10Dzahn) [19:50:08] 10SRE, 10SRE-Access-Requests: Requesting access to root for dzahn - https://phabricator.wikimedia.org/T350435 (10Dzahn) The request is to revert https://gerrit.wikimedia.org/r/c/operations/puppet/+/934634 This should be merged to resolve it please: https://gerrit.wikimedia.org/r/c/operations/puppet/+/971167 [19:50:15] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - 
https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:51:22] (03CR) 10Eevans: [C: 03+2] cassandra_dev: add mediawiki_services_mobileapps role & grants [puppet] - 10https://gerrit.wikimedia.org/r/971278 (https://phabricator.wikimedia.org/T348993) (owner: 10Eevans) [19:52:14] 3]KGQ|PxtDD8 [19:52:29] welp, there's a password burned. :) [19:52:33] (03CR) 10RhinosF1: [C: 03+1] "welcome back Daniel!" [puppet] - 10https://gerrit.wikimedia.org/r/971167 (owner: 10Dzahn) [19:52:39] thanks focus-follows-mouse. [19:52:46] (03PS1) 10Ejegg: Allow crawling FundraiserLandingPage in robots.txt [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971281 (https://phabricator.wikimedia.org/T254808) [19:53:54] brennen: :D been there [19:54:06] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:54:21] i'm still waiting for that focus-follows-eyes feature after all these years. [19:54:50] i think that'd get me into more trouble on average. [19:59:23] !log cmooney@cumin1001 START - Cookbook sre.dns.netbox [20:00:06] brennen, TheresNoTime, and thcipriani: Dear deployers, time to do the UTC late backport and config training deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231102T2000). [20:00:06] physikerwelt and Kizule: A patch you scheduled for UTC late backport and config training is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:15] RECOVERY - Check systemd state on aqs1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:01:34] !log cmooney@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS entries for codfw CR IPs moved to new interfaces. - cmooney@cumin1001" [20:02:23] !log cmooney@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update DNS entries for codfw CR IPs moved to new interfaces. - cmooney@cumin1001" [20:02:23] !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:02:57] o/ [20:03:07] \o [20:05:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:07:06] @physikerweltare you around for [config] 971244 ? [20:07:47] may need physikerwelt as a standalone string for the ping [20:08:15] yes. Could you have a look if the config change is technically ok, before deploying. 
I have never worked with this kind of config before [20:09:01] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (DIFF 24): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/311/console" [puppet] - 10https://gerrit.wikimedia.org/r/971241 (https://phabricator.wikimedia.org/T347554) (owner: 10Majavah) [20:10:02] physikerwelt: I'm reading the reviews now [20:10:03] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.305 second response time https://wikitech.wikimedia.org/wiki/Swift [20:11:21] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Swift [20:11:52] (03CR) 10RhinosF1: [C: 03+1] Enable native math rendering mode on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971244 (https://phabricator.wikimedia.org/T311620) (owner: 10Physikerwelt) [20:12:34] physikerwelt, brennen: it looks like valid config format. I am trusting physikerwelt with understanding what it does. [20:14:54] Thank you RhinosF1. It adds one option to https://test.wikipedia.org/wiki/Special:Preferences#mw-prefsection-rendering-math [20:15:31] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:16:08] If there are three options afterwards and all the other wikis remain unchanged, it did what it was supposed to do [20:17:01] I suggest testing it works too [20:17:22] ^ +1 we'll deploy to test servers and you can check it works right [20:17:27] brennen: are you deploying? 
[20:17:35] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by mabualruz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971244 (https://phabricator.wikimedia.org/T311620) (owner: 10Physikerwelt) [20:17:52] cool [20:17:53] RhinosF1: mo_abualruz is taking deployment training this week [20:17:58] Ah I see mo_abualruz is training [20:18:01] Good luck :) [20:18:04] :D [20:18:34] (03Merged) 10jenkins-bot: Enable native math rendering mode on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/971244 (https://phabricator.wikimedia.org/T311620) (owner: 10Physikerwelt) [20:18:49] !log mabualruz@deploy2002 Started scap: Backport for [[gerrit:971244|Enable native math rendering mode on testwiki (T311620)]] [20:18:58] T311620: Update mathoid to node 16 - https://phabricator.wikimedia.org/T311620 [20:19:43] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:20:11] !log mabualruz@deploy2002 mabualruz and physikerwelt: Backport for [[gerrit:971244|Enable native math rendering mode on testwiki (T311620)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:21:36] physikerwelt can you check please with the wikimediadebug browser extension and confirm the behaviour is as intended [20:22:08] Kizule: are there specific post-deploy concerns you're trying to address with these maintenance scripts? Did something change you expect to have caused problems here? (asking since we normally run them after doing a backport that we expect will change stuff) [20:22:27] ok [20:23:44] thcipriani: There aren't problems which can break the wiki, it just needs cleanup via maintenance script. There was a bunch of things that didn't work well, while I made the patch and it got deployed. [20:24:19] And I want to make sure that there aren't surprises. 
[20:25:01] Everything works well, nothing is broken; I just want to clean up possible duplicates in the DB. [20:25:31] We should run namespaceDupes.php to make sure that there aren't any broken duplicates. [20:25:42] updateSpecialPages.php can wait, I guess. [20:25:58] it works as expected on testwiki, and I don't see a change on enwiki, which is also expected. [20:26:06] refreshLinks.php can be run if needed; I planned to use Pywikibot to go through each page and make null edits, in order to refresh links, magic words and such things. [20:26:18] But I think that refreshLinks.php is better. [20:27:31] mo_abualruz: moving forward [20:27:35] !log mabualruz@deploy2002 mabualruz and physikerwelt: Continuing with sync [20:27:53] physikerwelt: moving forward :P [20:28:31] sounds good, however, I don't understand exactly what this means :-) [20:29:11] ah, physikerwelt we're rolling out your change live to all servers now that you've verified it works on mwdebug, it'll be live everywhere in a few minutes. [20:29:51] 10SRE-swift-storage, 10API Platform, 10Commons, 10MediaWiki-File-management, and 4 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10Yann) index.php?action=raw&ctype=text/javascript&title=User:Rillke/MwJSBo... [20:30:10] great. If I understand it correctly this is the last step? [20:30:53] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:30:57] physikerwelt: yes indeed :) [20:32:55] !log mabualruz@deploy2002 Finished scap: Backport for [[gerrit:971244|Enable native math rendering mode on testwiki (T311620)]] (duration: 14m 06s) [20:32:59] T311620: Update mathoid to node 16 - https://phabricator.wikimedia.org/T311620 [20:34:14] physikerwelt: Done, can you please check on the live servers? [20:35:27] Yes. 
I realized that changes made to my user settings with mwdebug are also on the live server [20:35:40] everything looks great thank you [20:37:19] physikerwelt: you are welcome :) [20:42:21] Kizule: so looking at these scripts, it looks like refreshLinks and updateSpecialPages are both run on a cron. https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/mediawiki/maintenance/update_special_pages.pp and [20:42:23] https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/mediawiki/maintenance/refreshlinks/periodic_job.pp so I'd be fine running namespaceDupes now and letting the others run in due time, cool? [20:42:55] I got kicked again. thcipriani: Sounds good to me. [20:43:00] cool :) [20:43:07] running namespaceDupes now [20:44:32] thcipriani: Okay [20:46:49] Oh finally... [20:48:25] oh, it seems like there are a bunch, where do you want the output, in a paste on phab? [20:48:39] thcipriani: Wherever is easier for you. [20:50:26] I'm hoping that I won't get kicked again... 
[20:50:33] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:51:11] (03PS7) 10Ottomata: eventgate: Update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert) [20:51:31] (03CR) 10Ottomata: "Rebased on some recent eventgate chart changes, only had to resolve conflict on Chart.yaml version" [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert) [20:51:38] Kizule11: https://phabricator.wikimedia.org/P53134 [20:51:41] (03CR) 10CI reject: [V: 04-1] eventgate: Update mesh module [deployment-charts] - 10https://gerrit.wikimedia.org/r/959181 (https://phabricator.wikimedia.org/T345244) (owner: 10Clément Goubert) [20:52:07] thcipriani: Woah, run a --fix please. [20:53:15] Kizule: do you know what prefix we want to add? [20:53:29] thcipriani: If it has to add prefix, add BROKEN. 
[20:53:51] thcipriani: you don't need to add prefix on first run [20:54:06] 10SRE, 10Infrastructure-Foundations, 10Mail: Rspamd module - https://phabricator.wikimedia.org/T325397 (10jhathaway) 05Open→03Resolved [20:54:06] There's only 2 in that list showing can't be auto resolved [20:54:10] 10SRE, 10Infrastructure-Foundations, 10Mail: Puppetry - https://phabricator.wikimedia.org/T325395 (10jhathaway) [20:54:16] 10SRE, 10Infrastructure-Foundations, 10Mail: Puppetry - https://phabricator.wikimedia.org/T325395 (10jhathaway) [20:54:19] alright, trying with --fix [20:54:22] 10SRE, 10Infrastructure-Foundations, 10Mail: Postfix Module - https://phabricator.wikimedia.org/T325396 (10jhathaway) 05Open→03Resolved [20:54:47] thcipriani: run it dry after that fixes again so we get a smaller output to read [20:55:01] PROBLEM - Swift https frontend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.347 second response time https://wikitech.wikimedia.org/wiki/Swift [20:56:11] brennen, thcipriani so, looking at this grafana chart .. https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1&from=now-2h&to=now&var-contentModel=wikitext&var-dc=codfw&var-cache=parsoid&viewPanel=31 ... after the train rolled out everywhere, I would have expected those accesses to go to zero. [20:56:19] RECOVERY - Swift https frontend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 245 bytes in 0.130 second response time https://wikitech.wikimedia.org/wiki/Swift [20:56:38] So, I wonder if there is any process that didn't restart or if any server didn't get a code update. [20:57:16] thcipriani: You can paste outputs again in Phabricator, don't worry. :) [20:57:27] I'll link them later when I'm closing tasks anyway. [20:57:29] it is a very small number of requests to the 'parsoid' parser-cache instance (about 5-10 a minute) to be sure ... (gradually petered down from many 10s of thousands to a few hundreds and should have gone to zero now). 
[20:57:37] subbu: seems unlikely. we have in the past had a situation where something was depooled and then repooled without having gotten the update, i think? [20:58:16] subbu: is there any pattern to the requests? [20:58:36] i don't know where to look ... that graph only has the request counts. [20:59:05] (Abandoned) EoghanGaffney: doc: Add option to quickdatacopy for --ignore-missing-args [puppet] - https://gerrit.wikimedia.org/r/933864 (owner: EoghanGaffney) [20:59:18] Kizule: RhinosF1 still waiting....output from the script seems to have stopped, but script still running [20:59:24] it is a very minuscule number of requests .. so not a real problem .. but it is bugging me that it is not zero. [20:59:30] hmm, yeah [20:59:47] subbu: any chance of... something long-running? [20:59:53] thcipriani: it's got over 3000 to do. That's a lot. [21:00:27] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:00:33] brennen, hmm ... not sure if there are background job runners that are still running with old code ... i am not very familiar with this part of the infrastructure to answer that. [21:03:20] thcipriani: How is it going at this moment?
[21:03:22] :) [21:08:16] Kizule: status is the same, showed a whole bunch of page links initially, now just hanging there [21:11:50] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10CodeReviewBot) brett merged https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/2 Release 0.36-2 for Bookworm [21:12:01] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.399 second response time https://wikitech.wikimedia.org/wiki/Swift [21:12:51] subbu: spot check seems to indicate that new versions are rolled out everywhere; however, jobrunners don't get restarted, new jobs pick the new wiki version. So long running jobs may still be using the previous version. And long-running maintenance scripts may also be using the old version. Could that account for the level of traffic you're seeing? [21:12:53] thcipriani: I'm really hoping that it is doing something. [21:13:14] thcipriani, ya .. that could be it. [21:13:17] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.142 second response time https://wikitech.wikimedia.org/wiki/Swift [21:13:39] thanks .. good to at least have some plausible explanation for it. :) [21:13:41] subbu: if you have specific servers its coming from, we could confirm... [21:14:06] (which you might not if it's a counter somewhere :)) [21:14:11] I don't .. it is just a counter. [21:14:16] ya. [21:14:53] yeah, alright, I think that if you're still seeing it after a day, might be more reason to worry :) [21:15:17] got it! I'll sign off from here for now. :) [21:15:18] yeah, set a reminder to check if this is still happening in a week and investigate? [21:15:20] :) [21:15:20] thanks for looking. [21:18:10] Kizule: well. strace is telling me operation not permitted, so I have no idea why it's stalled out :\ [21:18:27] thcipriani: It got timed out probably, try again? 
[21:18:54] Rhinos1: What do you think? :) [21:19:02] RhinosF1 sorry [21:19:48] 10SRE, 10Traffic, 10GitLab (Project Migration), 10Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (10CodeReviewBot) brett opened https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/5 ci: Automatically build Debian packages [21:20:14] Kizule: not really sure [21:20:23] It should be harmless to restart [21:20:30] 10SRE, 10Traffic, 10GitLab (Project Migration), 10Patch-For-Review: Migrate Traffic repositories from Gerrit to Gitlab - https://phabricator.wikimedia.org/T347623 (10CodeReviewBot) brett closed https://gitlab.wikimedia.org/repos/sre/acme-chief/-/merge_requests/5 ci: Automatically build Debian packages [21:21:51] PROBLEM - Swift https backend on ms-fe2013 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 309 bytes in 7.344 second response time https://wikitech.wikimedia.org/wiki/Swift [21:22:29] !log bking@cumin2002 enabling elastic snapshots on eqiad clusters T348686 [21:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:33] T348686: Standardize/document Elastic snapshot configuration - https://phabricator.wikimedia.org/T348686 [21:23:09] RECOVERY - Swift https backend on ms-fe2013 is OK: HTTP OK: HTTP/1.1 200 OK - 456 bytes in 0.139 second response time https://wikitech.wikimedia.org/wiki/Swift [21:23:59] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:25:44] well. there is a wait for replicationlag in this script, but replication lag looks to be at zero per grafana [21:26:27] Looks fine in WHMCS as well. 
[21:28:59] (SwaggerProbeHasFailures) resolved: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:32:26] alright, on the hunch that it's querying in a useless loop, ctrl-c, a dry run still shows 1149 to fix [21:33:38] thcipriani: So, it actually did something? [21:33:57] yeah, there was a bunch of fixes and then it just ... stopped [21:34:18] Probably it got shocked. :D [21:34:33] and the dry run indicates that the next pagelink id (that would have been after the one where it hung) is still unfixed. Dunno what it was doing. [21:36:31] ah, I see, yeah, seeing some lock wait timeouts on this one [21:37:39] That's not good [21:38:41] (CR) Damilare Adedoyin: [C: +1] Allow crawling FundraiserLandingPage in robots.txt [mediawiki-config] - https://gerrit.wikimedia.org/r/971281 (https://phabricator.wikimedia.org/T254808) (owner: Ejegg) [21:38:49] I'm hoping that my cleanup of Draft namespace (a lot of deleting and a small number of moving pages out of it) isn't causing this... :thinking: [21:38:56] seems like some job is colliding with a maintenance script [21:41:12] let's pause on this. It seems like the jobqueue is doing something to the pagelinks table and so the script is just hanging waiting, meanwhile we're fighting for locks with jobrunners [21:41:34] I'm pausing... [21:42:03] Let's try in a few minutes. [21:49:07] hrm, I'm still seeing them come in for srwiki. Given we're now 45 minutes over the window and there's still cleanup winding its way through the jobqueue, let's try running this again in a different window. [21:52:33] thcipriani: What about morning one? [21:52:56] Oh, tomorrow is Friday, there aren't any windows on Fridays. [21:56:55] yeah, we could try it again on Monday, I suppose.
[22:01:48] (03CR) 10Bartosz Dziewoński: [C: 03+1] "I'm not sure how you wanted to do it, so leaving it for you to schedule deployment." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/969401 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [22:01:50] thcipriani: I added it for Monday's morning window. [22:04:00] (03CR) 10Bartosz Dziewoński: Don't remove current wiki family from $wgCentralAuthAutoLoginWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967295 (owner: 10Bartosz Dziewoński) [22:04:04] (03PS4) 10Bartosz Dziewoński: Don't remove current wiki family from $wgCentralAuthAutoLoginWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967295 [22:04:10] (03PS3) 10Bartosz Dziewoński: Clean up $wgCentralAuthAutoLoginWikis configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967977 [22:04:50] Kizule: thanks <3 [22:05:07] (03CR) 10Bartosz Dziewoński: "Scheduled: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231106T1400" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967295 (owner: 10Bartosz Dziewoński) [22:05:13] (03CR) 10Bartosz Dziewoński: "Scheduled: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231106T1400" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967977 (owner: 10Bartosz Dziewoński) [22:05:53] (03CR) 10Bartosz Dziewoński: [C: 03+1] "I'm getting a few other config changes deployed on Monday, should I do this one as well?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/966798 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [22:12:17] WARNING: API error readonly: The database has been automatically locked while the replica database servers catch up to the primary [22:12:17] ERROR: Detected MediaWiki API exception internal_api_error_readonly: The database has been automatically locked while the replica database servers catch up to the primary [22:12:18] [readonlyreason: Waiting for 7 lagged database(s); [22:12:18]  servedby: mw-api-ext.codfw.main-66cfb7d5d7-rswb4; [22:12:19]  help: See https://sr.wikipedia.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes.]; retrying [22:13:35] !log import acme-chief 0.36-2 into bookworm-wikimedia repo [22:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:03] brett: nice :) [22:15:21] as easy as $(seq 1 1000) [22:22:34] There is something wrong with Serbian Wikipedia, can someone check? [22:23:06] Special:Contributions is showing me that the database is overloaded and that changes newer than 660 seconds won't be shown. [22:24:31] grafana is showing db1183 having replication lag with a value of 8 minutes. https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?orgId=1 [22:24:39] Serbian Wikipedia is on s5.
[22:24:59] PROBLEM - MariaDB Replica Lag: s5 on clouddb1021 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 830.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:25:18] PROBLEM - MariaDB Replica Lag: s5 #page on db2137 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 847.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:25:25] PROBLEM - MariaDB Replica Lag: s5 on db1183 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 855.69 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:25:27] PROBLEM - MariaDB Replica Lag: s5 #page on db2128 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 856.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:25:29] PROBLEM - MariaDB Replica Lag: s5 on db1216 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 860.18 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:25:36] PROBLEM - MariaDB Replica Lag: s5 #page on db2171 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 865.47 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:25:39] PROBLEM - MariaDB Replica Lag: s5 on clouddb1016 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 869.25 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:25:39] PROBLEM - MariaDB Replica Lag: s5 on clouddb1020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 870.32 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:25:48] PROBLEM - MariaDB Replica Lag: s5 #page on db2178 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 877.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:25:48] PROBLEM - MariaDB Replica Lag: s5 on db2101 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 877.26 
seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:25:49] PROBLEM - MariaDB Replica Lag: s5 #page on db2111 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 877.28 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:25:49] PROBLEM - MariaDB Replica Lag: s5 #page on db2123 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 877.81 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:25:51] PROBLEM - MariaDB Replica Lag: s5 on db1154 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 881.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:26:00] PROBLEM - MariaDB Replica Lag: s5 #page on db2157 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 888.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:26:03] PROBLEM - MariaDB Replica Lag: s5 on dbstore1003 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 894.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:26:05] PROBLEM - MariaDB Replica Lag: s5 on db1145 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 894.93 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [22:26:24] Woah, woah [22:26:31] I'm around [22:26:33] let me see [22:27:20] woha [22:27:22] I’ll ack the alerts [22:27:48] here too if you need more hands [22:28:05] here too, though I'd probably break more stuff than I'd fix [22:28:53] the replication is not broken, something is putting a lot of pressure on writes [22:29:06] I stopped the pagelinks migration [22:29:46] Amir1: would that explain the mix of db servers? 
[22:30:04] I don't know how the replication works, tbh [22:30:14] it's only s5 https://orchestrator.wikimedia.org/web/cluster/alias/s5 [22:30:42] Jayme was running some maint scripts on s5 [22:31:16] !log killed update collation on s5 [22:31:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:38] there is growth experiments as well [22:31:50] Grafana (https://grafana.wikimedia.org/d/000000303/mysql-replication-lag?orgId=1) is showing db1183:9104 only that has replag. [22:32:52] https://orchestrator.wikimedia.org/web/cluster/alias/s5 [22:32:58] I can see replag here better [22:33:08] recovering finally [22:33:23] Kizule: most of the affected machines are in codfw, try switching the site dropdown at the top :) [22:33:27] sigh spoke too soon [22:34:01] rzl: Oh, right, I forgot that codfw is primary at the moment. [22:34:06] checking binlogs [22:37:05] DELETE /* NamespaceDupes::checkLinkTable */ FROM `pagelinks` WHERE [22:37:11] this is choking things [22:37:27] is anyone running NamespaceDupes? [22:37:36] Amir1: thcipriani was running that. [22:37:40] on srwiki [22:37:49] see in security [22:38:01] Amir1: He was running that but stopped, I think. https://phabricator.wikimedia.org/T350431 [22:38:13] hi - not sure if this is relevant - i deployed a Draft namespace change for bnwiki yesterday and didn't realize i had to run the namespaceDupes maintenance script which I haven't yet run - not sure if someone has tho [22:39:37] the problem is that I can't kill that write operation, if I kill it it basically corrupts the whole section [22:39:46] I have to let it go through [22:40:10] no on runs namespacedupe until further notice [22:40:47] | xxx | system user | | srwiki | Slave_SQL | 1755 | Updating | DELETE /* NamespaceDupes::checkLinkTable */ FROM `pagelinks` WHERE > [22:40:54] time 1755 seconds [22:41:31] oh good. I ran namespace dupes but it seemed hung. 
[22:42:01] please kill it [22:42:02] I killed it [22:42:15] a long time ago though [22:42:25] so it hasn't been running for a long while [22:42:44] killed at 21:32 [22:43:06] Info: DELETE /* NamespaceDupes::checkLinkTable */ FROM `pagelinks` WHERE (pl_from = '927158' AND pl_names [22:43:07] so, older than that 1755 seconds [22:43:30] rzl: it finished on master of s5, now it's running on replicas [22:43:38] (which would be about half an hour, ~22:10) [22:43:39] replication is serial sooooo [22:43:39] 4140 seconds ago [22:43:40] ah nod [22:44:20] I think the query of it is broken and it's probably deleting every row in pagelinks of srwiki [22:44:38] let me try something [22:46:16] 2101s [22:48:44] thcipriani: did you run it on any other wiki? [22:48:56] He didn't, only on srwiki. [22:49:08] okay [22:49:33] I think a broken write operation in the maint script has broken things, I hope it doesn't corrupt too much [22:49:38] nope, ran mwscript namespaceDupes.php srwiki --fix at 20:54, killed at 21:32 after it had been hung for a good long while [22:49:39] fixing it is not too hard [22:51:38] sigh, I think I know why [22:52:00] at the time I wondered if the waitingforreplication in the script was what was hanging, checked the grafana graphs and didn't see anything. I started to see lockwait timeouts at that time from jobrunners. I assumed there was some long running job I was fighting and so thought to reschedule. [22:52:02] the delete doesn't have any limits. Kizule does the change of namespace affected many pages? [22:52:43] this is what I saw pre running --fix: https://phabricator.wikimedia.org/P53134 [22:52:54] Amir1: There wasn't actual namespace change (like you add/remove/change some via uploading patch in operations/mediawiki-config and someone deploys patch). 
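The unbounded DELETE noted at 22:52:02 ("the delete doesn't have any limits") is the usual case for splitting one huge write into small committed batches, so that each replicated transaction stays short. A minimal sketch of that idea, not the namespaceDupes.php fix itself: it assumes Python with the stdlib sqlite3 module standing in for MariaDB, and the helper name delete_in_batches, the batch size, and the after_batch hook (where a real maintenance script would wait for replicas to catch up) are all hypothetical.

```python
import sqlite3
from typing import Callable, Sequence


def delete_in_batches(
    conn: sqlite3.Connection,
    table: str,
    where_sql: str,
    params: Sequence,
    batch_size: int = 500,
    after_batch: Callable[[], None] = lambda: None,
) -> int:
    """Delete matching rows in small committed batches; return rows deleted.

    Hypothetical helper: `table` and `where_sql` must come from trusted code,
    since they are interpolated into the SQL text.
    """
    total = 0
    while True:
        # Pick at most batch_size matching rows by their rowid.
        rows = conn.execute(
            f"SELECT rowid FROM {table} WHERE {where_sql} LIMIT ?",
            (*params, batch_size),
        ).fetchall()
        if not rows:
            return total
        ids = [row[0] for row in rows]
        placeholders = ",".join("?" * len(ids))
        conn.execute(f"DELETE FROM {table} WHERE rowid IN ({placeholders})", ids)
        # Commit per batch so no single transaction grows unbounded.
        conn.commit()
        total += len(ids)
        # Assumed hook: a real script would wait for replication here.
        after_batch()
```

With a batch size of 500, deleting 1200 matching rows would take three short transactions instead of one long one, and the hook gives replicas a chance to keep up between them.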
[22:52:56] https://phabricator.wikimedia.org/T350431 [22:53:29] the dry run implies many pages have been affected [22:53:57] thank you for doing the dry run, that's quite valuable [22:53:59] Because MediaWiki finally started to understand that the Cyrillic and Latin namespaces are the same. [22:54:00] https://gerrit.wikimedia.org/r/c/mediawiki/core/+/969428 [22:54:18] That's why the impact is so high. [22:54:24] that still means many pages are affected [22:54:50] (CR) Wfan: [C: +1] Allow crawling FundraiserLandingPage in robots.txt [mediawiki-config] - https://gerrit.wikimedia.org/r/971281 (https://phabricator.wikimedia.org/T254808) (owner: Ejegg) [22:55:17] I killed the --fix at https://phabricator.wikimedia.org/P53134$1664 which is where it hung [22:55:18] we have no choice but to keep s5 in this state until the large transaction goes through the throat of replicas [22:55:39] and then I will have to add some limits to the script [22:57:13] I think "Crystal_Clear_action_apply.png" is linked in many many pages [22:57:17] that's going to be fun [22:57:53] the good part is that this table will be much smaller soon and the write transactions will be shorter [22:58:03] but still... [22:58:13] Amir1: It's an icon from the Latin variant of an old template for welcoming users. [22:58:22] I guessed [22:58:49] And it was broken because MediaWiki thought that "Slika" is nothing, but it's actually an alias for NS_FILE. [22:59:47] About much smaller soon part: I think it's going to be even smaller on Serbian Wikipedia. :D [23:00:04] Once we solve this, ofc. [23:05:28] T350443 [23:05:29] T350443: namespaceDupes.php doesn't have limit on write queries - https://phabricator.wikimedia.org/T350443 [23:06:08] I need to add more tags [23:09:21] There is so many task about it, wow.
https://phabricator.wikimedia.org/search/query/jfFs3VQcWM6e/#R [23:09:25] *tasks [23:10:15] it has caused data corruption issues before [23:10:23] it's a big mess [23:11:32] How big is pagelinks table on srwiki? [23:11:57] 31GB [23:12:50] I'm trying to figure out roughly how long it took for the transaction to finish in master so I could panic reasonably when that passes [23:17:55] finally [23:18:07] RECOVERY - MariaDB Replica Lag: s5 #page on db2157 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:18:42] \o/ [23:18:50] T350445 just got opened, what do? [23:18:51] T350445: not possible to edit in the german language wikipedia - https://phabricator.wikimedia.org/T350445 [23:18:57] RECOVERY - MariaDB Replica Lag: s5 #page on db2137 is OK: OK slave_sql_lag Replication lag: 0.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:19:06] should just recover on its own, right? [23:19:28] apparently yes :D [23:20:15] kamila_: yeah [23:20:47] nice :) [23:20:48] different replicas take different times to process a large write, if they are multiinstance, etc. it'll take longer [23:21:55] gah, so hang in the script was a massive write, and I started noticing the hang at 21:00, killed at 21:32 and replag started at 22:10, so it took 70 minutes for write to complete for replicas to notice how out of date they were. Is that the shape of what happened? [23:24:25] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 333 probes of 726 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:26:07] thcipriani: roughly, basically a large write on master happens concurrently so it doesn't cause any issue until it's committed and sent to replicas.
Replication is serial so it holds until that transaction is gone through [23:26:18] so master can take writes, but replicas are busy doing that [23:26:54] that seventy minutes was being busy doing the write [23:27:01] (on master) [23:27:12] Amir1: So, if there were any edits made, they will be shown once replag solves on its own? [23:27:19] yes [23:29:14] given that editing looked down, do we need to do something wrt public comms? there is a grumpy person in -tech for instance [23:29:38] Edits have started appearing again in RC on Serbian Wikipedia. \o/ [23:29:49] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 67 probes of 726 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [23:31:02] given that two replicas already recovered, the rest should follow soon so more comm shouldn't be needed IMHO [23:31:16] but we need to make sure no one runs this script until further notice [23:31:25] no need to retrospectively update statuspage? [23:32:50] It's only s5: dewiki, srwiki, enwikivoyage and a couple more large wikis [23:32:56] I don't know what's policy on that [23:33:17] FTR, 176,220,703 rows exist in pagelinks of srwiki [23:34:55] RECOVERY - MariaDB Replica Lag: s5 on db2101 is OK: OK slave_sql_lag Replication lag: 0.41 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:35:11] finally [23:35:14] all recovered [23:35:48] nope, spoke too soon again, only one, pt-heartbeat sometimes show zero [23:36:01] RECOVERY - MariaDB Replica Lag: s5 #page on db2128 is OK: OK slave_sql_lag Replication lag: 0.11 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [23:36:53] Everything is working when I'm accessing via WikimediaDebug. 
:joy: [23:37:54] oh I forgot another fun aspect, since we have chained replication, eqiad replicas will need another 70 minutes on top to catch up, and cloud replicas need around a couple more hours [23:40:07] I think we might want to update statuspage... [23:40:19] I've posted sitenotice on srwiki. [23:41:24] is chained replication going to trigger more delayed updates? [23:41:38] yes [23:41:44] kamila_: yeah, let's do it [23:42:08] on it, unless you want to write it [23:45:21] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:48:09] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:48:19] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:49:45] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:51:17] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:52:25] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:54:21] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:55:07] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:57:21] RECOVERY - MariaDB Replica Lag: s5 #page on db2178 is OK: OK slave_sql_lag Replication lag: 0.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica