[00:07:18] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:28:39] (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:33:11] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:19] PROBLEM - OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f926e15b280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
[00:33:19] org/wiki/Search%23Administration
[00:34:33] RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:41] RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 646, active_shards: 1494, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar
[00:34:41] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0,
active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:38:49] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/967913
[00:38:55] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/967913 (owner: TrainBranchBot)
[00:56:32] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/967913 (owner: TrainBranchBot)
[01:00:20] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:03:50] ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T349576 (phaultfinder)
[01:15:43] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:51:05] (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T0200)
[02:08:07] (PS1) TrainBranchBot: Branch commit for wmf/1.42.0-wmf.2 [core] (wmf/1.42.0-wmf.2) - https://gerrit.wikimedia.org/r/967914 (https://phabricator.wikimedia.org/T348355)
[02:08:13] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.42.0-wmf.2 [core] (wmf/1.42.0-wmf.2) - https://gerrit.wikimedia.org/r/967914
(https://phabricator.wikimedia.org/T348355) (owner: TrainBranchBot)
[02:21:05] (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:25:31] (Merged) jenkins-bot: Branch commit for wmf/1.42.0-wmf.2 [core] (wmf/1.42.0-wmf.2) - https://gerrit.wikimedia.org/r/967914 (https://phabricator.wikimedia.org/T348355) (owner: TrainBranchBot)
[02:34:51] PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 131 probes of 719 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:38:39] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:53:18] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:00:05] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T0300)
[03:02:06] (PS1) TrainBranchBot: testwikis wikis to 1.42.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/968012 (https://phabricator.wikimedia.org/T348355)
[03:02:08] (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.42.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/968012
(https://phabricator.wikimedia.org/T348355) (owner: TrainBranchBot)
[03:02:52] (Merged) jenkins-bot: testwikis wikis to 1.42.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/968012 (https://phabricator.wikimedia.org/T348355) (owner: TrainBranchBot)
[03:03:22] !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.2 refs T348355
[03:03:27] T348355: 1.42.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T348355
[03:03:39] (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:53] RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 51 probes of 719 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:11:37] (PS1) Tim Starling: Increase Lua memory limit to 100MB on Wiktionary only [mediawiki-config] - https://gerrit.wikimedia.org/r/968013 (https://phabricator.wikimedia.org/T165935)
[03:51:15] !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.2 refs T348355 (duration: 47m 53s)
[03:51:20] T348355: 1.42.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T348355
[04:07:18] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:45:46] SRE, Math, RESTBase-API, Wikimedia-production-error: "Math extension cannot connect to Restbase." error in Wikimedia projects - https://phabricator.wikimedia.org/T343648 (Physikerwelt) >>!
In T343648#9274377, @matmarex wrote: > I'm finding it hard to believe, as the rates of errors I linked in T3...
[05:00:20] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:20:58] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 39180
[05:22:00] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 39180
[05:23:43] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:23:45] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:24:17] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 145, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:28:01] SRE, Math, RESTBase-API, Wikimedia-production-error: "Math extension cannot connect to Restbase." error in Wikimedia projects - https://phabricator.wikimedia.org/T343648 (daniel) >>! In T343648#9274943, @Physikerwelt wrote: > @daniel do you think it is possible that the number of retries has chan...
[05:36:27] PROBLEM - Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old.
https://wikitech.wikimedia.org/wiki/Mirrors
[05:41:35] (CR) Marostegui: [C: +1] mariadb: Replace db1127 with db1227 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967907 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[05:43:14] (PS1) Marostegui: install_server: Do not reimage db1228 [puppet] - https://gerrit.wikimedia.org/r/968017
[05:43:32] SRE, Math, RESTBase-API, Wikimedia-production-error: "Math extension cannot connect to Restbase." error in Wikimedia projects - https://phabricator.wikimedia.org/T343648 (daniel) The error handling in MathRestbaseInterface::evaluateRestbaseCheckresponse looks like this: `...
[05:44:09] (CR) Marostegui: [C: +2] install_server: Do not reimage db1228 [puppet] - https://gerrit.wikimedia.org/r/968017 (owner: Marostegui)
[05:44:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 4.920089664641323s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:49:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 5.625917861089405s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:00:06] Deploy window MediaWiki infrastucture (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T0600)
[06:00:06] kormat, marostegui, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb.
Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T0600).
[06:02:11] PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[06:04:26] (PS1) Marostegui: pc1016: Install 10.6 [puppet] - https://gerrit.wikimedia.org/r/968018
[06:05:53] (CR) Marostegui: [C: +2] pc1016: Install 10.6 [puppet] - https://gerrit.wikimedia.org/r/968018 (owner: Marostegui)
[06:08:23] RECOVERY - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[06:23:15] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:23:23] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:25:11] (PS1) Marostegui: pc1015: Move it to new pc4 [puppet] - https://gerrit.wikimedia.org/r/968059
[06:26:04] (CR) Marostegui: [C: +2] pc1015: Move it to new pc4 [puppet] - https://gerrit.wikimedia.org/r/968059 (owner: Marostegui)
[06:27:13] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host
208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:27:21] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:28:21] (PS1) Marostegui: pc2016: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/968107
[06:29:37] (CR) Marostegui: [C: +2] pc2016: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/968107 (owner: Marostegui)
[06:31:30] SRE, Infrastructure-Foundations, netops: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (ayounsi) My understanding is that the LVS term is a fix for that one issue (now that I'm aware of T348446). If we start adding `anycast` then we need to add all...
[06:32:08] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 15435
[06:32:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15435
[06:33:50] !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-e8-eqiad
[06:33:50] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e8-eqiad
[06:40:33] (CR) Filippo Giunchedi: modules: cleanup last dispatch renmants (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967952 (https://phabricator.wikimedia.org/T344937) (owner: Filippo Giunchedi)
[06:42:24] !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[06:42:45] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:43:23] RECOVERY - Router interfaces on cr2-codfw is OK: OK:
host 208.80.153.193, interfaces up: 146, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:43:43] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:44:28] !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ipv6 from pc1015 - marostegui@cumin1001"
[06:45:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ipv6 from pc1015 - marostegui@cumin1001"
[06:45:35] !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:46:55] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 145, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:47:17] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:47:27] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:47:53] (PS1) Filippo Giunchedi: data-engineering: fix deploy-tag for skein cert expiry [alerts] - https://gerrit.wikimedia.org/r/968112 (https://phabricator.wikimedia.org/T329398)
[06:53:18] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile -
https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:53:34] (PS4) Giuseppe Lavagetto: mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025)
[06:53:59] (PS1) Marostegui: install_server: Do not reimage db1227 [puppet] - https://gerrit.wikimedia.org/r/968113
[06:54:27] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 146, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:54:32] !log +50G to prometheus/analytics in eqiad
[06:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:46] (CR) Marostegui: [C: +2] install_server: Do not reimage db1227 [puppet] - https://gerrit.wikimedia.org/r/968113 (owner: Marostegui)
[06:54:47] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:54:59] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:57:07] (CR) Filippo Giunchedi: [C: +2] data-engineering: fix deploy-tag for skein cert expiry [alerts] - https://gerrit.wikimedia.org/r/968112 (https://phabricator.wikimedia.org/T329398) (owner: Filippo Giunchedi)
[06:58:42] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: provisionning db1227 - T344036
[06:58:45] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: provisionning db1227 - T344036
[06:58:46]
T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036
[06:58:47] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: provisionning db1227 - T344036
[06:58:50] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: provisionning db1227 - T344036
[06:59:45] (CR) Arnaudb: [C: +2] mariadb: Replace db1127 with db1227 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967907 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[07:00:04] Amir1, Urbanecm, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T0700).
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:00:35] (PS1) Ayounsi: Remove Coherence report check [puppet] - https://gerrit.wikimedia.org/r/968114
[07:02:12] (PS5) Giuseppe Lavagetto: mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025)
[07:06:09] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:06:47] (PS1) Ayounsi: Ignore missing VM from PuppetDB with a tenant [software/netbox-extras] - https://gerrit.wikimedia.org/r/968116
[07:07:13] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:08:12] !log marostegui@cumin1001 START - Cookbook sre.mysql.clone of db1127.eqiad.wmnet onto db1227.eqiad.wmnet
[07:10:01] PROBLEM - Router interfaces on cr2-codfw is
CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 145, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:10:25] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:10:41] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:11:49] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:15] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 146, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:15] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:26:53] RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 1 hours old.
https://wikitech.wikimedia.org/wiki/Mirrors
[07:26:57] (PS1) Filippo Giunchedi: alertmanager: let karma use apache to access AM [puppet] - https://gerrit.wikimedia.org/r/968119 (https://phabricator.wikimedia.org/T321579)
[07:27:35] !log repool db2109
[07:27:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P53030 and previous config saved to /var/cache/conftool/dbconfig/20231024-072745-arnaudb.json
[07:28:59] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:29:32] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 145, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:34:45] (PS1) Arnaudb: mariadb: re-enable db2109 [puppet] - https://gerrit.wikimedia.org/r/967915 (https://phabricator.wikimedia.org/T347318)
[07:38:22] (CR) Marostegui: mariadb: re-enable db2109 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967915 (https://phabricator.wikimedia.org/T347318) (owner: Arnaudb)
[07:41:14] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 146, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:41:56] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:42:29] (PS2) Arnaudb: mariadb: re-enable db2109 [puppet] - https://gerrit.wikimedia.org/r/967915
(https://phabricator.wikimedia.org/T347318)
[07:42:51] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 20%: Maint over', diff saved to https://phabricator.wikimedia.org/P53031 and previous config saved to /var/cache/conftool/dbconfig/20231024-074250-arnaudb.json
[07:42:57] (CR) Marostegui: [C: +1] mariadb: re-enable db2109 [puppet] - https://gerrit.wikimedia.org/r/967915 (https://phabricator.wikimedia.org/T347318) (owner: Arnaudb)
[07:49:00] SRE, ops-eqiad: Add test server to rack E8 - https://phabricator.wikimedia.org/T349168 (ayounsi) Thanks!
[07:56:54] (CR) Arnaudb: [C: +2] mariadb: re-enable db2109 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967915 (https://phabricator.wikimedia.org/T347318) (owner: Arnaudb)
[07:57:56] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 30%: Maint over', diff saved to https://phabricator.wikimedia.org/P53032 and previous config saved to /var/cache/conftool/dbconfig/20231024-075755-arnaudb.json
[08:07:18] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:07:27] (PS1) DDesouza: miscweb: update research-landing-page [deployment-charts] - https://gerrit.wikimedia.org/r/968225 (https://phabricator.wikimedia.org/T219903)
[08:09:50] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:09:56] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:13:00] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 40%: Maint over', diff saved to https://phabricator.wikimedia.org/P53033 and previous config saved to
/var/cache/conftool/dbconfig/20231024-081300-arnaudb.json
[08:28:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 50%: Maint over', diff saved to https://phabricator.wikimedia.org/P53034 and previous config saved to /var/cache/conftool/dbconfig/20231024-082805-arnaudb.json
[08:29:03] SRE-swift-storage, Commons, MediaWiki-Uploading, Unstewarded-production-error, Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007 (Beao) Sounds reasonable. I was also able to purge the remaining preview imag...
[08:33:23] PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:33:59] !log marostegui@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1127.eqiad.wmnet onto db1227.eqiad.wmnet
[08:43:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 60%: Maint over', diff saved to https://phabricator.wikimedia.org/P53035 and previous config saved to /var/cache/conftool/dbconfig/20231024-084310-arnaudb.json
[08:52:24] (PS2) Majavah: wmnet: drop cloudmetrics CNAMEs [dns] - https://gerrit.wikimedia.org/r/967898 (https://phabricator.wikimedia.org/T326266)
[08:52:35] (CR) Majavah: [C: +2] wmnet: drop cloudmetrics CNAMEs [dns] - https://gerrit.wikimedia.org/r/967898 (https://phabricator.wikimedia.org/T326266) (owner: Majavah)
[08:53:09] ops-eqiad, serviceops: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 (elukey)
[08:56:27] ops-eqiad, serviceops: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 (hashar)
[08:58:05] (CR) JMeybohm: [C: +2] mw-on-k8s: Globally enable certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/967940 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[08:58:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109
(re)pooling @ 70%: Maint over', diff saved to https://phabricator.wikimedia.org/P53036 and previous config saved to /var/cache/conftool/dbconfig/20231024-085815-arnaudb.json
[09:00:05] ops-eqiad, serviceops: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 (hashar)
[09:00:09] SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (hashar)
[09:00:17] (Merged) jenkins-bot: mw-on-k8s: Globally enable certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/967940 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[09:00:20] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:00:27] ops-eqiad, serviceops: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 (hashar)
[09:00:55] (PS1) Kevin Bazira: ml-services: update recommendation-api-ng image [deployment-charts] - https://gerrit.wikimedia.org/r/967917 (https://phabricator.wikimedia.org/T348607)
[09:01:45] PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:53] (CR) David Caro: [C: +2] openstack: add antelope to the tests [puppet] - https://gerrit.wikimedia.org/r/967934 (owner: David Caro)
[09:03:36] !log taavi@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host deploy1002
[09:03:43] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:04:48] !log taavi@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host deploy1002
[09:04:49] RECOVERY - Host deploy1002 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[09:05:51] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,rsync-deployment_module.service,rsync-patches_module.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:06:35] taavi: eh, was in the middle of doing the same with homer
[09:06:39] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:09:03] XioNoX: sorry, I saw releng asking about it on -serviceops and connected it to the move yesterday, homer was showing an unrelated diff with some cp nodes so I figured I'd use the cookbook instead
[09:09:16] taavi: yep you did right
[09:09:45] PROBLEM - Check whether ferm is active by checking the default input chain on deploy1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:10:04] ops-eqiad, serviceops: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 (taavi) The host was moved in {T308339} but the switch ports were not updated. I ran `sre.network.configure-switch-interface` to configure the port as Homer was showing an unrelated diff. That does mean that t...
[09:11:50] !log restart ferm on deploy1002 T349587 [09:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:11:54] T349587: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 [09:13:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 80%: Maint over', diff saved to https://phabricator.wikimedia.org/P53037 and previous config saved to /var/cache/conftool/dbconfig/20231024-091319-arnaudb.json [09:16:01] (03PS1) 10Filippo Giunchedi: alertmanager: also allow local access to the API [puppet] - 10https://gerrit.wikimedia.org/r/968231 (https://phabricator.wikimedia.org/T321579) [09:16:03] (03PS1) 10Filippo Giunchedi: alertmanager: ship amtool.yml for AM api access [puppet] - 10https://gerrit.wikimedia.org/r/968232 (https://phabricator.wikimedia.org/T321579) [09:16:25] !log upload golang-github-florianl-go-tc to apt.wm.o (bookworm) - T348837 [09:16:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:16:30] T348837: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837 [09:16:40] (03CR) 10CI reject: [V: 04-1] alertmanager: ship amtool.yml for AM api access [puppet] - 10https://gerrit.wikimedia.org/r/968232 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi) [09:18:04] (03PS2) 10Filippo Giunchedi: alertmanager: ship amtool.yml for AM api access [puppet] - 10https://gerrit.wikimedia.org/r/968232 (https://phabricator.wikimedia.org/T321579) [09:19:42] (03CR) 10Filippo Giunchedi: "https://puppet-compiler.wmflabs.org/output/968119/150/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/968119 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi) [09:20:10] (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: also allow local access to the API [puppet] - 10https://gerrit.wikimedia.org/r/968231 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi) [09:22:17] 
(03CR) 10Filippo Giunchedi: "https://puppet-compiler.wmflabs.org/output/968232/151/" [puppet] - 10https://gerrit.wikimedia.org/r/968232 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi) [09:23:43] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:23:45] RECOVERY - Check systemd state on wdqs1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:23:53] RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:25:48] (03CR) 10Filippo Giunchedi: "This will work, however I'd rather not maintain yet another list of datacenters in remote_syslog_tls. Since the functionality to add per-d" [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [09:28:05] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/967951 (owner: 10Muehlenhoff) [09:28:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 90%: Maint over', diff saved to https://phabricator.wikimedia.org/P53038 and previous config saved to /var/cache/conftool/dbconfig/20231024-092824-arnaudb.json [09:28:32] (03CR) 10Jbond: [C: 03+1] idp::memcached Avoid Ferm-specific syntax (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967951 (owner: 10Muehlenhoff) [09:28:59] (03CR) 10Elukey: [C: 03+1] ml-services: update recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/967917 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [09:32:35] (03PS1) 10Jbond: cas: improve error messages [puppet] - 10https://gerrit.wikimedia.org/r/968235 [09:33:27] (03CR) 10Jbond: [C: 
03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/968114 (owner: 10Ayounsi) [09:33:57] (03CR) 10Jbond: [C: 03+1] Ignore missing VM from PuppetDB with a tenant [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/968116 (owner: 10Ayounsi) [09:34:55] (03PS17) 10Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) [09:36:53] !log jnuche@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.2 refs T348355 [09:36:58] T348355: 1.42.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T348355 [09:37:11] (03PS1) 10Filippo Giunchedi: prometheus: enable thanos upload for cloud instance [puppet] - 10https://gerrit.wikimedia.org/r/968238 (https://phabricator.wikimedia.org/T336854) [09:37:13] (03PS1) 10Filippo Giunchedi: prometheus: enable pint for 'cloud' instance [puppet] - 10https://gerrit.wikimedia.org/r/968239 (https://phabricator.wikimedia.org/T336854) [09:37:32] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/153/console" [puppet] - 10https://gerrit.wikimedia.org/r/968235 (owner: 10Jbond) [09:37:57] (03CR) 10Jbond: [V: 03+1 C: 03+2] cas: improve error messages [puppet] - 10https://gerrit.wikimedia.org/r/968235 (owner: 10Jbond) [09:38:32] (03CR) 10Ayounsi: [C: 03+2] Ignore missing VM from PuppetDB with a tenant [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/968116 (owner: 10Ayounsi) [09:38:53] (03CR) 10Jbond: [V: 03+1 C: 03+2] graphite: migrate configparse to new puppet API [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [09:39:04] (03CR) 10Jbond: [V: 03+1 C: 03+2] graphite: migrate configparse to new puppet API (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) (owner: 10Jbond) [09:39:09] !log ayounsi@cumin1001 
START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [09:39:15] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox [09:40:04] (03CR) 10Hashar: Add a json representation of the build (032 comments) [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 (owner: 10Hashar) [09:40:07] RECOVERY - Check whether ferm is active by checking the default input chain on deploy1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [09:40:21] (03PS1) 10Majavah: P:wmcs::metricsinfra: update karma config to match alerts.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/968240 [09:40:56] ^^ in case anyone is wondering, train presync failed last night, so I'm rerunning [09:41:04] (03CR) 10Ayounsi: [C: 03+2] Remove Coherence report check [puppet] - 10https://gerrit.wikimedia.org/r/968114 (owner: 10Ayounsi) [09:43:11] (03CR) 10Kevin Bazira: [C: 03+2] ml-services: update recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/967917 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [09:43:13] (03PS5) 10Hashar: Add a json representation of the build [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 [09:43:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P53039 and previous config saved to /var/cache/conftool/dbconfig/20231024-094329-arnaudb.json [09:43:47] (03Merged) 10jenkins-bot: ml-services: update recommendation-api-ng image [deployment-charts] - 10https://gerrit.wikimedia.org/r/967917 (https://phabricator.wikimedia.org/T348607) (owner: 10Kevin Bazira) [09:44:47] (03CR) 10Hashar: "https://gerrit.wikimedia.org/r/c/operations/software/puppet-compiler/+/967407/4..5" [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 (owner: 10Hashar) [09:45:24] 
!log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [09:46:27] (03CR) 10CI reject: [V: 04-1] Add a json representation of the build [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 (owner: 10Hashar) [09:48:41] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [09:49:17] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' . [09:49:29] (03CR) 10Jelto: [C: 03+2] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/968225 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [09:50:05] (03CR) 10Jbond: "also adding jesses who is another good resource for general puppet reviews" [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [09:50:19] (03Merged) 10jenkins-bot: miscweb: update research-landing-page [deployment-charts] - 10https://gerrit.wikimedia.org/r/968225 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [09:55:49] (03PS6) 10Hashar: Add a json representation of the build [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 [09:57:18] (03CR) 10Filippo Giunchedi: [C: 03+1] P:wmcs::metricsinfra: update karma config to match alerts.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/968240 (owner: 10Majavah) [09:59:02] (03CR) 10CI reject: [V: 04-1] Add a json representation of the build [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 (owner: 10Hashar) [10:00:05] Deploy window MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T1000) [10:01:03] RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:02:20] 
!log jnuche@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.2 refs T348355 (duration: 25m 27s) [10:02:25] T348355: 1.42.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T348355 [10:04:11] PROBLEM - MariaDB Replica SQL: s2 on dbstore1007 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:04:31] !log jnuche@deploy2002 Pruned MediaWiki: 1.41.0-wmf.30 (duration: 02m 08s) [10:04:35] PROBLEM - mysqld processes on dbstore1007 is CRITICAL: PROCS CRITICAL: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [10:06:10] (03CR) 10Effie Mouzeli: [C: 03+2] Update recommendation-api to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967406 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:07:03] (03Merged) 10jenkins-bot: Update recommendation-api to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967406 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:07:57] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply [10:08:33] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply [10:10:33] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: apply [10:10:50] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [10:13:22] (03PS6) 10Giuseppe Lavagetto: mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025) [10:14:00] (03CR) 10Majavah: [C: 03+2] P:wmcs::metricsinfra: update karma config to match alerts.wm.o [puppet] - 10https://gerrit.wikimedia.org/r/968240 (owner: 10Majavah) [10:14:23] PROBLEM - MariaDB Replica Lag: s2 on dbstore1007 is CRITICAL: CRITICAL slave_sql_lag could not connect 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:14:45] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: apply [10:15:15] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: apply [10:18:56] (03PS1) 10Majavah: P:wmcs::metricsinfra: fix karma config [puppet] - 10https://gerrit.wikimedia.org/r/968245 [10:23:49] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/154/con" [puppet] - 10https://gerrit.wikimedia.org/r/968238 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [10:26:23] (03CR) 10Majavah: [V: 03+1 C: 03+1] "thanks" [puppet] - 10https://gerrit.wikimedia.org/r/968238 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [10:26:37] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply [10:26:55] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/recommendation-api: apply [10:27:09] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply [10:27:22] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: apply [10:27:28] (03CR) 10Majavah: "What would enabling pint mean in practice for us? We get alerts on some types of issues in the alert rules?" 
[puppet] - 10https://gerrit.wikimedia.org/r/968239 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [10:29:09] (03CR) 10Samtar: [C: 03+1] Increase Lua memory limit to 100MB on Wiktionary only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968013 (https://phabricator.wikimedia.org/T165935) (owner: 10Tim Starling) [10:30:25] <_joe_> jouncebot: next [10:30:26] In 1 hour(s) and 29 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T1200) [10:30:32] <_joe_> jouncebot: now [10:30:32] For the next 0 hour(s) and 29 minute(s): MediaWiki infrastucture (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T1000) [10:31:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [10:32:19] (03Merged) 10jenkins-bot: mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025) (owner: 10Giuseppe Lavagetto) [10:37:50] (03CR) 10Effie Mouzeli: [C: 03+2] Update shellbox to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967410 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:38:44] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10User-RhinosF1: Volunteer NDA for RhinosF1 - https://phabricator.wikimedia.org/T341272 (10Aklapper) > Was this done @KFrancis ? Let's move the general NDA workflow discussion to {T349595}, as this specific request for RhinosF1 is resolved. Thanks! 
[10:38:54] (03Merged) 10jenkins-bot: Update shellbox to use certmanager certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/967410 (https://phabricator.wikimedia.org/T300033) (owner: 10JMeybohm) [10:39:23] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [10:39:48] (03CR) 10Jbond: "overall looks good, most comments are style nits but there is one error around the use of site_name which is not getting passed to the core " [puppet] - 10https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [10:40:10] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [10:41:41] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [10:42:02] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [10:42:32] (03CR) 10Jbond: [C: 03+1] "lgtm" [labs/private] - 10https://gerrit.wikimedia.org/r/967519 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt) [10:42:59] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on an-test-client1002.eqiad.wmnet with reason: Cold booting with ganeti to increase RAM [10:43:24] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on an-test-client1002.eqiad.wmnet with reason: Cold booting with ganeti to increase RAM [10:43:47] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply [10:44:15] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [10:46:44] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [10:47:39] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [10:49:23] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:53:10] 
jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply [10:53:11] (03PS1) 10Giuseppe Lavagetto: mesh: add new minor for configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/968247 [10:53:13] (03PS1) 10Giuseppe Lavagetto: mesh: fix parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/968248 [10:53:18] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [10:54:01] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [10:54:29] (03PS1) 10Giuseppe Lavagetto: mediawiki: update to mesh.configuration:1.4.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/968249 [10:54:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mesh: add new minor for configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/968247 (owner: 10Giuseppe Lavagetto) [10:55:25] (03Merged) 10jenkins-bot: mesh: add new minor for configuration [deployment-charts] - 10https://gerrit.wikimedia.org/r/968247 (owner: 10Giuseppe Lavagetto) [10:56:00] (03PS1) 10Jbond: prometheus: realise blackbox::check's instantly on prometheus hosts [puppet] - 10https://gerrit.wikimedia.org/r/968250 [10:56:54] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [10:57:09] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [10:57:34] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [10:57:54] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [10:57:55] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [10:58:11] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [10:59:14] 
!log jiji@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [10:59:28] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [11:00:16] (03CR) 10JMeybohm: [C: 03+1] mesh: fix parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/968248 (owner: 10Giuseppe Lavagetto) [11:00:20] (03CR) 10JMeybohm: [C: 03+1] mediawiki: update to mesh.configuration:1.4.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/968249 (owner: 10Giuseppe Lavagetto) [11:01:01] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:01:24] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mesh: fix parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/968248 (owner: 10Giuseppe Lavagetto) [11:03:14] (03Merged) 10jenkins-bot: mesh: fix parameter [deployment-charts] - 10https://gerrit.wikimedia.org/r/968248 (owner: 10Giuseppe Lavagetto) [11:03:29] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki: update to mesh.configuration:1.4.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/968249 (owner: 10Giuseppe Lavagetto) [11:03:48] (03CR) 10Jbond: "I'm not sure this is currently useful right now but it replicates what we do in `monitoring::service` and came about from a different revi" [puppet] - 10https://gerrit.wikimedia.org/r/968250 (owner: 10Jbond) [11:04:38] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [11:04:45] (03Merged) 10jenkins-bot: mediawiki: update to mesh.configuration:1.4.3 [deployment-charts] - 10https://gerrit.wikimedia.org/r/968249 (owner: 10Giuseppe Lavagetto) [11:04:59] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [11:05:01] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [11:05:17] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply 
[11:07:51] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [11:07:53] (03CR) 10Jbond: "you also need to tox -e py3-format" [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 (owner: 10Hashar) [11:08:07] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [11:08:28] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [11:08:52] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [11:08:53] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [11:09:17] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [11:11:29] !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [11:11:51] !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [11:12:18] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [11:12:45] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [11:12:46] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [11:13:10] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [11:15:24] !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:16:34] !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:17:03] !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:17:21] !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:26:43] RECOVERY - mysqld processes on dbstore1007 is OK: PROCS OK: 3 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [11:27:19] RECOVERY - MariaDB Replica Lag: s2 on 
dbstore1007 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [11:27:37] RECOVERY - MariaDB Replica SQL: s2 on dbstore1007 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T1200) [12:07:18] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [12:12:33] (03PS1) 10Brouberol: Fix: remove duplicate /topicmappr path in zk URL and disable compression [puppet] - 10https://gerrit.wikimedia.org/r/968257 [12:34:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 3.5258375591962716s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:35:25] (03PS2) 10Brouberol: Fix: remove duplicate /topicmappr path in zk URL [puppet] - 10https://gerrit.wikimedia.org/r/968257 [12:39:15] (03PS2) 10Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 [12:39:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.13681268189364s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:41:27] !log migrate idp_test to puppet7 [12:41:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:42:39] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/155/con" [puppet] - 10https://gerrit.wikimedia.org/r/968257 (owner: 10Brouberol) [12:43:26] (03PS3) 10Brouberol: Fix: remove duplicate /topicmappr path in zk URL [puppet] - 10https://gerrit.wikimedia.org/r/968257 [12:43:35] (03PS4) 10Brouberol: Fix: remove duplicate /topicmappr path in zk URL [puppet] - 10https://gerrit.wikimedia.org/r/968257 [12:44:03] (03PS3) 10Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) [12:44:19] (03PS1) 10Jbond: idp_test: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/968258 (https://phabricator.wikimedia.org/T340739) [12:44:34] (03PS5) 10Brouberol: Fix: remove duplicate /topicmappr path in zk URL [puppet] - 10https://gerrit.wikimedia.org/r/968257 [12:45:12] (03CR) 10Hashar: Add a json representation of the build (031 comment) [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 (owner: 10Hashar) [12:46:17] (03PS7) 10Hashar: Add a json representation of the build [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 [12:48:19] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/156/con" [puppet] - 10https://gerrit.wikimedia.org/r/968257 (owner: 10Brouberol) [12:48:25] (03CR) 10Jbond: [C: 03+2] idp_test: migrate to puppet7 [puppet] - 10https://gerrit.wikimedia.org/r/968258 (https://phabricator.wikimedia.org/T340739) (owner: 10Jbond) [12:48:53] (03PS6) 10Brouberol: Fix: remove duplicate 
/topicmappr path in zk URL [puppet] - 10https://gerrit.wikimedia.org/r/968257 [12:50:20] (03PS1) 10Tsevener: Add stream config for iOS schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968259 (https://phabricator.wikimedia.org/T347122) [12:50:28] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/157/con" [puppet] - 10https://gerrit.wikimedia.org/r/968257 (owner: 10Brouberol) [12:55:11] (03PS4) 10Samtar: InitialiseSettings-labs: Set values for renamed PageTriage variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965395 (https://phabricator.wikimedia.org/T331595) (owner: 10MPGuy2824) [12:56:30] (03CR) 10Samtar: [C: 03+2] "beta-only change, +2ing prior to window" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965395 (https://phabricator.wikimedia.org/T331595) (owner: 10MPGuy2824) [12:57:06] (03CR) 10Filippo Giunchedi: [C: 03+1] "doh!" [puppet] - 10https://gerrit.wikimedia.org/r/968245 (owner: 10Majavah) [12:57:16] (03CR) 10Majavah: [C: 03+2] P:wmcs::metricsinfra: fix karma config [puppet] - 10https://gerrit.wikimedia.org/r/968245 (owner: 10Majavah) [12:57:48] (03Merged) 10jenkins-bot: InitialiseSettings-labs: Set values for renamed PageTriage variable [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965395 (https://phabricator.wikimedia.org/T331595) (owner: 10MPGuy2824) [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T1300). [13:00:05] JSherman, TheresNoTime, and dcausse: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:08] * TheresNoTime can deploy [13:00:18] o/ [13:00:20] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [13:00:26] JSherman: I've already +2'd your change as it was beta-only, so it should be live on beta in a few minutes [13:01:16] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968013 (https://phabricator.wikimedia.org/T165935) (owner: 10Tim Starling) [13:01:55] thanks! [13:02:07] (03Merged) 10jenkins-bot: Increase Lua memory limit to 100MB on Wiktionary only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968013 (https://phabricator.wikimedia.org/T165935) (owner: 10Tim Starling) [13:03:00] !log samtar@deploy2002 Started scap: Backport for [[gerrit:968013|Increase Lua memory limit to 100MB on Wiktionary only (T165935)]] [13:03:17] T165935: "Lua error: not enough memory" on certain en.wiktionary pages - https://phabricator.wikimedia.org/T165935 [13:04:28] !log samtar@deploy2002 samtar and tstarling: Backport for [[gerrit:968013|Increase Lua memory limit to 100MB on Wiktionary only (T165935)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:04:31] * TheresNoTime testing [13:05:45] !log samtar@deploy2002 samtar and tstarling: Continuing with sync [13:06:29] (03PS7) 10Samtar: cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse) [13:06:38] (03PS7) 10Samtar: cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse) [13:06:40] (KeyholderUnarmed) 
firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [13:10:10] hi all, FYI I added my patch to the deploy calendar right as this window was starting, hope that's okay [13:10:51] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:968013|Increase Lua memory limit to 100MB on Wiktionary only (T165935)]] (duration: 07m 51s) [13:10:56] T165935: "Lua error: not enough memory" on certain en.wiktionary pages - https://phabricator.wikimedia.org/T165935 [13:11:19] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10ayounsi) Following up a bit on that, I think the easiest next step is to also add `$netbox_infra_devices = lookup('profile:... [13:12:32] (just checking T349612 was indeed only a temporary bump..) [13:12:32] T349612: LuaSandboxMemoryError: not enough memory - https://phabricator.wikimedia.org/T349612 [13:13:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse) [13:13:25] proceeding with your patches dcausse [13:13:33] TheresNoTime: thanks! [13:13:35] toni_: (ack, that's fine!) 
[13:14:02] (03Merged) 10jenkins-bot: cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse) [13:14:25] !log samtar@deploy2002 Started scap: Backport for [[gerrit:957726|cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream (T325565)]] [13:14:30] T325565: Add support for page re-renders - https://phabricator.wikimedia.org/T325565 [13:15:47] !log samtar@deploy2002 samtar and dcausse: Backport for [[gerrit:957726|cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream (T325565)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:15:58] dcausse: live on mwdebug, can you test that? [13:15:59] looking ^ [13:16:02] yes [13:16:06] (ack) [13:16:58] TheresNoTime: all good, I'll need to restart eventgate-main once this one is deployed and before shipping the next one [13:17:07] okay [13:17:10] !log samtar@deploy2002 samtar and dcausse: Continuing with sync [13:18:20] (03PS4) 10Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) [13:19:52] TheresNoTime: is it too late to do a backport that I forgot to put on the list earlier? 
[13:20:08] (03PS2) 10Jforrester: [Staging only] wikifunctions: Raise PyWASM CPU limits by 4x [deployment-charts] - 10https://gerrit.wikimedia.org/r/968004 [13:20:15] (03CR) 10Jforrester: [C: 03+2] [Staging only] wikifunctions: Raise PyWASM CPU limits by 4x [deployment-charts] - 10https://gerrit.wikimedia.org/r/968004 (owner: 10Jforrester) [13:20:22] cormacparle: it'll probably be okay, depends what :) [13:20:51] it's this one https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaSearch/+/967411/ [13:21:07] was hoping to backport it last night but there were no deployers around [13:21:13] (03Merged) 10jenkins-bot: [Staging only] wikifunctions: Raise PyWASM CPU limits by 4x [deployment-charts] - 10https://gerrit.wikimedia.org/r/968004 (owner: 10Jforrester) [13:21:43] cormacparle: yeah that's fine, can you cherry-pick it & add it to the backport calendar? [13:22:04] sure, gimme a sec [13:22:11] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:957726|cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream (T325565)]] (duration: 07m 45s) [13:22:15] T325565: Add support for page re-renders - https://phabricator.wikimedia.org/T325565 [13:22:17] dcausse: that first patch is deployed, let me know when I can start the next [13:22:22] sure [13:22:36] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [13:22:50] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [13:22:50] !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: sync [13:23:18] !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync [13:23:31] (03PS1) 10Cparle: Fix typo (undefined event) [extensions/MediaSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/967780 (https://phabricator.wikimedia.org/T349271) [13:23:51] !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [13:24:18] !log dcausse@deploy2002 
helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [13:24:21] !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync [13:24:36] !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync [13:25:05] !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync [13:25:22] !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync [13:25:38] (03PS5) 10Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) [13:25:40] 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T349576 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact [13:25:59] TheresNoTime: should be done [13:26:05] ack, starting the next [13:26:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse) [13:26:45] this one touches jobrunners so can't be tested [13:26:52] (03Merged) 10jenkins-bot: cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse) [13:27:15] !log samtar@deploy2002 Started scap: Backport for [[gerrit:957727|cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki (T325565)]] [13:27:19] T325565: Add support for page re-renders - https://phabricator.wikimedia.org/T325565 [13:28:39] !log samtar@deploy2002 samtar and dcausse: Backport for [[gerrit:957727|cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki (T325565)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:28:45] !log samtar@deploy2002 samtar and dcausse: Continuing with sync [13:28:56] 
(syncing as can't be tested) [13:29:10] (03PS2) 10Samtar: Add stream config for iOS schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968259 (https://phabricator.wikimedia.org/T347122) (owner: 10Tsevener) [13:29:39] !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [13:29:41] (03CR) 10Matthias Mullie: [C: 03+1] Fix typo (undefined event) [extensions/MediaSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/967780 (https://phabricator.wikimedia.org/T349271) (owner: 10Cparle) [13:30:00] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [13:30:03] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [13:30:31] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [13:30:33] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [13:30:52] TheresNoTime: that cherry-pick is done and in the deploy calendar [13:31:02] cormacparle: thanks :) [13:31:03] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [13:33:55] (03CR) 10Filippo Giunchedi: "Thank you for the followup!" [puppet] - 10https://gerrit.wikimedia.org/r/968250 (owner: 10Jbond) [13:34:10] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:957727|cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki (T325565)]] (duration: 06m 55s) [13:34:12] dcausse: live on prod [13:34:19] TheresNoTime: thanks! 
[13:34:29] toni_: doing yours now [13:34:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968259 (https://phabricator.wikimedia.org/T347122) (owner: 10Tsevener) [13:35:07] (03CR) 10Filippo Giunchedi: prometheus: enable pint for 'cloud' instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968239 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [13:35:20] (03Merged) 10jenkins-bot: Add stream config for iOS schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968259 (https://phabricator.wikimedia.org/T347122) (owner: 10Tsevener) [13:35:43] !log samtar@deploy2002 Started scap: Backport for [[gerrit:968259|Add stream config for iOS schema (T347122)]] [13:36:03] (03CR) 10Samtar: [C: 03+2] "starting CI for backport" [extensions/MediaSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/967780 (https://phabricator.wikimedia.org/T349271) (owner: 10Cparle) [13:36:12] T325565: Add support for page re-renders - https://phabricator.wikimedia.org/T325565 [13:36:17] T347122: Document Instrumentation and Schema Needs for Suggested Edits on iOS Proof of Concept - https://phabricator.wikimedia.org/T347122 [13:37:06] !log samtar@deploy2002 samtar and tsev: Backport for [[gerrit:968259|Add stream config for iOS schema (T347122)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:37:08] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: enable thanos upload for cloud instance [puppet] - 10https://gerrit.wikimedia.org/r/968238 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [13:37:11] toni_: live on mwdebug, can you test? [13:37:49] (03CR) 10Majavah: [C: 03+1] "ok, sounds good!" [puppet] - 10https://gerrit.wikimedia.org/r/968239 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [13:38:23] TheresNoTime tested, looks good! 
[13:38:28] !log samtar@deploy2002 samtar and tsev: Continuing with sync [13:41:33] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Migarate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond) [13:43:16] (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 2.395174465963646s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:43:35] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:968259|Add stream config for iOS schema (T347122)]] (duration: 07m 52s) [13:43:41] T347122: Document Instrumentation and Schema Needs for Suggested Edits on iOS Proof of Concept - https://phabricator.wikimedia.org/T347122 [13:43:41] toni_: live on prod :) [13:43:43] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): update acme chief access - https://phabricator.wikimedia.org/T349620 (10jbond) [13:44:18] cormacparle: just waiting for your patch to merge, shouldn't be long [13:44:22] TheresNoTime great, thank you! 
[13:44:24] 👍 [13:46:55] (03PS1) 10Jbond: acme_chief: add pki root certificate to list of trusted roots [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) [13:48:16] (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.395174465963646s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:49:16] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: enable pint for 'cloud' instance [puppet] - 10https://gerrit.wikimedia.org/r/968239 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [13:49:22] (03PS2) 10Filippo Giunchedi: prometheus: enable pint for 'cloud' instance [puppet] - 10https://gerrit.wikimedia.org/r/968239 (https://phabricator.wikimedia.org/T336854) [13:50:48] (03Merged) 10jenkins-bot: Fix typo (undefined event) [extensions/MediaSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/967780 (https://phabricator.wikimedia.org/T349271) (owner: 10Cparle) [13:51:28] !log samtar@deploy2002 Started scap: Backport for [[gerrit:967780|Fix typo (undefined event) (T349271)]] [13:51:33] T349271: Errors in at reportLoadTiming: Cannot read properties of undefined (reading 'loadEventEnd') / event is undefined / Cannot read property 'loadEventEnd' of undefined - https://phabricator.wikimedia.org/T349271 [13:52:48] !log samtar@deploy2002 samtar and cparle: Backport for [[gerrit:967780|Fix typo (undefined event) (T349271)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:52:52] cormacparle: live on mwdebug, can you test? [13:53:31] sure ... [13:55:42] works! 
[13:55:46] !log samtar@deploy2002 samtar and cparle: Continuing with sync [13:55:50] syncing :) [13:55:58] (03CR) 10Jbond: prometheus: realise blackbox::check's instantly on prometheus hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968250 (owner: 10Jbond) [14:00:55] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:967780|Fix typo (undefined event) (T349271)]] (duration: 09m 26s) [14:00:56] cormacparle: live on prod :) [14:01:10] T349271: Errors in at reportLoadTiming: Cannot read properties of undefined (reading 'loadEventEnd') / event is undefined / Cannot read property 'loadEventEnd' of undefined - https://phabricator.wikimedia.org/T349271 [14:01:47] !log close backport window [14:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:06] (03CR) 10Stevemunene: [C: 03+1] Fix: remove duplicate /topicmappr path in zk URL [puppet] - 10https://gerrit.wikimedia.org/r/968257 (owner: 10Brouberol) [14:03:13] TheresNoTime: do I need to wait a few mins for the js cache to clear? still seeing the bug on prod (and not on debug) [14:04:44] cormacparle: potentially (I'm not quite sure myself), let me see if there's any docs on that.. [14:06:19] (03CR) 10Brouberol: [V: 03+1 C: 03+2] Fix: remove duplicate /topicmappr path in zk URL [puppet] - 10https://gerrit.wikimedia.org/r/968257 (owner: 10Brouberol) [14:07:04] ah yes, it "can take up to five minutes" [14:07:17] aha ok, will wait a little while so ... [14:08:27] (or does `?debug=true` work?) [14:09:32] it does indeed [14:09:53] great, thank you! 
[14:10:24] you're welcome :) [14:11:42] 10SRE-OnFire, 10Cloud-VPS, 10Observability-Alerting, 10cloud-services-team, and 2 others: monitoring: find out how we could have been paged for outage "Multiple CloudVPS instances lost their IPs" - https://phabricator.wikimedia.org/T347694 (10lmata) [14:12:11] (03PS2) 10Jbond: acme_chief: add new puppet intermediate CA to list of trusted clients [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) [14:13:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/158/con" [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) (owner: 10Jbond) [14:14:33] (03CR) 10Vgutierrez: [C: 04-1] acme_chief: add new puppet intermediate CA to list of trusted clients (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) (owner: 10Jbond) [14:16:14] (03PS3) 10Jbond: acme_chief: add new puppet intermediate CA to list of trusted clients [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) [14:16:24] (03CR) 10Jbond: "fixed thanks" [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) (owner: 10Jbond) [14:17:48] (03CR) 10Majavah: [C: 04-1] "hardcoding the certs will break acme-chief cert validation in cloud vps" [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) (owner: 10Jbond) [14:19:22] (03CR) 10Filippo Giunchedi: prometheus: realise blackbox::check's instantly on prometheus hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968250 (owner: 10Jbond) [14:19:57] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[14:31:07] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:31:19] (03PS6) 10Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) [14:31:33] (03PS1) 10Majavah: site: Re-image cloudmetrics hosts as insetup [puppet] - 10https://gerrit.wikimedia.org/r/968277 (https://phabricator.wikimedia.org/T336774) [14:31:35] (03PS1) 10Majavah: hieradata: drop prometheus access for cloudmetrics1003/4 [puppet] - 10https://gerrit.wikimedia.org/r/968278 (https://phabricator.wikimedia.org/T336854) [14:31:37] (03PS1) 10Majavah: P:alertmanager: drop cloudmetrics hosts [puppet] - 10https://gerrit.wikimedia.org/r/968279 (https://phabricator.wikimedia.org/T336854) [14:31:39] (03PS1) 10Majavah: P:wmcs::prometheus: drop profile [puppet] - 10https://gerrit.wikimedia.org/r/968280 (https://phabricator.wikimedia.org/T336854) [14:31:41] (03PS1) 10Majavah: P:wmcs: drop graphite manifests [puppet] - 10https://gerrit.wikimedia.org/r/968281 [14:31:43] (03PS1) 10Majavah: O:wmcs::monitoring: drop role [puppet] - 10https://gerrit.wikimedia.org/r/968282 (https://phabricator.wikimedia.org/T336774) [14:32:04] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Adding db1227 depooled', diff saved to https://phabricator.wikimedia.org/P53041 and previous config saved to /var/cache/conftool/dbconfig/20231024-143204-arnaudb.json [14:38:39] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:16] (03CR) 10Fabfur: [C: 03+2] hiera: enable dual disk storage for new cp hosts in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/967235 
(https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur) [14:39:41] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10Aklapper) [14:39:56] (03CR) 10Fabfur: [C: 03+2] haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [14:40:12] (03CR) 10Fabfur: "wrong window" [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur) [14:41:07] (03PS8) 10Hashar: Add a json representation of the build [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 [14:41:11] (03PS3) 10Hashar: Add a json representation for each host [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967479 [14:42:04] (03PS4) 10Jbond: acme_chief: add new puppet intermediate CA to list of trusted clients [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) [14:42:19] (03CR) 10Hashar: Add a json representation for each host (031 comment) [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967479 (owner: 10Hashar) [14:42:31] (03CR) 10CI reject: [V: 04-1] acme_chief: add new puppet intermediate CA to list of trusted clients [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) (owner: 10Jbond) [14:43:17] (03CR) 10Hashar: Add a json representation of the build (031 comment) [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 (owner: 10Hashar) [14:45:37] (03PS5) 10Jbond: acme_chief: add new puppet intermediate CA to list of trusted clients [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) [14:47:32] (03PS6) 10Jbond: acme_chief: add new puppet intermediate CA to list of trusted clients [puppet] - 10https://gerrit.wikimedia.org/r/968269 
(https://phabricator.wikimedia.org/T349620) [14:48:36] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1100.eqiad.wmnet with OS bullseye [14:50:08] (03PS7) 10Jbond: acme_chief: add new puppet intermediate CA to list of trusted clients [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) [14:50:14] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1016.eqiad.wmnet [14:50:41] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:51:30] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/162/con" [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) (owner: 10Jbond) [14:53:18] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [14:53:20] (03PS8) 10Jbond: acme_chief: add new puppet intermediate CA to list of trusted clients [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) [14:53:39] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:54:47] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/968269 
(https://phabricator.wikimedia.org/T349620) (owner: 10Jbond) [14:55:19] (03CR) 10Jbond: [V: 03+1] acme_chief: add new puppet intermediate CA to list of trusted clients (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) (owner: 10Jbond) [14:57:08] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1016.eqiad.wmnet [14:58:44] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1100.eqiad.wmnet with OS bullseye [14:59:28] (03CR) 10Herron: [C: 03+1] alertmanager: ship amtool.yml for AM api access [puppet] - 10https://gerrit.wikimedia.org/r/968232 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi) [14:59:42] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1100.eqiad.wmnet with OS bullseye [15:00:04] eoghan, jelto, and arnoldokoth: #bothumor I ♥ Unicode. All rise for SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T1500).
[15:00:40] (03PS1) 10Filippo Giunchedi: prometheus: add cloud replica label [puppet] - 10https://gerrit.wikimedia.org/r/968284 (https://phabricator.wikimedia.org/T336854) [15:00:50] (03CR) 10Herron: [C: 03+1] alertmanager: let karma use apache to access AM [puppet] - 10https://gerrit.wikimedia.org/r/968119 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi) [15:02:22] (03PS1) 10Aqu: [WIP] Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) [15:02:59] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:04:13] (03PS2) 10Filippo Giunchedi: prometheus: add cloud replica label [puppet] - 10https://gerrit.wikimedia.org/r/968284 (https://phabricator.wikimedia.org/T336854) [15:04:54] (03CR) 10CI reject: [V: 04-1] [WIP] Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [15:05:06] (03CR) 10Aqu: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [15:06:04] (03PS7) 10Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) [15:07:06] (03CR) 10Majavah: [C: 03+1] prometheus: add cloud replica label [puppet] - 10https://gerrit.wikimedia.org/r/968284 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi) [15:07:24] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: add cloud replica label [puppet] - 10https://gerrit.wikimedia.org/r/968284 (https://phabricator.wikimedia.org/T336854) 
(owner: 10Filippo Giunchedi) [15:09:08] (03PS8) 10Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) [15:10:07] !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1100.eqiad.wmnet with OS bullseye [15:11:48] !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1100.eqiad.wmnet with OS bullseye [15:14:37] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:15:51] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:19:21] (03PS9) 10Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739) [15:22:21] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:22:31] !log clean up overlapping blocks from thanos for instance 'cloud' [15:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:27] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:26:21] PROBLEM - Check systemd state on titan2001 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-compact.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:26:33] !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1100.eqiad.wmnet with reason: host reimage [15:26:41] the thanos-compact is me 
[15:27:43] RECOVERY - Check systemd state on titan2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:32:12] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1100.eqiad.wmnet with reason: host reimage [15:45:27] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:46:01] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:47:05] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.672 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:47:25] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:47:41] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:48:47] !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1100.eqiad.wmnet with OS bullseye [15:49:15] (03CR) 10Bartosz Dziewoński: [C: 03+1] "https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T2000" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967995 (owner: 10Gergő Tisza) [15:55:55] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:00:05] jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T1600). 
[16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:07:18] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [16:09:01] (03PS1) 10MVernon: profile::tlsproxy::envoy fix docstring typo [puppet] - 10https://gerrit.wikimedia.org/r/968291 [16:12:38] (03CR) 10RLazarus: [C: 03+1] profile::tlsproxy::envoy fix docstring typo [puppet] - 10https://gerrit.wikimedia.org/r/968291 (owner: 10MVernon) [16:13:07] (03CR) 10MVernon: [C: 03+2] profile::tlsproxy::envoy fix docstring typo [puppet] - 10https://gerrit.wikimedia.org/r/968291 (owner: 10MVernon) [16:19:11] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:19:49] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:20:25] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:21:01] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:21:40] (03PS1) 10Jbond: systemd::service: Add service owner parameter [puppet] - 10https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176) [16:38:58] 10SRE-swift-storage, 10API Platform, 10Commons, 10MediaWiki-File-management, and 4 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (10aaron) [16:39:19] 10SRE, 10Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [16:44:04] !log 
xcollazo@deploy2002 Started deploy [airflow-dags/analytics@cc56357]: Deploying latest DAGs to analytics Airflow instance [16:46:00] !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@cc56357]: Deploying latest DAGs to analytics Airflow instance (duration: 01m 55s) [16:46:04] (03PS1) 10BCornwall: hiera: remove dns5003 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/968294 (https://phabricator.wikimedia.org/T342154) [16:47:16] (03CR) 10BCornwall: [C: 03+2] hiera: remove dns5003 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/968294 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [16:53:35] (03CR) 10Btullis: [WIP] Send metrics from Airflow analytics test (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [16:59:16] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns5003.wikimedia.org with OS bookworm [16:59:28] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns5003.wikimedia.org with OS bookworm [17:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T1700) [17:00:20] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [17:03:03] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:04:05] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [17:06:47] PROBLEM - Host 2001:df2:e500:1:103:102:166:10 is DOWN: PING 
CRITICAL - Packet loss = 100% [17:06:54] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:08:39] PROBLEM - Recursive DNS on 103.102.166.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [17:08:39] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:16:58] (03PS6) 10Hnowlan: Upgrade container and dependencies for bullseye [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881) [17:17:35] (03PS1) 10Andrew Bogott: trove-guestagent: include service credentials [puppet] - 10https://gerrit.wikimedia.org/r/968299 (https://phabricator.wikimedia.org/T349651) [17:17:38] (03PS1) 10Andrew Bogott: codfw1dev keystone/swift: make endpoints public [puppet] - 10https://gerrit.wikimedia.org/r/968300 (https://phabricator.wikimedia.org/T349651) [17:17:42] (03PS1) 10Fabfur: hiera: added new cp hosts for eqiad [puppet] - 10https://gerrit.wikimedia.org/r/968301 (https://phabricator.wikimedia.org/T349244) [17:23:39] (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:26:35] (03CR) 10Jforrester: "Apparently caused T349648." [dns] - 10https://gerrit.wikimedia.org/r/967898 (https://phabricator.wikimedia.org/T326266) (owner: 10Majavah) [17:28:10] (03CR) 10Ssingh: "Looks good overall for the missing bits, sorry for overlooking them in the last review. 
One comment we should fix in this CR in-line and t" [puppet] - 10https://gerrit.wikimedia.org/r/968301 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur) [17:29:52] (03CR) 10Andrew Bogott: "Taavi, Francesco and I discussed this today. Since these creds wind up only in the Trove service project (which has limited access) and th" [puppet] - 10https://gerrit.wikimedia.org/r/968299 (https://phabricator.wikimedia.org/T349651) (owner: 10Andrew Bogott) [17:32:56] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:36:05] (03CR) 10Andrew Bogott: "For this patch and the previous one: https://puppet-compiler.wmflabs.org/output/968300/167/" [puppet] - 10https://gerrit.wikimedia.org/r/968300 (https://phabricator.wikimedia.org/T349651) (owner: 10Andrew Bogott) [17:46:00] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:46:08] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns5003.wikimedia.org with reason: host reimage [17:46:08] (03PS1) 10Ottomata: eventgate-logging-external - upgrade to debian bookworm and nodejs 18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/968304 (https://phabricator.wikimedia.org/T347477) [17:47:57] (03CR) 10Ottomata: [C: 03+2] eventgate-logging-external - upgrade to debian bookworm and nodejs 18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/968304 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [17:48:49] (03Merged) 10jenkins-bot: eventgate-logging-external - upgrade to debian bookworm and nodejs 18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/968304 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [17:49:14] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on dns5003.wikimedia.org with reason: host reimage [17:50:06] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [17:51:57] PROBLEM - Disk space on Hadoop worker on an-worker1128 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [17:53:39] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:54:29] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:55:35] PROBLEM - Recursive DNS on 103.102.166.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [18:00:05] dancy and brennen: Time to snap out of that daydream and deploy MediaWiki train - Utc-7 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T1800). [18:00:14] o/ [18:00:41] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [18:02:33] o/ [18:03:03] PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [18:03:35] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [18:04:48] * dancy reads https://phabricator.wikimedia.org/T349310 [18:05:51] looks safe to proceed. 
[18:06:37] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968326 (https://phabricator.wikimedia.org/T348355) [18:06:39] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968326 (https://phabricator.wikimedia.org/T348355) (owner: 10TrainBranchBot) [18:06:41] (03PS1) 10Ottomata: eventgate-logging-external - use 127.0.0.1 instead of localhost [deployment-charts] - 10https://gerrit.wikimedia.org/r/968327 (https://phabricator.wikimedia.org/T347477) [18:07:27] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968326 (https://phabricator.wikimedia.org/T348355) (owner: 10TrainBranchBot) [18:08:13] PROBLEM - Disk space on Hadoop worker on an-worker1146 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [18:08:45] (03CR) 10Ottomata: [C: 03+2] eventgate-logging-external - use 127.0.0.1 instead of localhost [deployment-charts] - 10https://gerrit.wikimedia.org/r/968327 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [18:09:31] (03Merged) 10jenkins-bot: eventgate-logging-external - use 127.0.0.1 instead of localhost [deployment-charts] - 10https://gerrit.wikimedia.org/r/968327 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [18:11:14] (03CR) 10Majavah: [C: 04-1] "Everything is open to the cloud vps VM ranges by default, so I don't think this is needed?" 
[puppet] - 10https://gerrit.wikimedia.org/r/968300 (https://phabricator.wikimedia.org/T349651) (owner: 10Andrew Bogott) [18:12:23] (03PS1) 10Jdrewniak: Follow-up to 74b5834: Add language prefix to Readability survey [skins/Vector] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/968311 (https://phabricator.wikimedia.org/T349232) [18:12:49] (03PS1) 10Jdrewniak: Follow-up to 74b5834: Add language prefix to Readability survey [skins/Vector] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968312 (https://phabricator.wikimedia.org/T349232) [18:13:33] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply [18:13:40] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:13:48] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.2 refs T348355 [18:13:50] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [18:13:53] T348355: 1.42.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T348355 [18:14:01] RECOVERY - Recursive DNS on 103.102.166.10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [18:14:30] (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:14:31] RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [18:15:32] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [18:16:00] !log otto@deploy2002 helmfile [codfw] DONE 
helmfile.d/services/eventgate-logging-external: apply [18:18:12] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [18:18:56] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [18:21:04] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:23:55] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [18:24:43] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [18:25:58] (03PS1) 10Jdrewniak: Enable Vector readability survey on select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968328 (https://phabricator.wikimedia.org/T349232) [18:26:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:28:21] RECOVERY - Disk space on Hadoop worker on an-worker1146 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [18:29:53] (03CR) 10Eevans: [C: 03+2] Decommission restbase2012 [puppet] - 10https://gerrit.wikimedia.org/r/968006 (https://phabricator.wikimedia.org/T349526) (owner: 10Eevans) [18:31:35] (03PS1) 10Ottomata: eventgate-analytics - upgrade to debian bookworm and nodejs 18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/968330 (https://phabricator.wikimedia.org/T347477) [18:31:41] !log eevans@cumin1001 START - Cookbook sre.hosts.decommission for hosts 
restbase2012.codfw.wmnet [18:34:09] (03PS1) 10Majavah: openstack: encapi: don't try to hold a single connection open [puppet] - 10https://gerrit.wikimedia.org/r/968331 (https://phabricator.wikimedia.org/T349195) [18:34:39] (03CR) 10CI reject: [V: 04-1] openstack: encapi: don't try to hold a single connection open [puppet] - 10https://gerrit.wikimedia.org/r/968331 (https://phabricator.wikimedia.org/T349195) (owner: 10Majavah) [18:35:11] RECOVERY - Disk space on Hadoop worker on an-worker1128 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [18:35:12] (03PS2) 10Majavah: openstack: encapi: don't try to hold a single connection open [puppet] - 10https://gerrit.wikimedia.org/r/968331 (https://phabricator.wikimedia.org/T349195) [18:35:14] (03CR) 10Ottomata: [C: 03+2] eventgate-analytics - upgrade to debian bookworm and nodejs 18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/968330 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [18:36:43] (03Merged) 10jenkins-bot: eventgate-analytics - upgrade to debian bookworm and nodejs 18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/968330 (https://phabricator.wikimedia.org/T347477) (owner: 10Ottomata) [18:37:34] !log eevans@cumin1001 START - Cookbook sre.dns.netbox [18:38:43] !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [18:39:15] !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [18:39:46] !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase2012.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001" [18:41:02] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [18:41:06] !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered 
by cookbooks.sre.dns.netbox: restbase2012.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001" [18:41:06] !log eevans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:41:06] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase2012.codfw.wmnet [18:41:18] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [18:42:06] !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [18:42:07] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [18:42:08] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [18:42:59] !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [18:47:04] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [18:47:05] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [18:47:27] !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [18:48:14] !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [18:48:24] (03PS1) 10Andrew Bogott: developer-portal: update version tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/968333 (https://phabricator.wikimedia.org/T349045) [18:48:38] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [18:48:39] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [18:49:01] (03CR) 10Alex Paskulin: [C: 03+1] developer-portal: update version tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/968333 (https://phabricator.wikimedia.org/T349045) (owner: 10Andrew Bogott) [18:49:18] (03CR) 10Andrew Bogott: [C: 03+2] developer-portal: update version tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/968333 (https://phabricator.wikimedia.org/T349045) (owner: 10Andrew Bogott) [18:50:02] !log bking@cumin1001 START - Cookbook 
sre.wdqs.data-reload [18:50:03] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [18:50:07] (03Merged) 10jenkins-bot: developer-portal: update version tag [deployment-charts] - 10https://gerrit.wikimedia.org/r/968333 (https://phabricator.wikimedia.org/T349045) (owner: 10Andrew Bogott) [18:50:29] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [18:50:42] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [18:50:53] (03CR) 10Ssingh: hiera: added new cp hosts for eqiad (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968301 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur) [18:52:05] 10ops-codfw, 10Cassandra, 10decommission-hardware: decommission restbase2012.codfw.wmnet - https://phabricator.wikimedia.org/T349526 (10Eevans) [18:53:06] (03PS1) 10Ottomata: eventgate-analytics-external - upgrade to debian bookworm and nodejs 18 [deployment-charts] - 10https://gerrit.wikimedia.org/r/968334 (https://phabricator.wikimedia.org/T347477) [18:53:18] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [18:54:39] 10ops-codfw, 10Cassandra, 10decommission-hardware: decommission restbase2012.codfw.wmnet - https://phabricator.wikimedia.org/T349526 (10Eevans) [18:54:47] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns5003.wikimedia.org with OS bookworm [18:54:54] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [18:54:56] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host 
dns5003.wikimedia.org with OS bookworm completed: - dns5003 (**PASS**) - Downtimed on Icinga/Al... [18:55:08] !log andrew@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply [18:55:26] !log andrew@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [18:59:05] !log andrew@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [18:59:29] !log andrew@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply [19:00:02] !log andrew@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [19:00:39] !log andrew@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [19:03:47] (03PS1) 10BCornwall: Revert "hiera: remove dns5003 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/968313 [19:04:15] (03CR) 10Ssingh: [C: 03+1] Revert "hiera: remove dns5003 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/968313 (owner: 10BCornwall) [19:05:22] (03CR) 10BCornwall: [C: 03+2] Revert "hiera: remove dns5003 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/968313 (owner: 10BCornwall) [19:08:22] (03Abandoned) 10Andrew Bogott: codfw1dev keystone/swift: make endpoints public [puppet] - 10https://gerrit.wikimedia.org/r/968300 (https://phabricator.wikimedia.org/T349651) (owner: 10Andrew Bogott) [19:10:52] (03PS1) 10Eevans: Decommission restbase1016 [puppet] - 10https://gerrit.wikimedia.org/r/968339 (https://phabricator.wikimedia.org/T349526) [19:10:54] (03PS1) 10Eevans: Decommission restbase1017 [puppet] - 10https://gerrit.wikimedia.org/r/968340 (https://phabricator.wikimedia.org/T349526) [19:10:56] (03PS1) 10Eevans: Decommission restbase1018 [puppet] - 10https://gerrit.wikimedia.org/r/968341 (https://phabricator.wikimedia.org/T349526) [19:11:42] (03PS1) 10BCornwall: hiera: remove dns5004 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/968342 (https://phabricator.wikimedia.org/T342154) 
[19:12:30] (03CR) 10Majavah: [C: 03+1] "as discussed earlier I think this is fine. users don't have direct access to the VMs and this is a trove-specific password" [puppet] - 10https://gerrit.wikimedia.org/r/968299 (https://phabricator.wikimedia.org/T349651) (owner: 10Andrew Bogott) [19:13:13] (03CR) 10Ssingh: [C: 03+1] hiera: remove dns5004 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/968342 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [19:13:54] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/968341 (https://phabricator.wikimedia.org/T349526) (owner: 10Eevans) [19:14:05] (03CR) 10BCornwall: [C: 03+2] hiera: remove dns5004 from authdns_servers [puppet] - 10https://gerrit.wikimedia.org/r/968342 (https://phabricator.wikimedia.org/T342154) (owner: 10BCornwall) [19:16:19] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:20:39] (03PS1) 10Ottomata: Enable canary events for all MediaWiki event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968344 (https://phabricator.wikimedia.org/T266798) [19:23:28] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns5004.wikimedia.org with OS bookworm [19:23:38] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns5004.wikimedia.org with OS bookworm [19:27:27] PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:28:33] PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [19:31:45] (03PS1) 10Bking: wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 
(https://phabricator.wikimedia.org/T349011) [19:33:39] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:35:33] (03CR) 10CI reject: [V: 04-1] wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) (owner: 10Bking) [19:36:25] (03PS2) 10Bking: wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) [19:40:01] (03PS2) 10Ottomata: Enable canary events for all MediaWiki event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968344 (https://phabricator.wikimedia.org/T266798) [19:40:25] (03CR) 10CI reject: [V: 04-1] wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) (owner: 10Bking) [19:42:12] (03PS3) 10Bking: wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) [19:43:03] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [19:43:04] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [19:44:59] (03PS4) 10Bking: wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) [19:45:27] (03PS2) 10C. 
Scott Ananian: Enable Parsoid interal REST API only on Parsoid cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980) [19:45:27] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [19:45:28] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [19:46:46] (03PS5) 10Bking: wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) [19:47:24] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/968119 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi) [19:47:27] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [19:47:28] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [19:48:37] (03PS3) 10C. Scott Ananian: Disable Parsoid internal REST API everywhere except on Parsoid cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980) [19:48:51] (03CR) 10Ottomata: [C: 04-1] "-1 until we coordinate with some folks, and send an announcement." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968344 (https://phabricator.wikimedia.org/T266798) (owner: 10Ottomata) [19:49:08] (03CR) 10C. Scott Ananian: Disable Parsoid internal REST API everywhere except on Parsoid cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980) (owner: 10C. 
Scott Ananian) [19:49:20] (03PS6) 10Bking: wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) [19:49:30] !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload [19:53:10] (03CR) 10CI reject: [V: 04-1] wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) (owner: 10Bking) [19:53:40] (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:56:36] !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@c585842]: T346373: Update mjolnir to use python 3.10 [19:56:43] T346373: Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373 [19:57:05] !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@c585842]: T346373: Update mjolnir to use python 3.10 (duration: 00m 28s) [19:57:48] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/968232 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T2000). [20:00:05] MatmaRex, jan_drewniak, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:20] (03CR) 10Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [20:00:25] O/ [20:00:30] hi. all of my changes today are no-ops [20:00:59] (so, feel free to ship them all at once without testing) [20:02:17] jan_drewniak: cscott: yesterday there was no deployer for this window, so if either of you are able to deploy, you might want to get started [20:03:17] thcipriani: ^ [20:06:14] MatmaRex: cscott: ok, I can do the deploys in that case [20:07:18] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [20:07:41] MatmaRex: I'm doing yours first [20:08:39] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967208 (owner: 10Bartosz Dziewoński) [20:08:41] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [20:08:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967973 (https://phabricator.wikimedia.org/T344757) (owner: 10Bartosz Dziewoński) [20:08:45] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967995 (owner: 10Gergő Tisza) [20:09:50] MatmaRex: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/967208 needs a rebase [20:10:15] (03PS3) 10Bartosz Dziewoński: Update comment about EditAttemptStep instruments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967208 [20:10:37] well, 
they all do, but they rebase cleanly, so you can just click the button in gerrit [20:10:58] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns5004.wikimedia.org with reason: host reimage [20:11:03] (03PS7) 10Bartosz Dziewoński: CentralAuth: Clarify why we don't use second-level domain for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [20:11:12] (03PS3) 10Bartosz Dziewoński: Remove unused VisualEditor config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967973 (https://phabricator.wikimedia.org/T344757) [20:11:20] (03PS2) 10Bartosz Dziewoński: [noop] Explain more thoroughly how the '-' prefix works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967995 (owner: 10Gergő Tisza) [20:11:30] (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967208 (owner: 10Bartosz Dziewoński) [20:11:31] jan_drewniak: they should be good to go now [20:11:32] (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [20:11:34] (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967973 (https://phabricator.wikimedia.org/T344757) (owner: 10Bartosz Dziewoński) [20:11:36] (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967995 (owner: 10Gergő Tisza) [20:12:12] (03CR) 10CI reject: [V: 04-1] [noop] Explain more thoroughly how the '-' prefix works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967995 (owner: 10Gergő Tisza) [20:12:22] (03Merged) 10jenkins-bot: Update comment about EditAttemptStep instruments [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/967208 (owner: 10Bartosz Dziewoński) [20:12:24] (03Merged) 10jenkins-bot: CentralAuth: Clarify why we don't use second-level domain for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza) [20:13:07] (03PS3) 10Jdrewniak: [noop] Explain more thoroughly how the '-' prefix works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967995 (owner: 10Gergő Tisza) [20:13:31] (03Merged) 10jenkins-bot: Remove unused VisualEditor config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967973 (https://phabricator.wikimedia.org/T344757) (owner: 10Bartosz Dziewoński) [20:14:09] (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 47.22% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:14:11] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns5004.wikimedia.org with reason: host reimage [20:14:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967973 (https://phabricator.wikimedia.org/T344757) (owner: 10Bartosz Dziewoński) [20:14:22] (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967995 (owner: 10Gergő Tisza) [20:14:25] (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967995 (owner: 10Gergő Tisza) [20:16:09] (03Merged) 10jenkins-bot: [noop] Explain more thoroughly how the '-' prefix works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967995 (owner: 10Gergő Tisza) [20:16:35] !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:967208|Update comment about EditAttemptStep instruments]], [[gerrit:967394|CentralAuth: Clarify why we don't use 
second-level domain for some wikis (T257852)]], [[gerrit:967973|Remove unused VisualEditor config settings (T344757 T344759)]], [[gerrit:967995|[noop] Explain more thoroughly how the '-' prefix works]] [20:16:54] T257852: CentralAuth edge login and autologin for some Wikimedia domains broken on mobile - https://phabricator.wikimedia.org/T257852 [20:16:54] T344759: Remove VisualEditorTransitionDefault config and AutodisableVisualEditorPref maint script - https://phabricator.wikimedia.org/T344759 [20:16:54] T344757: Remove the BetaFeatures integration in VisualEditor - https://phabricator.wikimedia.org/T344757 [20:17:58] !log jdrewniak@deploy2002 tgr and matmarex and jdrewniak: Backport for [[gerrit:967208|Update comment about EditAttemptStep instruments]], [[gerrit:967394|CentralAuth: Clarify why we don't use second-level domain for some wikis (T257852)]], [[gerrit:967973|Remove unused VisualEditor config settings (T344757 T344759)]], [[gerrit:967995|[noop] Explain more thoroughly how the '-' prefix works]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:18:07] PROBLEM - Recursive DNS on 103.102.166.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:18:40] !log jdrewniak@deploy2002 tgr and matmarex and jdrewniak: Continuing with sync [20:19:09] (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 47.22% idle - https://bit.ly/wmf-fpmsat - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:23:56] !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:967208|Update comment about EditAttemptStep instruments]], [[gerrit:967394|CentralAuth: Clarify why we don't use second-level domain for some wikis (T257852)]], [[gerrit:967973|Remove unused VisualEditor config settings (T344757 T344759)]], [[gerrit:967995|[noop] Explain more thoroughly how the '-' prefix works]] (duration: 07m 21s) [20:24:10] T257852: CentralAuth edge login and 
autologin for some Wikimedia domains broken on mobile - https://phabricator.wikimedia.org/T257852 [20:24:11] T344759: Remove VisualEditorTransitionDefault config and AutodisableVisualEditorPref maint script - https://phabricator.wikimedia.org/T344759 [20:24:11] T344757: Remove the BetaFeatures integration in VisualEditor - https://phabricator.wikimedia.org/T344757 [20:24:25] * jan_drewniak MatmaRex: done! [20:24:29] (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:24:36] thanks jan_drewniak [20:25:09] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968328 (https://phabricator.wikimedia.org/T349232) (owner: 10Jdrewniak) [20:25:14] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/968311 (https://phabricator.wikimedia.org/T349232) (owner: 10Jdrewniak) [20:25:25] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968312 (https://phabricator.wikimedia.org/T349232) (owner: 10Jdrewniak) [20:25:59] (03PS2) 10Jdrewniak: Enable Vector readability survey on select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968328 (https://phabricator.wikimedia.org/T349232) [20:26:51] (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968328 (https://phabricator.wikimedia.org/T349232) (owner: 10Jdrewniak) [20:26:57] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.1) - 
10https://gerrit.wikimedia.org/r/968311 (https://phabricator.wikimedia.org/T349232) (owner: 10Jdrewniak) [20:27:05] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968312 (https://phabricator.wikimedia.org/T349232) (owner: 10Jdrewniak) [20:27:15] (03Merged) 10jenkins-bot: Enable Vector readability survey on select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968328 (https://phabricator.wikimedia.org/T349232) (owner: 10Jdrewniak) [20:28:02] (i'm here, btw) [20:28:40] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:30:07] PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS [20:38:40] (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:38:47] (03Merged) 10jenkins-bot: Follow-up to 74b5834: Add language prefix to Readability survey [skins/Vector] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/968311 (https://phabricator.wikimedia.org/T349232) (owner: 10Jdrewniak) [20:38:49] RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [20:38:55] RECOVERY - Recursive DNS on 103.102.166.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS [20:39:22] hey cscott: merge is going sloowly, 3min eta on my patches... 
[20:39:30] (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [20:42:22] (03Merged) 10jenkins-bot: Follow-up to 74b5834: Add language prefix to Readability survey [skins/Vector] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968312 (https://phabricator.wikimedia.org/T349232) (owner: 10Jdrewniak) [20:42:42] there seems to be a ton of lag [20:42:46] !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:968328|Enable Vector readability survey on select wikis (T349232)]], [[gerrit:968311|Follow-up to 74b5834: Add language prefix to Readability survey (T349232)]], [[gerrit:968312|Follow-up to 74b5834: Add language prefix to Readability survey (T349232)]] [20:42:51] are there any known things going on? [20:42:56] T349232: Readability survey should link to language-specific feedback form - https://phabricator.wikimedia.org/T349232 [20:44:01] jan_drewniak: no worries, i'm patient (and working on other things) [20:44:08] !log jdrewniak@deploy2002 jdrewniak: Backport for [[gerrit:968328|Enable Vector readability survey on select wikis (T349232)]], [[gerrit:968311|Follow-up to 74b5834: Add language prefix to Readability survey (T349232)]], [[gerrit:968312|Follow-up to 74b5834: Add language prefix to Readability survey (T349232)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:44:31] !log jdrewniak@deploy2002 jdrewniak: Continuing with sync [20:46:03] (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:49:43] !log jdrewniak@deploy2002 
Finished scap: Backport for [[gerrit:968328|Enable Vector readability survey on select wikis (T349232)]], [[gerrit:968311|Follow-up to 74b5834: Add language prefix to Readability survey (T349232)]], [[gerrit:968312|Follow-up to 74b5834: Add language prefix to Readability survey (T349232)]] (duration: 06m 57s) [20:49:48] T349232: Readability survey should link to language-specific feedback form - https://phabricator.wikimedia.org/T349232 [20:50:12] ok cscott: finally your turn [20:50:17] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980) (owner: 10C. Scott Ananian) [20:51:02] (03PS4) 10Jdrewniak: Disable Parsoid internal REST API everywhere except on Parsoid cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980) (owner: 10C. Scott Ananian) [20:51:03] (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:51:14] (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980) (owner: 10C. Scott Ananian) [20:51:43] (03CR) 10CI reject: [V: 04-1] Disable Parsoid internal REST API everywhere except on Parsoid cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980) (owner: 10C. Scott Ananian) [20:52:03] (03Merged) 10jenkins-bot: Disable Parsoid internal REST API everywhere except on Parsoid cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980) (owner: 10C. 
Scott Ananian) [20:52:09] (03CR) 10Jdrewniak: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980) (owner: 10C. Scott Ananian) [20:53:01] cscott: shoot looks like I'm getting a CI error... [20:53:16] oh never mind [20:53:33] !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:965608|Disable Parsoid internal REST API everywhere except on Parsoid cluster (T334980)]] [20:53:40] T334980: Run visual diff testing without RL and other hacks to compare Parsoid rendering against legacy parser rendering - https://phabricator.wikimedia.org/T334980 [20:54:54] !log jdrewniak@deploy2002 jdrewniak and cscott: Backport for [[gerrit:965608|Disable Parsoid internal REST API everywhere except on Parsoid cluster (T334980)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:55:09] cscott: does anything need to be checked for this patch? [20:55:54] it's on mwdebug [20:56:11] i can quickly verify that the parsoid api isn't present on mwdebug, hang on [21:00:10] jan_drewniak: looks good, go ahead [21:00:46] !log jdrewniak@deploy2002 jdrewniak and cscott: Continuing with sync [21:03:47] (03PS1) 10Ssingh: wikimedia.org: add verification for Jamf [dns] - 10https://gerrit.wikimedia.org/r/968354 (https://phabricator.wikimedia.org/T349665) [21:05:03] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:05:04] (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [21:05:21] (03PS1) 10Jdlrobson: [Visual change] Normalize small font sizes in Vector 2022 [skins/Vector] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968314 
(https://phabricator.wikimedia.org/T346062) [21:05:49] RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:06:12] !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:965608|Disable Parsoid internal REST API everywhere except on Parsoid cluster (T334980)]] (duration: 12m 39s) [21:06:18] T334980: Run visual diff testing without RL and other hacks to compare Parsoid rendering against legacy parser rendering - https://phabricator.wikimedia.org/T334980 [21:06:41] RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [21:06:54] (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:07:26] (03CR) 10Cwhite: "Change overall looks good. Inline is an idea for your consideration." [puppet] - 10https://gerrit.wikimedia.org/r/966645 (https://phabricator.wikimedia.org/T332672) (owner: 10Hashar) [21:07:31] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:07:45] (03CR) 10BCornwall: [C: 03+1] "Ew" [dns] - 10https://gerrit.wikimedia.org/r/968354 (https://phabricator.wikimedia.org/T349665) (owner: 10Ssingh) [21:08:28] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns5004.wikimedia.org with OS bookworm [21:08:37] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns5004.wikimedia.org with OS bookworm completed: - dns5004 (**PASS**) - Downtimed on Icinga/Al... 
[21:09:08] (03CR) 10Ssingh: [C: 03+2] wikimedia.org: add verification for Jamf [dns] - 10https://gerrit.wikimedia.org/r/968354 (https://phabricator.wikimedia.org/T349665) (owner: 10Ssingh) [21:09:30] !log running authdns-update for CR 968354 [21:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:11:25] 10SRE, 10DNS, 10Traffic: Update DNS for Jamf account SSO - https://phabricator.wikimedia.org/T349665 (10ssingh) 05Open→03Resolved a:03ssingh To reduce the chances of error and for future requests, please copy-paste the requested record in the task (so that it is text) in addition to the screenshot for... [21:11:39] (03PS1) 10BCornwall: Revert "hiera: remove dns5004 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/968315 [21:13:10] (03CR) 10Ssingh: [C: 03+1] Revert "hiera: remove dns5004 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/968315 (owner: 10BCornwall) [21:14:16] (03CR) 10BCornwall: [C: 03+2] Revert "hiera: remove dns5004 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/968315 (owner: 10BCornwall) [21:16:35] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall) [21:37:56] (03PS2) 10Fabfur: hiera: added new cp hosts in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/968301 (https://phabricator.wikimedia.org/T349244) [21:38:44] (03CR) 10Fabfur: hiera: added new cp hosts in eqiad (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/968301 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur) [21:45:14] (03PS1) 10Ebernhardson: search updater: Update container image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/968358 [21:48:04] (03CR) 10Cwhite: "Is there a task to go along with this for discussion?" 
[puppet] - 10https://gerrit.wikimedia.org/r/968250 (owner: 10Jbond) [21:48:32] (03CR) 10Cwhite: [C: 03+1] alertmanager: ship amtool.yml for AM api access [puppet] - 10https://gerrit.wikimedia.org/r/968232 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi) [21:48:46] (03CR) 10Cwhite: [C: 03+1] alertmanager: let karma use apache to access AM [puppet] - 10https://gerrit.wikimedia.org/r/968119 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi) [21:56:01] (03CR) 10Ebernhardson: [C: 03+2] search updater: Update container image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/968358 (owner: 10Ebernhardson) [21:56:51] (03Merged) 10jenkins-bot: search updater: Update container image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/968358 (owner: 10Ebernhardson) [21:58:35] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [21:58:44] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:00:46] 10SRE, 10Traffic, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur) [22:02:07] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:06:06] (03PS1) 10EoghanGaffney: [systemd/timer] Add optional SuccessExitStatus argument to timer services [puppet] - 10https://gerrit.wikimedia.org/r/968360 (https://phabricator.wikimedia.org/T349166) [22:06:08] (03PS1) 10EoghanGaffney: [quickdatacopy] Add success_exit_status option to rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/968361 (https://phabricator.wikimedia.org/T349166) [22:09:51] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver 
https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:15:24] (03PS3) 10Jforrester: [wikifunctions] Alter site to General Availability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966570 (https://phabricator.wikimedia.org/T349054) [22:15:26] (03PS1) 10Jforrester: [wikifunctions] Allow logged-out users to run approved functions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968362 (https://phabricator.wikimedia.org/T349055) [22:20:16] (03CR) 10Andrew Bogott: [C: 03+2] trove-guestagent: include service credentials [puppet] - 10https://gerrit.wikimedia.org/r/968299 (https://phabricator.wikimedia.org/T349651) (owner: 10Andrew Bogott) [22:23:33] (03PS1) 10Andrew Bogott: Trove: allow backups in policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/968363 (https://phabricator.wikimedia.org/T349651) [22:24:19] (03CR) 10Andrew Bogott: [C: 03+2] Trove: allow backups in policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/968363 (https://phabricator.wikimedia.org/T349651) (owner: 10Andrew Bogott) [22:50:23] (03PS13) 10Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) [22:51:56] (03CR) 10CI reject: [V: 04-1] prometheus: Fail Puppet execution for unlisted sites [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse) [22:53:18] (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [22:54:27] PROBLEM - Disk space on Hadoop worker on analytics1075 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/l 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [23:11:55] PROBLEM - Kafka 
MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 1.001e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad [23:40:21] RECOVERY - Disk space on Hadoop worker on analytics1075 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration