[00:07:18] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[00:28:39] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[00:33:11] <icinga-wm>	 PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:19] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f926e15b280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
[00:33:19] <icinga-wm>	 org/wiki/Search%23Administration
[00:34:33] <icinga-wm>	 RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:34:41] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 646, active_shards: 1494, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar
[00:34:41] <icinga-wm>	 umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[00:38:49] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/967913
[00:38:55] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/967913 (owner: TrainBranchBot)
[00:56:32] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/967913 (owner: TrainBranchBot)
[01:00:20] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[01:03:50] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T349576 (phaultfinder)
[01:15:43] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:51:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T0200)
[02:08:07] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.42.0-wmf.2 [core] (wmf/1.42.0-wmf.2) - https://gerrit.wikimedia.org/r/967914 (https://phabricator.wikimedia.org/T348355)
[02:08:13] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.42.0-wmf.2 [core] (wmf/1.42.0-wmf.2) - https://gerrit.wikimedia.org/r/967914 (https://phabricator.wikimedia.org/T348355) (owner: TrainBranchBot)
[02:21:05] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[02:25:31] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/1.42.0-wmf.2 [core] (wmf/1.42.0-wmf.2) - https://gerrit.wikimedia.org/r/967914 (https://phabricator.wikimedia.org/T348355) (owner: TrainBranchBot)
[02:34:51] <icinga-wm>	 PROBLEM - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is CRITICAL: CRITICAL - failed 131 probes of 719 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[02:38:39] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:53:18] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:00:05] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T0300)
[03:02:06] <wikibugs>	 (PS1) TrainBranchBot: testwikis wikis to 1.42.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/968012 (https://phabricator.wikimedia.org/T348355)
[03:02:08] <wikibugs>	 (CR) TrainBranchBot: [C: +2] testwikis wikis to 1.42.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/968012 (https://phabricator.wikimedia.org/T348355) (owner: TrainBranchBot)
[03:02:52] <wikibugs>	 (Merged) jenkins-bot: testwikis wikis to 1.42.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/968012 (https://phabricator.wikimedia.org/T348355) (owner: TrainBranchBot)
[03:03:22] <logmsgbot>	 !log mwpresync@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.2  refs T348355
[03:03:27] <stashbot>	 T348355: 1.42.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T348355
[03:03:39] <jinxer-wm>	 (JobUnavailable) firing: (3) Reduced availability for job ldap in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:06:53] <icinga-wm>	 RECOVERY - IPv6 ping to codfw on ripe-atlas-codfw IPv6 is OK: OK - failed 51 probes of 719 (alerts on 90) - https://atlas.ripe.net/measurements/32390541/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[03:11:37] <wikibugs>	 (PS1) Tim Starling: Increase Lua memory limit to 100MB on Wiktionary only [mediawiki-config] - https://gerrit.wikimedia.org/r/968013 (https://phabricator.wikimedia.org/T165935)
[03:51:15] <logmsgbot>	 !log mwpresync@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.2  refs T348355 (duration: 47m 53s)
[03:51:20] <stashbot>	 T348355: 1.42.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T348355
[04:07:18] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[04:45:46] <wikibugs>	 SRE, Math, RESTBase-API, Wikimedia-production-error: "Math extension cannot connect to Restbase." error in Wikimedia projects - https://phabricator.wikimedia.org/T343648 (Physikerwelt) >>! In T343648#9274377, @matmarex wrote: > I'm finding it hard to believe, as the rates of errors I linked in T3...
[05:00:20] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[05:20:58] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 39180
[05:22:00] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 39180
[05:23:43] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:23:45] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:24:17] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 145, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:28:01] <wikibugs>	 SRE, Math, RESTBase-API, Wikimedia-production-error: "Math extension cannot connect to Restbase." error in Wikimedia projects - https://phabricator.wikimedia.org/T343648 (daniel) >>! In T343648#9274943, @Physikerwelt wrote: > @daniel do you think it is possible that the number of retries has chan...
[05:36:27] <icinga-wm>	 PROBLEM - Ubuntu mirror in sync with upstream on mirror1001 is CRITICAL: /srv/mirrors/ubuntu is over 14 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[05:41:35] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: Replace db1127 with db1227 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967907 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[05:43:14] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1228 [puppet] - https://gerrit.wikimedia.org/r/968017
[05:43:32] <wikibugs>	 SRE, Math, RESTBase-API, Wikimedia-production-error: "Math extension cannot connect to Restbase." error in Wikimedia projects - https://phabricator.wikimedia.org/T343648 (daniel) The error handling in MathRestbaseInterface::evaluateRestbaseCheckresponse looks like this: `...
[05:44:09] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1228 [puppet] - https://gerrit.wikimedia.org/r/968017 (owner: Marostegui)
[05:44:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 4.920089664641323s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[05:49:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 5.625917861089405s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[06:00:06] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T0600)
[06:00:06] <jouncebot>	 kormat, marostegui, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T0600).
[06:02:11] <icinga-wm>	 PROBLEM - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[06:04:26] <wikibugs>	 (PS1) Marostegui: pc1016: Install 10.6 [puppet] - https://gerrit.wikimedia.org/r/968018
[06:05:53] <wikibugs>	 (CR) Marostegui: [C: +2] pc1016: Install 10.6 [puppet] - https://gerrit.wikimedia.org/r/968018 (owner: Marostegui)
[06:08:23] <icinga-wm>	 RECOVERY - CirrusSearch more_like codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[06:23:15] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:23:23] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:25:11] <wikibugs>	 (PS1) Marostegui: pc1015: Move it to new pc4 [puppet] - https://gerrit.wikimedia.org/r/968059
[06:26:04] <wikibugs>	 (CR) Marostegui: [C: +2] pc1015: Move it to new pc4 [puppet] - https://gerrit.wikimedia.org/r/968059 (owner: Marostegui)
[06:27:13] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:27:21] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:28:21] <wikibugs>	 (PS1) Marostegui: pc2016: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/968107
[06:29:37] <wikibugs>	 (CR) Marostegui: [C: +2] pc2016: Migrate to 10.6 [puppet] - https://gerrit.wikimedia.org/r/968107 (owner: Marostegui)
[06:31:30] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Announce internal/core routes from CRs to L3 switches - https://phabricator.wikimedia.org/T344547 (ayounsi) My understanding is that the LVS term is a fix for that one issue (now that I'm aware of T348446). If we start adding `anycast` then we need to add all...
[06:32:08] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 15435
[06:32:34] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 15435
[06:33:50] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.tls for network device lsw1-e8-eqiad
[06:33:50] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device lsw1-e8-eqiad
[06:40:33] <wikibugs>	 (CR) Filippo Giunchedi: modules: cleanup last dispatch renmants (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967952 (https://phabricator.wikimedia.org/T344937) (owner: Filippo Giunchedi)
[06:42:24] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.dns.netbox
[06:42:45] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:43:23] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 146, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:43:43] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:44:28] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ipv6 from pc1015 - marostegui@cumin1001"
[06:45:35] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Remove ipv6 from pc1015 - marostegui@cumin1001"
[06:45:35] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:46:55] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 145, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:47:17] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:47:27] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:47:53] <wikibugs>	 (PS1) Filippo Giunchedi: data-engineering: fix deploy-tag for skein cert expiry [alerts] - https://gerrit.wikimedia.org/r/968112 (https://phabricator.wikimedia.org/T329398)
[06:53:18] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[06:53:34] <wikibugs>	 (PS4) Giuseppe Lavagetto: mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025)
[06:53:59] <wikibugs>	 (PS1) Marostegui: install_server: Do not reimage db1227 [puppet] - https://gerrit.wikimedia.org/r/968113
[06:54:27] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 146, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:54:32] <godog>	 !log +50G to prometheus/analytics in eqiad
[06:54:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:54:46] <wikibugs>	 (CR) Marostegui: [C: +2] install_server: Do not reimage db1227 [puppet] - https://gerrit.wikimedia.org/r/968113 (owner: Marostegui)
[06:54:47] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:54:59] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:57:07] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] data-engineering: fix deploy-tag for skein cert expiry [alerts] - https://gerrit.wikimedia.org/r/968112 (https://phabricator.wikimedia.org/T329398) (owner: Filippo Giunchedi)
[06:58:42] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: provisionning db1227 - T344036
[06:58:45] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1127.eqiad.wmnet with reason: provisionning db1227 - T344036
[06:58:46] <stashbot>	 T344036: Productionize db12[26-49] - https://phabricator.wikimedia.org/T344036
[06:58:47] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: provisionning db1227 - T344036
[06:58:50] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: provisionning db1227 - T344036
[06:59:45] <wikibugs>	 (CR) Arnaudb: [C: +2] mariadb: Replace db1127 with db1227 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967907 (https://phabricator.wikimedia.org/T344036) (owner: Arnaudb)
[07:00:04] <jouncebot>	 Amir1, Urbanecm, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T0700).
[07:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[07:00:35] <wikibugs>	 (PS1) Ayounsi: Remove Coherence report check [puppet] - https://gerrit.wikimedia.org/r/968114
[07:02:12] <wikibugs>	 (PS5) Giuseppe Lavagetto: mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025)
[07:06:09] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:06:47] <wikibugs>	 (PS1) Ayounsi: Ignore missing VM from PuppetDB with a tenant [software/netbox-extras] - https://gerrit.wikimedia.org/r/968116
[07:07:13] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:08:12] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.mysql.clone of db1127.eqiad.wmnet onto db1227.eqiad.wmnet
[07:10:01] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 145, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:10:25] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:10:41] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:11:49] <icinga-wm>	 RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:12:15] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 146, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:16:15] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:26:53] <icinga-wm>	 RECOVERY - Ubuntu mirror in sync with upstream on mirror1001 is OK: /srv/mirrors/ubuntu is over 1 hours old. https://wikitech.wikimedia.org/wiki/Mirrors
[07:26:57] <wikibugs>	 (PS1) Filippo Giunchedi: alertmanager: let karma use apache to access AM [puppet] - https://gerrit.wikimedia.org/r/968119 (https://phabricator.wikimedia.org/T321579)
[07:27:35] <arnaudb>	 !log repool db2109
[07:27:38] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:46] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P53030 and previous config saved to /var/cache/conftool/dbconfig/20231024-072745-arnaudb.json
[07:28:59] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 44, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:29:32] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 145, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:34:45] <wikibugs>	 (PS1) Arnaudb: mariadb: re-enable db2109 [puppet] - https://gerrit.wikimedia.org/r/967915 (https://phabricator.wikimedia.org/T347318)
[07:38:22] <wikibugs>	 (CR) Marostegui: mariadb: re-enable db2109 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967915 (https://phabricator.wikimedia.org/T347318) (owner: Arnaudb)
[07:41:14] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 146, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:41:56] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[07:42:29] <wikibugs>	 (PS2) Arnaudb: mariadb: re-enable db2109 [puppet] - https://gerrit.wikimedia.org/r/967915 (https://phabricator.wikimedia.org/T347318)
[07:42:51] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 20%: Maint over', diff saved to https://phabricator.wikimedia.org/P53031 and previous config saved to /var/cache/conftool/dbconfig/20231024-074250-arnaudb.json
[07:42:57] <wikibugs>	 (CR) Marostegui: [C: +1] mariadb: re-enable db2109 [puppet] - https://gerrit.wikimedia.org/r/967915 (https://phabricator.wikimedia.org/T347318) (owner: Arnaudb)
[07:49:00] <wikibugs>	 SRE, ops-eqiad: Add test server to rack E8 - https://phabricator.wikimedia.org/T349168 (ayounsi) Thanks!
[07:56:54] <wikibugs>	 (CR) Arnaudb: [C: +2] mariadb: re-enable db2109 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967915 (https://phabricator.wikimedia.org/T347318) (owner: Arnaudb)
[07:57:56] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 30%: Maint over', diff saved to https://phabricator.wikimedia.org/P53032 and previous config saved to /var/cache/conftool/dbconfig/20231024-075755-arnaudb.json
[08:07:18] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[08:07:27] <wikibugs>	 (PS1) DDesouza: miscweb: update research-landing-page [deployment-charts] - https://gerrit.wikimedia.org/r/968225 (https://phabricator.wikimedia.org/T219903)
[08:09:50] <icinga-wm>	 PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:09:56] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:13:00] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 40%: Maint over', diff saved to https://phabricator.wikimedia.org/P53033 and previous config saved to /var/cache/conftool/dbconfig/20231024-081300-arnaudb.json
[08:28:05] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 50%: Maint over', diff saved to https://phabricator.wikimedia.org/P53034 and previous config saved to /var/cache/conftool/dbconfig/20231024-082805-arnaudb.json
[08:29:03] <wikibugs>	 SRE-swift-storage, Commons, MediaWiki-Uploading, Unstewarded-production-error, Wikimedia-production-error: An unknown error occurred in storage backend "local-swift-eqiad" - https://phabricator.wikimedia.org/T341007 (Beao) Sounds reasonable. I was also able to purge the remaining preview imag...
[08:33:23] <icinga-wm>	 PROBLEM - SSH on wdqs1024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:33:59] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1127.eqiad.wmnet onto db1227.eqiad.wmnet
[08:43:10] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 60%: Maint over', diff saved to https://phabricator.wikimedia.org/P53035 and previous config saved to /var/cache/conftool/dbconfig/20231024-084310-arnaudb.json
[08:52:24] <wikibugs>	 (PS2) Majavah: wmnet: drop cloudmetrics CNAMEs [dns] - https://gerrit.wikimedia.org/r/967898 (https://phabricator.wikimedia.org/T326266)
[08:52:35] <wikibugs>	 (CR) Majavah: [C: +2] wmnet: drop cloudmetrics CNAMEs [dns] - https://gerrit.wikimedia.org/r/967898 (https://phabricator.wikimedia.org/T326266) (owner: Majavah)
[08:53:09] <wikibugs>	 ops-eqiad, serviceops: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 (elukey)
[08:56:27] <wikibugs>	 ops-eqiad, serviceops: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 (hashar)
[08:58:05] <wikibugs>	 (CR) JMeybohm: [C: +2] mw-on-k8s: Globally enable certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/967940 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[08:58:15] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 70%: Maint over', diff saved to https://phabricator.wikimedia.org/P53036 and previous config saved to /var/cache/conftool/dbconfig/20231024-085815-arnaudb.json
[09:00:05] <wikibugs>	 ops-eqiad, serviceops: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 (hashar)
[09:00:09] <wikibugs>	 SRE, ops-eqiad, DBA: eqiad: move non WMCS servers out of rack C8 - https://phabricator.wikimedia.org/T308339 (hashar)
[09:00:17] <wikibugs>	 (Merged) jenkins-bot: mw-on-k8s: Globally enable certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/967940 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[09:00:20] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[09:00:27] <wikibugs>	 ops-eqiad, serviceops: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 (hashar)
[09:00:55] <wikibugs>	 (PS1) Kevin Bazira: ml-services: update recommendation-api-ng image [deployment-charts] - https://gerrit.wikimedia.org/r/967917 (https://phabricator.wikimedia.org/T348607)
[09:01:45] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1024 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:01:53] <wikibugs>	 (CR) David Caro: [C: +2] openstack: add antelope to the tests [puppet] - https://gerrit.wikimedia.org/r/967934 (owner: David Caro)
[09:03:36] <logmsgbot>	 !log taavi@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host deploy1002
[09:03:43] <jinxer-wm>	 (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:04:48] <logmsgbot>	 !log taavi@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host deploy1002
[09:04:49] <icinga-wm>	 RECOVERY - Host deploy1002 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms
[09:05:51] <icinga-wm>	 PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service,rsync-deployment_module.service,rsync-patches_module.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:06:35] <XioNoX>	 taavi: eh, was in the middle of doing the same with homer
[09:06:39] <jinxer-wm>	 (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[09:09:03] <taavi>	 XioNoX: sorry, I saw releng asking about it on -serviceops and connected it to the move yesterday, homer was showing an unrelated diff with some cp nodes so I figured I'd use the cookbook instead
[09:09:16] <XioNoX>	 taavi: yep you did right
[09:09:45] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on deploy1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:10:04] <wikibugs>	 ops-eqiad, serviceops: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587 (taavi) The host was moved in {T308339} but the switch ports were not updated. I ran `sre.network.configure-switch-interfaces` to configure the port as Homer was showing an unrelated diff. That does mean that t...
[09:11:50] <taavi>	 !log restart ferm on deploy1002 T349587
[09:11:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:54] <stashbot>	 T349587: deploy1002 lost connectivity - https://phabricator.wikimedia.org/T349587
[09:13:20] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 80%: Maint over', diff saved to https://phabricator.wikimedia.org/P53037 and previous config saved to /var/cache/conftool/dbconfig/20231024-091319-arnaudb.json
[09:16:01] <wikibugs>	 (PS1) Filippo Giunchedi: alertmanager: also allow local access to the API [puppet] - https://gerrit.wikimedia.org/r/968231 (https://phabricator.wikimedia.org/T321579)
[09:16:03] <wikibugs>	 (PS1) Filippo Giunchedi: alertmanager: ship amtool.yml for AM api access [puppet] - https://gerrit.wikimedia.org/r/968232 (https://phabricator.wikimedia.org/T321579)
[09:16:25] <vgutierrez>	 !log upload golang-github-florianl-go-tc  to apt.wm.o (bookworm) - T348837
[09:16:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:30] <stashbot>	 T348837: Investigate IPVS IPIP encapsulation support - https://phabricator.wikimedia.org/T348837
[09:16:40] <wikibugs>	 (CR) CI reject: [V: -1] alertmanager: ship amtool.yml for AM api access [puppet] - https://gerrit.wikimedia.org/r/968232 (https://phabricator.wikimedia.org/T321579) (owner: Filippo Giunchedi)
[09:18:04] <wikibugs>	 (PS2) Filippo Giunchedi: alertmanager: ship amtool.yml for AM api access [puppet] - https://gerrit.wikimedia.org/r/968232 (https://phabricator.wikimedia.org/T321579)
[09:19:42] <wikibugs>	 (CR) Filippo Giunchedi: "https://puppet-compiler.wmflabs.org/output/968119/150/alert1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/968119 (https://phabricator.wikimedia.org/T321579) (owner: Filippo Giunchedi)
[09:20:10] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] alertmanager: also allow local access to the API [puppet] - https://gerrit.wikimedia.org/r/968231 (https://phabricator.wikimedia.org/T321579) (owner: Filippo Giunchedi)
[09:22:17] <wikibugs>	 (CR) Filippo Giunchedi: "https://puppet-compiler.wmflabs.org/output/968232/151/" [puppet] - https://gerrit.wikimedia.org/r/968232 (https://phabricator.wikimedia.org/T321579) (owner: Filippo Giunchedi)
[09:23:43] <jinxer-wm>	 (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1024:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:23:45] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:23:53] <icinga-wm>	 RECOVERY - SSH on wdqs1024 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:25:48] <wikibugs>	 (CR) Filippo Giunchedi: "This will work, however I'd rather not maintain yet another list of datacenters in remote_syslog_tls. Since the functionality to add per-d" [puppet] - https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: Andrea Denisse)
[09:28:05] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/967951 (owner: Muehlenhoff)
[09:28:25] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 90%: Maint over', diff saved to https://phabricator.wikimedia.org/P53038 and previous config saved to /var/cache/conftool/dbconfig/20231024-092824-arnaudb.json
[09:28:32] <wikibugs>	 (CR) Jbond: [C: +1] idp::memcached Avoid Ferm-specific syntax (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967951 (owner: Muehlenhoff)
[09:28:59] <wikibugs>	 (CR) Elukey: [C: +1] ml-services: update recommendation-api-ng image [deployment-charts] - https://gerrit.wikimedia.org/r/967917 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[09:32:35] <wikibugs>	 (PS1) Jbond: cas: improve error messages [puppet] - https://gerrit.wikimedia.org/r/968235
[09:33:27] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/968114 (owner: Ayounsi)
[09:33:57] <wikibugs>	 (CR) Jbond: [C: +1] Ignore missing VM from PuppetDB with a tenant [software/netbox-extras] - https://gerrit.wikimedia.org/r/968116 (owner: Ayounsi)
[09:34:55] <wikibugs>	 (PS17) Fabfur: haproxy: enable healthcheck-dedicated backend [puppet] - https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851)
[09:36:53] <logmsgbot>	 !log jnuche@deploy2002 Started scap: testwikis wikis to 1.42.0-wmf.2  refs T348355
[09:36:58] <stashbot>	 T348355: 1.42.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T348355
[09:37:11] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: enable thanos upload for cloud instance [puppet] - https://gerrit.wikimedia.org/r/968238 (https://phabricator.wikimedia.org/T336854)
[09:37:13] <wikibugs>	 (PS1) Filippo Giunchedi: prometheus: enable pint for 'cloud' instance [puppet] - https://gerrit.wikimedia.org/r/968239 (https://phabricator.wikimedia.org/T336854)
[09:37:32] <wikibugs>	 (CR) Jbond: [V: +1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/153/console" [puppet] - https://gerrit.wikimedia.org/r/968235 (owner: Jbond)
[09:37:57] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] cas: improve error messages [puppet] - https://gerrit.wikimedia.org/r/968235 (owner: Jbond)
[09:38:32] <wikibugs>	 (CR) Ayounsi: [C: +2] Ignore missing VM from PuppetDB with a tenant [software/netbox-extras] - https://gerrit.wikimedia.org/r/968116 (owner: Ayounsi)
[09:38:53] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] graphite: migrate configparse to new puppet API [puppet] - https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) (owner: Jbond)
[09:39:04] <wikibugs>	 (CR) Jbond: [V: +1 C: +2] graphite: migrate configparse to new puppet API (1 comment) [puppet] - https://gerrit.wikimedia.org/r/967474 (https://phabricator.wikimedia.org/T349291) (owner: Jbond)
[09:39:09] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[09:39:15] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[09:40:04] <wikibugs>	 (CR) Hashar: Add a json representation of the build (2 comments) [software/puppet-compiler] (2.x) - https://gerrit.wikimedia.org/r/967407 (owner: Hashar)
[09:40:07] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on deploy1002 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[09:40:21] <wikibugs>	 (PS1) Majavah: P:wmcs::metricsinfra: update karma config to match alerts.wm.o [puppet] - https://gerrit.wikimedia.org/r/968240
[09:40:56] <jnuche>	 ^^ in case anyone is wondering, train presync failed last night, so I'm rerunning
[09:41:04] <wikibugs>	 (CR) Ayounsi: [C: +2] Remove Coherence report check [puppet] - https://gerrit.wikimedia.org/r/968114 (owner: Ayounsi)
[09:43:11] <wikibugs>	 (CR) Kevin Bazira: [C: +2] ml-services: update recommendation-api-ng image [deployment-charts] - https://gerrit.wikimedia.org/r/967917 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[09:43:13] <wikibugs>	 (PS5) Hashar: Add a json representation of the build [software/puppet-compiler] (2.x) - https://gerrit.wikimedia.org/r/967407
[09:43:30] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'db2109 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P53039 and previous config saved to /var/cache/conftool/dbconfig/20231024-094329-arnaudb.json
[09:43:47] <wikibugs>	 (Merged) jenkins-bot: ml-services: update recommendation-api-ng image [deployment-charts] - https://gerrit.wikimedia.org/r/967917 (https://phabricator.wikimedia.org/T348607) (owner: Kevin Bazira)
[09:44:47] <wikibugs>	 (CR) Hashar: "https://gerrit.wikimedia.org/r/c/operations/software/puppet-compiler/+/967407/4..5" [software/puppet-compiler] (2.x) - https://gerrit.wikimedia.org/r/967407 (owner: Hashar)
[09:45:24] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-staging-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[09:46:27] <wikibugs>	 (CR) CI reject: [V: -1] Add a json representation of the build [software/puppet-compiler] (2.x) - https://gerrit.wikimedia.org/r/967407 (owner: Hashar)
[09:48:41] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[09:49:17] <logmsgbot>	 !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] 'sync' command on namespace 'recommendation-api-ng' for release 'main' .
[09:49:29] <wikibugs>	 (CR) Jelto: [C: +2] "lgtm" [deployment-charts] - https://gerrit.wikimedia.org/r/968225 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[09:50:05] <wikibugs>	 (CR) Jbond: "also adding jesses who is another good resource for general puppet reviews" [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[09:50:19] <wikibugs>	 (Merged) jenkins-bot: miscweb: update research-landing-page [deployment-charts] - https://gerrit.wikimedia.org/r/968225 (https://phabricator.wikimedia.org/T219903) (owner: DDesouza)
[09:55:49] <wikibugs>	 (PS6) Hashar: Add a json representation of the build [software/puppet-compiler] (2.x) - https://gerrit.wikimedia.org/r/967407
[09:57:18] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] P:wmcs::metricsinfra: update karma config to match alerts.wm.o [puppet] - https://gerrit.wikimedia.org/r/968240 (owner: Majavah)
[09:59:02] <wikibugs>	 (CR) CI reject: [V: -1] Add a json representation of the build [software/puppet-compiler] (2.x) - https://gerrit.wikimedia.org/r/967407 (owner: Hashar)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T1000)
[10:01:03] <icinga-wm>	 RECOVERY - Check systemd state on deploy1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:02:20] <logmsgbot>	 !log jnuche@deploy2002 Finished scap: testwikis wikis to 1.42.0-wmf.2  refs T348355 (duration: 25m 27s)
[10:02:25] <stashbot>	 T348355: 1.42.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T348355
[10:04:11] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: s2 on dbstore1007 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:04:31] <logmsgbot>	 !log jnuche@deploy2002 Pruned MediaWiki: 1.41.0-wmf.30 (duration: 02m 08s)
[10:04:35] <icinga-wm>	 PROBLEM - mysqld processes on dbstore1007 is CRITICAL: PROCS CRITICAL: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[10:06:10] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] Update recommendation-api to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/967406 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[10:07:03] <wikibugs>	 (Merged) jenkins-bot: Update recommendation-api to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/967406 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[10:07:57] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: apply
[10:08:33] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: apply
[10:10:33] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/recommendation-api: apply
[10:10:50] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply
[10:13:22] <wikibugs>	 (PS6) Giuseppe Lavagetto: mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025)
[10:14:00] <wikibugs>	 (CR) Majavah: [C: +2] P:wmcs::metricsinfra: update karma config to match alerts.wm.o [puppet] - https://gerrit.wikimedia.org/r/968240 (owner: Majavah)
[10:14:23] <icinga-wm>	 PROBLEM - MariaDB Replica Lag: s2 on dbstore1007 is CRITICAL: CRITICAL slave_sql_lag could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:14:45] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/recommendation-api: apply
[10:15:15] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: apply
[10:18:56] <wikibugs>	 (PS1) Majavah: P:wmcs::metricsinfra: fix karma config [puppet] - https://gerrit.wikimedia.org/r/968245
[10:23:49] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/154/con" [puppet] - https://gerrit.wikimedia.org/r/968238 (https://phabricator.wikimedia.org/T336854) (owner: Filippo Giunchedi)
[10:26:23] <wikibugs>	 (CR) Majavah: [V: +1 C: +1] "thanks" [puppet] - https://gerrit.wikimedia.org/r/968238 (https://phabricator.wikimedia.org/T336854) (owner: Filippo Giunchedi)
[10:26:37] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply
[10:26:55] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/recommendation-api: apply
[10:27:09] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply
[10:27:22] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: apply
[10:27:28] <wikibugs>	 (CR) Majavah: "What would enabling pint mean in practice for us? We get alerts on some types of issues in the alert rules?" [puppet] - https://gerrit.wikimedia.org/r/968239 (https://phabricator.wikimedia.org/T336854) (owner: Filippo Giunchedi)
[10:29:09] <wikibugs>	 (CR) Samtar: [C: +1] Increase Lua memory limit to 100MB on Wiktionary only [mediawiki-config] - https://gerrit.wikimedia.org/r/968013 (https://phabricator.wikimedia.org/T165935) (owner: Tim Starling)
[10:30:25] <_joe_>	 jouncebot: next
[10:30:26] <jouncebot>	 In 1 hour(s) and 29 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T1200)
[10:30:32] <_joe_>	 jouncebot: now
[10:30:32] <jouncebot>	 For the next 0 hour(s) and 29 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T1000)
[10:31:05] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025) (owner: Giuseppe Lavagetto)
[10:32:19] <wikibugs>	 (Merged) jenkins-bot: mediawiki: add support for a prometheus-statsd-exporter [deployment-charts] - https://gerrit.wikimedia.org/r/960568 (https://phabricator.wikimedia.org/T343025) (owner: Giuseppe Lavagetto)
[10:37:50] <wikibugs>	 (CR) Effie Mouzeli: [C: +2] Update shellbox to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/967410 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[10:38:44] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests, User-RhinosF1: Volunteer NDA for RhinosF1 - https://phabricator.wikimedia.org/T341272 (Aklapper) > Was this done @KFrancis ?  Let's move the general NDA workflow discussion to {T349595}, as this specific request for RhinosF1 is resolved. Thanks!
[10:38:54] <wikibugs>	 (Merged) jenkins-bot: Update shellbox to use certmanager certs [deployment-charts] - https://gerrit.wikimedia.org/r/967410 (https://phabricator.wikimedia.org/T300033) (owner: JMeybohm)
[10:39:23] <logmsgbot>	 !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[10:39:48] <wikibugs>	 (CR) Jbond: "overall looks good, most comments are style nits but there is one error around the use of site_name which is not getting based to the core " [puppet] - https://gerrit.wikimedia.org/r/967258 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[10:40:10] <logmsgbot>	 !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[10:41:41] <logmsgbot>	 !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[10:42:02] <logmsgbot>	 !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[10:42:32] <wikibugs>	 (CR) Jbond: [C: +1] "lgtm" [labs/private] - https://gerrit.wikimedia.org/r/967519 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[10:42:59] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 0:30:00 on an-test-client1002.eqiad.wmnet with reason: Cold booting with ganeti to increase RAM
[10:43:24] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on an-test-client1002.eqiad.wmnet with reason: Cold booting with ganeti to increase RAM
[10:43:47] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/shellbox: apply
[10:44:15] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[10:46:44] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[10:47:39] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[10:49:23] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:53:10] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox: apply
[10:53:11] <wikibugs>	 (PS1) Giuseppe Lavagetto: mesh: add new minor for configuration [deployment-charts] - https://gerrit.wikimedia.org/r/968247
[10:53:13] <wikibugs>	 (PS1) Giuseppe Lavagetto: mesh: fix parameter [deployment-charts] - https://gerrit.wikimedia.org/r/968248
[10:53:18] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[10:54:01] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[10:54:29] <wikibugs>	 (PS1) Giuseppe Lavagetto: mediawiki: update to mesh.configuration:1.4.3 [deployment-charts] - https://gerrit.wikimedia.org/r/968249
[10:54:38] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mesh: add new minor for configuration [deployment-charts] - https://gerrit.wikimedia.org/r/968247 (owner: Giuseppe Lavagetto)
[10:55:25] <wikibugs>	 (Merged) jenkins-bot: mesh: add new minor for configuration [deployment-charts] - https://gerrit.wikimedia.org/r/968247 (owner: Giuseppe Lavagetto)
[10:56:00] <wikibugs>	 (PS1) Jbond: prometheus: realise blackbox::check's instantly on prometheus hosts [puppet] - https://gerrit.wikimedia.org/r/968250
[10:56:54] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[10:57:09] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[10:57:34] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[10:57:54] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[10:57:55] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[10:58:11] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[10:59:14] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-media: apply
[10:59:28] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[11:00:16] <wikibugs>	 (CR) JMeybohm: [C: +1] mesh: fix parameter [deployment-charts] - https://gerrit.wikimedia.org/r/968248 (owner: Giuseppe Lavagetto)
[11:00:20] <wikibugs>	 (CR) JMeybohm: [C: +1] mediawiki: update to mesh.configuration:1.4.3 [deployment-charts] - https://gerrit.wikimedia.org/r/968249 (owner: Giuseppe Lavagetto)
[11:01:01] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:01:24] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mesh: fix parameter [deployment-charts] - https://gerrit.wikimedia.org/r/968248 (owner: Giuseppe Lavagetto)
[11:03:14] <wikibugs>	 (Merged) jenkins-bot: mesh: fix parameter [deployment-charts] - https://gerrit.wikimedia.org/r/968248 (owner: Giuseppe Lavagetto)
[11:03:29] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] mediawiki: update to mesh.configuration:1.4.3 [deployment-charts] - https://gerrit.wikimedia.org/r/968249 (owner: Giuseppe Lavagetto)
[11:03:48] <wikibugs>	 (CR) Jbond: "I'm not sure this is currently useful right now but it replicates what we do in `monitoring::service` and came about from a different revi" [puppet] - https://gerrit.wikimedia.org/r/968250 (owner: Jbond)
[11:04:38] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[11:04:45] <wikibugs>	 (Merged) jenkins-bot: mediawiki: update to mesh.configuration:1.4.3 [deployment-charts] - https://gerrit.wikimedia.org/r/968249 (owner: Giuseppe Lavagetto)
[11:04:59] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[11:05:01] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[11:05:17] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[11:07:51] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[11:07:53] <wikibugs>	 (CR) Jbond: "you also need to tox -e py3-format" [software/puppet-compiler] (2.x) - https://gerrit.wikimedia.org/r/967407 (owner: Hashar)
[11:08:07] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[11:08:28] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply
[11:08:52] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[11:08:53] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[11:09:17] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[11:11:29] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply
[11:11:51] <logmsgbot>	 !log jiji@deploy2002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply
[11:12:18] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply
[11:12:45] <logmsgbot>	 !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply
[11:12:46] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply
[11:13:10] <logmsgbot>	 !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply
[11:15:24] <logmsgbot>	 !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:16:34] <logmsgbot>	 !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:17:03] <logmsgbot>	 !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[11:17:21] <logmsgbot>	 !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[11:26:43] <icinga-wm>	 RECOVERY - mysqld processes on dbstore1007 is OK: PROCS OK: 3 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting
[11:27:19] <icinga-wm>	 RECOVERY - MariaDB Replica Lag: s2 on dbstore1007 is OK: OK slave_sql_lag not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[11:27:37] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: s2 on dbstore1007 is OK: OK slave_sql_state not a slave https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[12:00:05] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T1200)
[12:07:18] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[12:12:33] <wikibugs>	 (PS1) Brouberol: Fix: remove duplicate /topicmappr path in zk URL and disable compression [puppet] - https://gerrit.wikimedia.org/r/968257
[12:34:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 3.5258375591962716s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:35:25] <wikibugs>	 (PS2) Brouberol: Fix: remove duplicate /topicmappr path in zk URL [puppet] - https://gerrit.wikimedia.org/r/968257
[12:39:15] <wikibugs>	 (PS2) Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - https://gerrit.wikimedia.org/r/967935
[12:39:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.13681268189364s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:41:27] <jbond>	 !log migrate idp_test to puppet7
[12:41:30] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:42:39] <wikibugs>	 (CR) Brouberol: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/155/con" [puppet] - https://gerrit.wikimedia.org/r/968257 (owner: Brouberol)
[12:43:26] <wikibugs>	 (PS3) Brouberol: Fix: remove duplicate /topicmappr path in zk URL [puppet] - https://gerrit.wikimedia.org/r/968257
[12:43:35] <wikibugs>	 (PS4) Brouberol: Fix: remove duplicate /topicmappr path in zk URL [puppet] - https://gerrit.wikimedia.org/r/968257
[12:44:03] <wikibugs>	 (PS3) Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739)
[12:44:19] <wikibugs>	 (PS1) Jbond: idp_test: migrate to puppet7 [puppet] - https://gerrit.wikimedia.org/r/968258 (https://phabricator.wikimedia.org/T340739)
[12:44:34] <wikibugs>	 (PS5) Brouberol: Fix: remove duplicate /topicmappr path in zk URL [puppet] - https://gerrit.wikimedia.org/r/968257
[12:45:12] <wikibugs>	 (CR) Hashar: Add a json representation of the build (1 comment) [software/puppet-compiler] (2.x) - https://gerrit.wikimedia.org/r/967407 (owner: Hashar)
[12:46:17] <wikibugs>	 (PS7) Hashar: Add a json representation of the build [software/puppet-compiler] (2.x) - https://gerrit.wikimedia.org/r/967407
[12:48:19] <wikibugs>	 (CR) Brouberol: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/156/con" [puppet] - https://gerrit.wikimedia.org/r/968257 (owner: Brouberol)
[12:48:25] <wikibugs>	 (CR) Jbond: [C: +2] idp_test: migrate to puppet7 [puppet] - https://gerrit.wikimedia.org/r/968258 (https://phabricator.wikimedia.org/T340739) (owner: Jbond)
[12:48:53] <wikibugs>	 (PS6) Brouberol: Fix: remove duplicate /topicmappr path in zk URL [puppet] - https://gerrit.wikimedia.org/r/968257
[12:50:20] <wikibugs>	 (PS1) Tsevener: Add stream config for iOS schema [mediawiki-config] - https://gerrit.wikimedia.org/r/968259 (https://phabricator.wikimedia.org/T347122)
[12:50:28] <wikibugs>	 (CR) Brouberol: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/157/con" [puppet] - https://gerrit.wikimedia.org/r/968257 (owner: Brouberol)
[12:55:11] <wikibugs>	 (PS4) Samtar: InitialiseSettings-labs: Set values for renamed PageTriage variable [mediawiki-config] - https://gerrit.wikimedia.org/r/965395 (https://phabricator.wikimedia.org/T331595) (owner: MPGuy2824)
[12:56:30] <wikibugs>	 (CR) Samtar: [C: +2] "beta-only change, +2ing prior to window" [mediawiki-config] - https://gerrit.wikimedia.org/r/965395 (https://phabricator.wikimedia.org/T331595) (owner: MPGuy2824)
[12:57:06] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "doh!" [puppet] - https://gerrit.wikimedia.org/r/968245 (owner: Majavah)
[12:57:16] <wikibugs>	 (CR) Majavah: [C: +2] P:wmcs::metricsinfra: fix karma config [puppet] - https://gerrit.wikimedia.org/r/968245 (owner: Majavah)
[12:57:48] <wikibugs>	 (Merged) jenkins-bot: InitialiseSettings-labs: Set values for renamed PageTriage variable [mediawiki-config] - https://gerrit.wikimedia.org/r/965395 (https://phabricator.wikimedia.org/T331595) (owner: MPGuy2824)
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T1300).
[13:00:05] <jouncebot>	 JSherman, TheresNoTime, and dcausse: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:08] * TheresNoTime can deploy
[13:00:18] <dcausse>	 o/
[13:00:20] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:00:26] <TheresNoTime>	 JSherman: I've already +2'd your change as it was beta-only, so it should be live on beta in a few minutes
[13:01:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968013 (https://phabricator.wikimedia.org/T165935) (owner: 10Tim Starling)
[13:01:55] <JSherman>	 thanks!
[13:02:07] <wikibugs>	 (03Merged) 10jenkins-bot: Increase Lua memory limit to 100MB on Wiktionary only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968013 (https://phabricator.wikimedia.org/T165935) (owner: 10Tim Starling)
[13:03:00] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:968013|Increase Lua memory limit to 100MB on Wiktionary only (T165935)]]
[13:03:17] <stashbot>	 T165935: "Lua error: not enough memory" on certain en.wiktionary pages - https://phabricator.wikimedia.org/T165935
[13:04:28] <logmsgbot>	 !log samtar@deploy2002 samtar and tstarling: Backport for [[gerrit:968013|Increase Lua memory limit to 100MB on Wiktionary only (T165935)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:04:31] * TheresNoTime testing
[13:05:45] <logmsgbot>	 !log samtar@deploy2002 samtar and tstarling: Continuing with sync
[13:06:29] <wikibugs>	 (03PS7) 10Samtar: cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse)
[13:06:38] <wikibugs>	 (03PS7) 10Samtar: cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse)
[13:06:40] <jinxer-wm>	 (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[13:10:10] <toni_>	 hi all, FYI I added my patch to the deploy calendar right as this window was starting, hope that's okay
[13:10:51] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:968013|Increase Lua memory limit to 100MB on Wiktionary only (T165935)]] (duration: 07m 51s)
[13:10:56] <stashbot>	 T165935: "Lua error: not enough memory" on certain en.wiktionary pages - https://phabricator.wikimedia.org/T165935
[13:11:19] <wikibugs>	 10Puppet, 10SRE, 10Infrastructure-Foundations, 10netbox, and 3 others: Netbox: use the netbox to also sync networks and network devices - https://phabricator.wikimedia.org/T329272 (10ayounsi) Following up a bit on that, I think the easiest next step is to also add `$netbox_infra_devices = lookup('profile:...
[13:12:32] <TheresNoTime>	 (just checking T349612 was indeed only a temporary bump..)
[13:12:32] <stashbot>	 T349612: LuaSandboxMemoryError: not enough memory - https://phabricator.wikimedia.org/T349612
[13:13:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse)
[13:13:25] <TheresNoTime>	 proceeding with your patches dcausse 
[13:13:33] <dcausse>	 TheresNoTime: thanks!
[13:13:35] <TheresNoTime>	 toni_: (ack, that's fine!)
[13:14:02] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957726 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse)
[13:14:25] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:957726|cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream (T325565)]]
[13:14:30] <stashbot>	 T325565: Add support for page re-renders - https://phabricator.wikimedia.org/T325565
[13:15:47] <logmsgbot>	 !log samtar@deploy2002 samtar and dcausse: Backport for [[gerrit:957726|cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream (T325565)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:15:58] <TheresNoTime>	 dcausse: live on mwdebug, can you test that?
[13:15:59] <dcausse>	 looking ^
[13:16:02] <dcausse>	 yes
[13:16:06] <TheresNoTime>	 (ack)
[13:16:58] <dcausse>	 TheresNoTime: all good, I'll need to restart eventgate-main once this one is deployed and before shipping the next one 
[13:17:07] <TheresNoTime>	 okay
[13:17:10] <logmsgbot>	 !log samtar@deploy2002 samtar and dcausse: Continuing with sync
[13:18:20] <wikibugs>	 (03PS4) 10Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739)
[13:19:52] <cormacparle>	 TheresNoTime: is it too late to do a backport that I forgot to put on the list earlier?
[13:20:08] <wikibugs>	 (03PS2) 10Jforrester: [Staging only] wikifunctions: Raise PyWASM CPU limits by 4x [deployment-charts] - 10https://gerrit.wikimedia.org/r/968004
[13:20:15] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] [Staging only] wikifunctions: Raise PyWASM CPU limits by 4x [deployment-charts] - 10https://gerrit.wikimedia.org/r/968004 (owner: 10Jforrester)
[13:20:22] <TheresNoTime>	 cormacparle: it'll probably be okay, depends what :)
[13:20:51] <cormacparle>	 it's this one https://gerrit.wikimedia.org/r/c/mediawiki/extensions/MediaSearch/+/967411/
[13:21:07] <cormacparle>	 was hoping to backport it last night but there were no deployers around
[13:21:13] <wikibugs>	 (03Merged) 10jenkins-bot: [Staging only] wikifunctions: Raise PyWASM CPU limits by 4x [deployment-charts] - 10https://gerrit.wikimedia.org/r/968004 (owner: 10Jforrester)
[13:21:43] <TheresNoTime>	 cormacparle: yeah that's fine, can you cherry-pick it & add it to the backport calendar?
[13:22:04] <cormacparle>	 sure, gimme a sec
[13:22:11] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:957726|cirrus: add the mediawiki.cirrussearch.page_rerender.v1 stream (T325565)]] (duration: 07m 45s)
[13:22:15] <stashbot>	 T325565: Add support for page re-renders - https://phabricator.wikimedia.org/T325565
[13:22:17] <TheresNoTime>	 dcausse: that first patch is deployed, let me know when I can start the next
[13:22:22] <dcausse>	 sure
[13:22:36] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[13:22:50] <logmsgbot>	 !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[13:22:50] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-main: sync
[13:23:18] <logmsgbot>	 !log dcausse@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-main: sync
[13:23:31] <wikibugs>	 (03PS1) 10Cparle: Fix typo (undefined event) [extensions/MediaSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/967780 (https://phabricator.wikimedia.org/T349271)
[13:23:51] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync
[13:24:18] <logmsgbot>	 !log dcausse@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync
[13:24:21] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: sync
[13:24:36] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: sync
[13:25:05] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-main: sync
[13:25:22] <logmsgbot>	 !log dcausse@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: sync
[13:25:38] <wikibugs>	 (03PS5) 10Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739)
[13:25:40] <wikibugs>	 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T349576 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm known issue with no impact
[13:25:59] <dcausse>	 TheresNoTime: should be done
[13:26:05] <TheresNoTime>	 ack, starting the next
[13:26:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse)
[13:26:45] <dcausse>	 this one touches jobrunners so can't be tested
[13:26:52] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/957727 (https://phabricator.wikimedia.org/T325565) (owner: 10DCausse)
[13:27:15] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:957727|cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki (T325565)]]
[13:27:19] <stashbot>	 T325565: Add support for page re-renders - https://phabricator.wikimedia.org/T325565
[13:28:39] <logmsgbot>	 !log samtar@deploy2002 samtar and dcausse: Backport for [[gerrit:957727|cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki (T325565)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:28:45] <logmsgbot>	 !log samtar@deploy2002 samtar and dcausse: Continuing with sync
[13:28:56] <TheresNoTime>	 (syncing as can't be tested)
[13:29:10] <wikibugs>	 (03PS2) 10Samtar: Add stream config for iOS schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968259 (https://phabricator.wikimedia.org/T347122) (owner: 10Tsevener)
[13:29:39] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[13:29:41] <wikibugs>	 (03CR) 10Matthias Mullie: [C: 03+1] Fix typo (undefined event) [extensions/MediaSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/967780 (https://phabricator.wikimedia.org/T349271) (owner: 10Cparle)
[13:30:00] <logmsgbot>	 !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[13:30:03] <logmsgbot>	 !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[13:30:31] <logmsgbot>	 !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[13:30:33] <logmsgbot>	 !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[13:30:52] <cormacparle>	 TheresNoTime: that cherry-pick is done and in the deploy calendar
[13:31:02] <TheresNoTime>	 cormacparle: thanks :)
[13:31:03] <logmsgbot>	 !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[13:33:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: "Thank you for the followup!" [puppet] - 10https://gerrit.wikimedia.org/r/968250 (owner: 10Jbond)
[13:34:10] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:957727|cirrus: add wgCirrusSearchUseEventBusBridge and enable it on testwiki (T325565)]] (duration: 06m 55s)
[13:34:12] <TheresNoTime>	 dcausse: live on prod
[13:34:19] <dcausse>	 TheresNoTime: thanks!
[13:34:29] <TheresNoTime>	 toni_: doing yours now
[13:34:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968259 (https://phabricator.wikimedia.org/T347122) (owner: 10Tsevener)
[13:35:07] <wikibugs>	 (03CR) 10Filippo Giunchedi: prometheus: enable pint for 'cloud' instance (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968239 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi)
[13:35:20] <wikibugs>	 (03Merged) 10jenkins-bot: Add stream config for iOS schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968259 (https://phabricator.wikimedia.org/T347122) (owner: 10Tsevener)
[13:35:43] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:968259|Add stream config for iOS schema (T347122)]]
[13:36:03] <wikibugs>	 (03CR) 10Samtar: [C: 03+2] "starting CI for backport" [extensions/MediaSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/967780 (https://phabricator.wikimedia.org/T349271) (owner: 10Cparle)
[13:36:12] <stashbot>	 T325565: Add support for page re-renders - https://phabricator.wikimedia.org/T325565
[13:36:17] <stashbot>	 T347122: Document Instrumentation and Schema Needs for Suggested Edits on iOS Proof of Concept  - https://phabricator.wikimedia.org/T347122
[13:37:06] <logmsgbot>	 !log samtar@deploy2002 samtar and tsev: Backport for [[gerrit:968259|Add stream config for iOS schema (T347122)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:37:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: enable thanos upload for cloud instance [puppet] - 10https://gerrit.wikimedia.org/r/968238 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi)
[13:37:11] <TheresNoTime>	 toni_: live on mwdebug, can you test?
[13:37:49] <wikibugs>	 (03CR) 10Majavah: [C: 03+1] "ok, sounds good!" [puppet] - 10https://gerrit.wikimedia.org/r/968239 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi)
[13:38:23] <toni_>	 TheresNoTime tested, looks good!
[13:38:28] <logmsgbot>	 !log samtar@deploy2002 samtar and tsev: Continuing with sync
[13:41:33] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10jbond)
[13:43:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw parsoid GET/200: 2.395174465963646s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:43:35] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:968259|Add stream config for iOS schema (T347122)]] (duration: 07m 52s)
[13:43:41] <stashbot>	 T347122: Document Instrumentation and Schema Needs for Suggested Edits on iOS Proof of Concept  - https://phabricator.wikimedia.org/T347122
[13:43:41] <TheresNoTime>	 toni_: live on prod :)
[13:43:43] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): update acme chief access - https://phabricator.wikimedia.org/T349620 (10jbond)
[13:44:18] <TheresNoTime>	 cormacparle: just waiting for your patch to merge, shouldn't be long
[13:44:22] <toni_>	 TheresNoTime great, thank you!
[13:44:24] <cormacparle>	 👍
[13:46:55] <wikibugs>	 (03PS1) 10Jbond: acme_chief: add pki root certificate to list of trusted roots [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620)
[13:48:16] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw parsoid GET/200: 2.395174465963646s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[13:49:16] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: enable pint for 'cloud' instance [puppet] - 10https://gerrit.wikimedia.org/r/968239 (https://phabricator.wikimedia.org/T336854) (owner: 10Filippo Giunchedi)
[13:49:22] <wikibugs>	 (03PS2) 10Filippo Giunchedi: prometheus: enable pint for 'cloud' instance [puppet] - 10https://gerrit.wikimedia.org/r/968239 (https://phabricator.wikimedia.org/T336854)
[13:50:48] <wikibugs>	 (03Merged) 10jenkins-bot: Fix typo (undefined event) [extensions/MediaSearch] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/967780 (https://phabricator.wikimedia.org/T349271) (owner: 10Cparle)
[13:51:28] <logmsgbot>	 !log samtar@deploy2002 Started scap: Backport for [[gerrit:967780|Fix typo (undefined event) (T349271)]]
[13:51:33] <stashbot>	 T349271: Errors in at reportLoadTiming: Cannot read properties of undefined (reading 'loadEventEnd') / event is undefined / Cannot read property 'loadEventEnd' of undefined - https://phabricator.wikimedia.org/T349271
[13:52:48] <logmsgbot>	 !log samtar@deploy2002 samtar and cparle: Backport for [[gerrit:967780|Fix typo (undefined event) (T349271)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:52:52] <TheresNoTime>	 cormacparle: live on mwdebug, can you test?
[13:53:31] <cormacparle>	 sure ...
[13:55:42] <cormacparle>	 works!
[13:55:46] <logmsgbot>	 !log samtar@deploy2002 samtar and cparle: Continuing with sync
[13:55:50] <TheresNoTime>	 syncing :)
[13:55:58] <wikibugs>	 (03CR) 10Jbond: prometheus: realise blackbox::check's instantly on prometheus hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968250 (owner: 10Jbond)
[14:00:55] <logmsgbot>	 !log samtar@deploy2002 Finished scap: Backport for [[gerrit:967780|Fix typo (undefined event) (T349271)]] (duration: 09m 26s)
[14:00:56] <TheresNoTime>	 cormacparle: live on prod :)
[14:01:10] <stashbot>	 T349271: Errors in at reportLoadTiming: Cannot read properties of undefined (reading 'loadEventEnd') / event is undefined / Cannot read property 'loadEventEnd' of undefined - https://phabricator.wikimedia.org/T349271
[14:01:47] <TheresNoTime>	 !log close backport window
[14:01:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:06] <wikibugs>	 (03CR) 10Stevemunene: [C: 03+1] Fix: remove duplicate /topicmappr path in zk URL [puppet] - 10https://gerrit.wikimedia.org/r/968257 (owner: 10Brouberol)
[14:03:13] <cormacparle>	 TheresNoTime: do I need to wait a few mins for the js cache to clear? still seeing the bug on prod (and not on debug)
[14:04:44] <TheresNoTime>	 cormacparle: potentially (I'm not quite sure myself), let me see if there's any docs on that..
[14:06:19] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1 C: 03+2] Fix: remove duplicate /topicmappr path in zk URL [puppet] - 10https://gerrit.wikimedia.org/r/968257 (owner: 10Brouberol)
[14:07:04] <TheresNoTime>	 ah yes, it "can take up to five minutes"
[14:07:17] <cormacparle>	 aha ok, will wait a little while so ...
[14:08:27] <TheresNoTime>	 (or does `?debug=true` work?)
[14:09:32] <cormacparle>	 it does indeed
[14:09:53] <cormacparle>	 great, thank you!
[14:10:24] <TheresNoTime>	 you're welcome :)
[14:11:42] <wikibugs>	 10SRE-OnFire, 10Cloud-VPS, 10Observability-Alerting, 10cloud-services-team, and 2 others: monitoring: find out how we could have been paged for outage "Multiple CloudVPS instances lost their IPs" - https://phabricator.wikimedia.org/T347694 (10lmata)
[14:12:11] <wikibugs>	 (03PS2) 10Jbond: acme_chief: add new puppet intermediate CA to list of trusted clients [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620)
[14:13:34] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/158/con" [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) (owner: 10Jbond)
[14:14:33] <wikibugs>	 (03CR) 10Vgutierrez: [C: 04-1] acme_chief: add new puppet intermediate CA to list of trusted clients (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) (owner: 10Jbond)
[14:16:14] <wikibugs>	 (03PS3) 10Jbond: acme_chief: add new puppet intermediate CA to list of trusted clients [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620)
[14:16:24] <wikibugs>	 (03CR) 10Jbond: "fixed thanks" [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) (owner: 10Jbond)
[14:17:48] <wikibugs>	 (03CR) 10Majavah: [C: 04-1] "hardcoding the certs will break acme-chief cert validation in cloud vps" [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) (owner: 10Jbond)
[14:19:22] <wikibugs>	 (03CR) 10Filippo Giunchedi: prometheus: realise blackbox::check's instantly on prometheus hosts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968250 (owner: 10Jbond)
[14:19:57] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:31:07] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:31:19] <wikibugs>	 (03PS6) 10Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - 10https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739)
[14:31:33] <wikibugs>	 (03PS1) 10Majavah: site: Re-image cloudmetrics hosts as insetup [puppet] - 10https://gerrit.wikimedia.org/r/968277 (https://phabricator.wikimedia.org/T336774)
[14:31:35] <wikibugs>	 (03PS1) 10Majavah: hieradata: drop prometheus access for cloudmetrics1003/4 [puppet] - 10https://gerrit.wikimedia.org/r/968278 (https://phabricator.wikimedia.org/T336854)
[14:31:37] <wikibugs>	 (03PS1) 10Majavah: P:alertmanager: drop cloudmetrics hosts [puppet] - 10https://gerrit.wikimedia.org/r/968279 (https://phabricator.wikimedia.org/T336854)
[14:31:39] <wikibugs>	 (03PS1) 10Majavah: P:wmcs::prometheus: drop profile [puppet] - 10https://gerrit.wikimedia.org/r/968280 (https://phabricator.wikimedia.org/T336854)
[14:31:41] <wikibugs>	 (03PS1) 10Majavah: P:wmcs: drop graphite manifests [puppet] - 10https://gerrit.wikimedia.org/r/968281
[14:31:43] <wikibugs>	 (03PS1) 10Majavah: O:wmcs::monitoring: drop role [puppet] - 10https://gerrit.wikimedia.org/r/968282 (https://phabricator.wikimedia.org/T336774)
[14:32:04] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Adding db1227 depooled', diff saved to https://phabricator.wikimedia.org/P53041 and previous config saved to /var/cache/conftool/dbconfig/20231024-143204-arnaudb.json
[14:38:39] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:16] <wikibugs>	 (03CR) 10Fabfur: [C: 03+2] hiera: enable dual disk storage for new cp hosts in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/967235 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur)
[14:39:41] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Puppet (Puppet 7.0): Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10Aklapper)
[14:39:56] <wikibugs>	 (03CR) 10Fabfur: [C: 03+2] haproxy: enable healthcheck-dedicated backend [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur)
[14:40:12] <wikibugs>	 (03CR) 10Fabfur: "wrong window" [puppet] - 10https://gerrit.wikimedia.org/r/966221 (https://phabricator.wikimedia.org/T348851) (owner: 10Fabfur)
[14:41:07] <wikibugs>	 (03PS8) 10Hashar: Add a json representation of the build [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407
[14:41:11] <wikibugs>	 (03PS3) 10Hashar: Add a json representation for each host [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967479
[14:42:04] <wikibugs>	 (03PS4) 10Jbond: acme_chief: add new puppet intermediate CA to list of trusted clients [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620)
[14:42:19] <wikibugs>	 (03CR) 10Hashar: Add a json representation for each host (031 comment) [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967479 (owner: 10Hashar)
[14:42:31] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] acme_chief: add new puppet intermediate CA to list of trusted clients [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) (owner: 10Jbond)
[14:43:17] <wikibugs>	 (03CR) 10Hashar: Add a json representation of the build (031 comment) [software/puppet-compiler] (2.x) - 10https://gerrit.wikimedia.org/r/967407 (owner: 10Hashar)
[14:45:37] <wikibugs>	 (03PS5) 10Jbond: acme_chief: add new puppet intermediate CA to list of trusted clients [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620)
[14:47:32] <wikibugs>	 (03PS6) 10Jbond: acme_chief: add new puppet intermediate CA to list of trusted clients [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620)
[14:48:36] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1100.eqiad.wmnet with OS bullseye
[14:50:08] <wikibugs>	 (03PS7) 10Jbond: acme_chief: add new puppet intermediate CA to list of trusted clients [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620)
[14:50:14] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host snapshot1016.eqiad.wmnet
[14:50:41] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:51:30] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/162/con" [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) (owner: 10Jbond)
[14:53:18] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[14:53:20] <wikibugs>	 (03PS8) 10Jbond: acme_chief: add new puppet intermediate CA to list of trusted clients [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620)
[14:53:39] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:54:47] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) (owner: 10Jbond)
[14:55:19] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] acme_chief: add new puppet intermediate CA to list of trusted clients (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/968269 (https://phabricator.wikimedia.org/T349620) (owner: 10Jbond)
[14:57:08] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host snapshot1016.eqiad.wmnet
[14:58:44] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1100.eqiad.wmnet with OS bullseye
[14:59:28] <wikibugs>	 (03CR) 10Herron: [C: 03+1] alertmanager: ship amtool.yml for AM api access [puppet] - 10https://gerrit.wikimedia.org/r/968232 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi)
[14:59:42] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1100.eqiad.wmnet with OS bullseye
[15:00:04] <jouncebot>	 eoghan, jelto, and arnoldokoth: #bothumor I � Unicode. All rise for SRE Collaboration Services office hours deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T1500).
[15:00:40] <wikibugs>	 (03PS1) 10Filippo Giunchedi: prometheus: add cloud replica label [puppet] - 10https://gerrit.wikimedia.org/r/968284 (https://phabricator.wikimedia.org/T336854)
[15:00:50] <wikibugs>	 (03CR) 10Herron: [C: 03+1] alertmanager: let karma use apache to access AM [puppet] - 10https://gerrit.wikimedia.org/r/968119 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi)
[15:02:22] <wikibugs>	 (03PS1) 10Aqu: [WIP] Send metrics from Airflow analytics test [puppet] - 10https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532)
[15:02:59] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:04:13] <wikibugs>	 (PS2) Filippo Giunchedi: prometheus: add cloud replica label [puppet] - https://gerrit.wikimedia.org/r/968284 (https://phabricator.wikimedia.org/T336854)
[15:04:54] <wikibugs>	 (CR) CI reject: [V: -1] [WIP] Send metrics from Airflow analytics test [puppet] - https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: Aqu)
[15:05:06] <wikibugs>	 (CR) Aqu: [V: +1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: Aqu)
[15:06:04] <wikibugs>	 (PS7) Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739)
[15:07:06] <wikibugs>	 (CR) Majavah: [C: +1] prometheus: add cloud replica label [puppet] - https://gerrit.wikimedia.org/r/968284 (https://phabricator.wikimedia.org/T336854) (owner: Filippo Giunchedi)
[15:07:24] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] prometheus: add cloud replica label [puppet] - https://gerrit.wikimedia.org/r/968284 (https://phabricator.wikimedia.org/T336854) (owner: Filippo Giunchedi)
[15:09:08] <wikibugs>	 (PS8) Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739)
[15:10:07] <logmsgbot>	 !log fabfur@cumin1001 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp1100.eqiad.wmnet with OS bullseye
[15:11:48] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.reimage for host cp1100.eqiad.wmnet with OS bullseye
[15:14:37] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:15:51] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:19:21] <wikibugs>	 (PS9) Jbond: sre.puppet.migrate-role: add new cookbook to migrate roles to puppet7 [cookbooks] - https://gerrit.wikimedia.org/r/967935 (https://phabricator.wikimedia.org/T340739)
[15:22:21] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:22:31] <godog>	 !log clean up overlapping blocks from thanos for instance 'cloud'
[15:22:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:27] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:26:21] <icinga-wm>	 PROBLEM - Check systemd state on titan2001 is CRITICAL: CRITICAL - degraded: The following units failed: thanos-compact.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:26:33] <logmsgbot>	 !log fabfur@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cp1100.eqiad.wmnet with reason: host reimage
[15:26:41] <godog>	 the thanos-compact is me
[15:27:43] <icinga-wm>	 RECOVERY - Check systemd state on titan2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:32:12] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp1100.eqiad.wmnet with reason: host reimage
[15:45:27] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:46:01] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:47:05] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 2.672 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:47:25] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:47:41] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[15:48:47] <logmsgbot>	 !log fabfur@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp1100.eqiad.wmnet with OS bullseye
[15:49:15] <wikibugs>	 (CR) Bartosz Dziewoński: [C: +1] "https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T2000" [mediawiki-config] - https://gerrit.wikimedia.org/r/967995 (owner: Gergő Tisza)
[15:55:55] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[16:00:05] <jouncebot>	 jbond and rzl: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Puppet request window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T1600).
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:07:18] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:09:01] <wikibugs>	 (PS1) MVernon: profile::tlsproxy::envoy fix docstring typo [puppet] - https://gerrit.wikimedia.org/r/968291
[16:12:38] <wikibugs>	 (CR) RLazarus: [C: +1] profile::tlsproxy::envoy fix docstring typo [puppet] - https://gerrit.wikimedia.org/r/968291 (owner: MVernon)
[16:13:07] <wikibugs>	 (CR) MVernon: [C: +2] profile::tlsproxy::envoy fix docstring typo [puppet] - https://gerrit.wikimedia.org/r/968291 (owner: MVernon)
[16:19:11] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:19:49] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:20:25] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:21:01] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50713 bytes in 0.061 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:21:40] <wikibugs>	 (PS1) Jbond: systemd::service: Add service owner parameter [puppet] - https://gerrit.wikimedia.org/r/968293 (https://phabricator.wikimedia.org/T349176)
[16:38:58] <wikibugs>	 SRE-swift-storage, API Platform, Commons, MediaWiki-File-management, and 4 others: Commons: UploadChunkFileException: Error storing file: backend-fail-internal; local-swift-codfw - https://phabricator.wikimedia.org/T328872 (aaron)
[16:39:19] <wikibugs>	 SRE, Traffic: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (Fabfur)
[16:44:04] <logmsgbot>	 !log xcollazo@deploy2002 Started deploy [airflow-dags/analytics@cc56357]: Deploying latest DAGs to analytics Airflow instance
[16:46:00] <logmsgbot>	 !log xcollazo@deploy2002 Finished deploy [airflow-dags/analytics@cc56357]: Deploying latest DAGs to analytics Airflow instance (duration: 01m 55s)
[16:46:04] <wikibugs>	 (PS1) BCornwall: hiera: remove dns5003 from authdns_servers [puppet] - https://gerrit.wikimedia.org/r/968294 (https://phabricator.wikimedia.org/T342154)
[16:47:16] <wikibugs>	 (CR) BCornwall: [C: +2] hiera: remove dns5003 from authdns_servers [puppet] - https://gerrit.wikimedia.org/r/968294 (https://phabricator.wikimedia.org/T342154) (owner: BCornwall)
[16:53:35] <wikibugs>	 (CR) Btullis: [WIP] Send metrics from Airflow analytics test (1 comment) [puppet] - https://gerrit.wikimedia.org/r/968285 (https://phabricator.wikimedia.org/T349532) (owner: Aqu)
[16:59:16] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns5003.wikimedia.org with OS bookworm
[16:59:28] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns5003.wikimedia.org with OS bookworm
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T1700)
[17:00:20] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[17:03:03] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:04:05] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[17:06:47] <icinga-wm>	 PROBLEM - Host 2001:df2:e500:1:103:102:166:10 is DOWN: PING CRITICAL - Packet loss = 100%
[17:06:54] <jinxer-wm>	 (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[17:08:39] <icinga-wm>	 PROBLEM - Recursive DNS on 103.102.166.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[17:08:39] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:16:58] <wikibugs>	 (PS6) Hnowlan: Upgrade container and dependencies for bullseye [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/920760 (https://phabricator.wikimedia.org/T336881)
[17:17:35] <wikibugs>	 (PS1) Andrew Bogott: trove-guestagent: include service credentials [puppet] - https://gerrit.wikimedia.org/r/968299 (https://phabricator.wikimedia.org/T349651)
[17:17:38] <wikibugs>	 (PS1) Andrew Bogott: codfw1dev keystone/swift: make endpoints public [puppet] - https://gerrit.wikimedia.org/r/968300 (https://phabricator.wikimedia.org/T349651)
[17:17:42] <wikibugs>	 (PS1) Fabfur: hiera: added new cp hosts for eqiad [puppet] - https://gerrit.wikimedia.org/r/968301 (https://phabricator.wikimedia.org/T349244)
[17:23:39] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:26:35] <wikibugs>	 (CR) Jforrester: "Apparently caused T349648." [dns] - https://gerrit.wikimedia.org/r/967898 (https://phabricator.wikimedia.org/T326266) (owner: Majavah)
[17:28:10] <wikibugs>	 (CR) Ssingh: "Looks good overall for the missing bits, sorry for overlooking them in the last review. One comment we should fix in this CR in-line and t" [puppet] - https://gerrit.wikimedia.org/r/968301 (https://phabricator.wikimedia.org/T349244) (owner: Fabfur)
[17:29:52] <wikibugs>	 (CR) Andrew Bogott: "Taavi, Francesco and I discussed this today. Since these creds wind up only in the Trove service project (which has limited access) and th" [puppet] - https://gerrit.wikimedia.org/r/968299 (https://phabricator.wikimedia.org/T349651) (owner: Andrew Bogott)
[17:32:56] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:36:05] <wikibugs>	 (CR) Andrew Bogott: "For this patch and the previous one: https://puppet-compiler.wmflabs.org/output/968300/167/" [puppet] - https://gerrit.wikimedia.org/r/968300 (https://phabricator.wikimedia.org/T349651) (owner: Andrew Bogott)
[17:46:00] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:46:08] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns5003.wikimedia.org with reason: host reimage
[17:46:08] <wikibugs>	 (PS1) Ottomata: eventgate-logging-external - upgrade to debian bookworm and nodejs 18 [deployment-charts] - https://gerrit.wikimedia.org/r/968304 (https://phabricator.wikimedia.org/T347477)
[17:47:57] <wikibugs>	 (CR) Ottomata: [C: +2] eventgate-logging-external - upgrade to debian bookworm and nodejs 18 [deployment-charts] - https://gerrit.wikimedia.org/r/968304 (https://phabricator.wikimedia.org/T347477) (owner: Ottomata)
[17:48:49] <wikibugs>	 (Merged) jenkins-bot: eventgate-logging-external - upgrade to debian bookworm and nodejs 18 [deployment-charts] - https://gerrit.wikimedia.org/r/968304 (https://phabricator.wikimedia.org/T347477) (owner: Ottomata)
[17:49:14] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns5003.wikimedia.org with reason: host reimage
[17:50:06] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[17:51:57] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1128 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/c 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[17:53:39] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:54:29] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[17:55:35] <icinga-wm>	 PROBLEM - Recursive DNS on 103.102.166.10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[18:00:05] <jouncebot>	 dancy and brennen: Time to snap out of that daydream and deploy MediaWiki train - Utc-7 Version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T1800).
[18:00:14] <dancy>	 o/
[18:00:41] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[18:02:33] <brennen>	 o/
[18:03:03] <icinga-wm>	 PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:10 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[18:03:35] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[18:04:48] * dancy reads https://phabricator.wikimedia.org/T349310
[18:05:51] <dancy>	 looks safe to proceed.
[18:06:37] <wikibugs>	 (PS1) TrainBranchBot: group0 wikis to 1.42.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/968326 (https://phabricator.wikimedia.org/T348355)
[18:06:39] <wikibugs>	 (CR) TrainBranchBot: [C: +2] group0 wikis to 1.42.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/968326 (https://phabricator.wikimedia.org/T348355) (owner: TrainBranchBot)
[18:06:41] <wikibugs>	 (PS1) Ottomata: eventgate-logging-external - use 127.0.0.1 instead of localhost [deployment-charts] - https://gerrit.wikimedia.org/r/968327 (https://phabricator.wikimedia.org/T347477)
[18:07:27] <wikibugs>	 (Merged) jenkins-bot: group0 wikis to 1.42.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/968326 (https://phabricator.wikimedia.org/T348355) (owner: TrainBranchBot)
[18:08:13] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on an-worker1146 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/f 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[18:08:45] <wikibugs>	 (CR) Ottomata: [C: +2] eventgate-logging-external - use 127.0.0.1 instead of localhost [deployment-charts] - https://gerrit.wikimedia.org/r/968327 (https://phabricator.wikimedia.org/T347477) (owner: Ottomata)
[18:09:31] <wikibugs>	 (Merged) jenkins-bot: eventgate-logging-external - use 127.0.0.1 instead of localhost [deployment-charts] - https://gerrit.wikimedia.org/r/968327 (https://phabricator.wikimedia.org/T347477) (owner: Ottomata)
[18:11:14] <wikibugs>	 (CR) Majavah: [C: -1] "Everything is open to the cloud vps VM ranges by default, so I don't think this is needed?" [puppet] - https://gerrit.wikimedia.org/r/968300 (https://phabricator.wikimedia.org/T349651) (owner: Andrew Bogott)
[18:12:23] <wikibugs>	 (PS1) Jdrewniak: Follow-up to 74b5834: Add language prefix to Readability survey [skins/Vector] (wmf/1.42.0-wmf.1) - https://gerrit.wikimedia.org/r/968311 (https://phabricator.wikimedia.org/T349232)
[18:12:49] <wikibugs>	 (PS1) Jdrewniak: Follow-up to 74b5834: Add language prefix to Readability survey [skins/Vector] (wmf/1.42.0-wmf.2) - https://gerrit.wikimedia.org/r/968312 (https://phabricator.wikimedia.org/T349232)
[18:13:33] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: apply
[18:13:40] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:13:48] <logmsgbot>	 !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.2  refs T348355
[18:13:50] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply
[18:13:53] <stashbot>	 T348355: 1.42.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T348355
[18:14:01] <icinga-wm>	 RECOVERY - Recursive DNS on 103.102.166.10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[18:14:30] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:14:31] <icinga-wm>	 RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:10 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[18:15:32] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[18:16:00] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[18:18:12] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply
[18:18:56] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply
[18:21:04] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:23:55] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply
[18:24:43] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply
[18:25:58] <wikibugs>	 (PS1) Jdrewniak: Enable Vector readability survey on select wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/968328 (https://phabricator.wikimedia.org/T349232)
[18:26:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:28:21] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on an-worker1146 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[18:29:53] <wikibugs>	 (CR) Eevans: [C: +2] Decommission restbase2012 [puppet] - https://gerrit.wikimedia.org/r/968006 (https://phabricator.wikimedia.org/T349526) (owner: Eevans)
[18:31:35] <wikibugs>	 (PS1) Ottomata: eventgate-analytics - upgrade to debian bookworm and nodejs 18 [deployment-charts] - https://gerrit.wikimedia.org/r/968330 (https://phabricator.wikimedia.org/T347477)
[18:31:41] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.decommission for hosts restbase2012.codfw.wmnet
[18:34:09] <wikibugs>	 (PS1) Majavah: openstack: encapi: don't try to hold a single connection open [puppet] - https://gerrit.wikimedia.org/r/968331 (https://phabricator.wikimedia.org/T349195)
[18:34:39] <wikibugs>	 (CR) CI reject: [V: -1] openstack: encapi: don't try to hold a single connection open [puppet] - https://gerrit.wikimedia.org/r/968331 (https://phabricator.wikimedia.org/T349195) (owner: Majavah)
[18:35:11] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on an-worker1128 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[18:35:12] <wikibugs>	 (PS2) Majavah: openstack: encapi: don't try to hold a single connection open [puppet] - https://gerrit.wikimedia.org/r/968331 (https://phabricator.wikimedia.org/T349195)
[18:35:14] <wikibugs>	 (CR) Ottomata: [C: +2] eventgate-analytics - upgrade to debian bookworm and nodejs 18 [deployment-charts] - https://gerrit.wikimedia.org/r/968330 (https://phabricator.wikimedia.org/T347477) (owner: Ottomata)
[18:36:43] <wikibugs>	 (Merged) jenkins-bot: eventgate-analytics - upgrade to debian bookworm and nodejs 18 [deployment-charts] - https://gerrit.wikimedia.org/r/968330 (https://phabricator.wikimedia.org/T347477) (owner: Ottomata)
[18:37:34] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.dns.netbox
[18:38:43] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[18:39:15] <logmsgbot>	 !log otto@deploy2002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[18:39:46] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase2012.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001"
[18:41:02] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[18:41:06] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase2012.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001"
[18:41:06] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:41:06] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase2012.codfw.wmnet
[18:41:18] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[18:42:06] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[18:42:07] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[18:42:08] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[18:42:59] <logmsgbot>	 !log otto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[18:47:04] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[18:47:05] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[18:47:27] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply
[18:48:14] <logmsgbot>	 !log otto@deploy2002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply
[18:48:24] <wikibugs>	 (PS1) Andrew Bogott: developer-portal: update version tag [deployment-charts] - https://gerrit.wikimedia.org/r/968333 (https://phabricator.wikimedia.org/T349045)
[18:48:38] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[18:48:39] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[18:49:01] <wikibugs>	 (CR) Alex Paskulin: [C: +1] developer-portal: update version tag [deployment-charts] - https://gerrit.wikimedia.org/r/968333 (https://phabricator.wikimedia.org/T349045) (owner: Andrew Bogott)
[18:49:18] <wikibugs>	 (CR) Andrew Bogott: [C: +2] developer-portal: update version tag [deployment-charts] - https://gerrit.wikimedia.org/r/968333 (https://phabricator.wikimedia.org/T349045) (owner: Andrew Bogott)
[18:50:02] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[18:50:03] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[18:50:07] <wikibugs>	 (Merged) jenkins-bot: developer-portal: update version tag [deployment-charts] - https://gerrit.wikimedia.org/r/968333 (https://phabricator.wikimedia.org/T349045) (owner: Andrew Bogott)
[18:50:29] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[18:50:42] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[18:50:53] <wikibugs>	 (CR) Ssingh: hiera: added new cp hosts for eqiad (1 comment) [puppet] - https://gerrit.wikimedia.org/r/968301 (https://phabricator.wikimedia.org/T349244) (owner: Fabfur)
[18:52:05] <wikibugs>	 ops-codfw, Cassandra, decommission-hardware: decommission restbase2012.codfw.wmnet - https://phabricator.wikimedia.org/T349526 (Eevans)
[18:53:06] <wikibugs>	 (PS1) Ottomata: eventgate-analytics-external - upgrade to debian bookworm and nodejs 18 [deployment-charts] - https://gerrit.wikimedia.org/r/968334 (https://phabricator.wikimedia.org/T347477)
[18:53:18] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[18:54:39] <wikibugs>	 ops-codfw, Cassandra, decommission-hardware: decommission restbase2012.codfw.wmnet - https://phabricator.wikimedia.org/T349526 (Eevans)
[18:54:47] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns5003.wikimedia.org with OS bookworm
[18:54:54] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[18:54:56] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns5003.wikimedia.org with OS bookworm completed: - dns5003 (**PASS**)   - Downtimed on Icinga/Al...
[18:55:08] <logmsgbot>	 !log andrew@deploy2002 helmfile [staging] START helmfile.d/services/developer-portal: apply
[18:55:26] <logmsgbot>	 !log andrew@deploy2002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply
[18:59:05] <logmsgbot>	 !log andrew@deploy2002 helmfile [codfw] START helmfile.d/services/developer-portal: apply
[18:59:29] <logmsgbot>	 !log andrew@deploy2002 helmfile [codfw] DONE helmfile.d/services/developer-portal: apply
[19:00:02] <logmsgbot>	 !log andrew@deploy2002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply
[19:00:39] <logmsgbot>	 !log andrew@deploy2002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply
[19:03:47] <wikibugs>	 (PS1) BCornwall: Revert "hiera: remove dns5003 from authdns_servers" [puppet] - https://gerrit.wikimedia.org/r/968313
[19:04:15] <wikibugs>	 (CR) Ssingh: [C: +1] Revert "hiera: remove dns5003 from authdns_servers" [puppet] - https://gerrit.wikimedia.org/r/968313 (owner: BCornwall)
[19:05:22] <wikibugs>	 (CR) BCornwall: [C: +2] Revert "hiera: remove dns5003 from authdns_servers" [puppet] - https://gerrit.wikimedia.org/r/968313 (owner: BCornwall)
[19:08:22] <wikibugs>	 (Abandoned) Andrew Bogott: codfw1dev keystone/swift: make endpoints public [puppet] - https://gerrit.wikimedia.org/r/968300 (https://phabricator.wikimedia.org/T349651) (owner: Andrew Bogott)
[19:10:52] <wikibugs>	 (PS1) Eevans: Decommission restbase1016 [puppet] - https://gerrit.wikimedia.org/r/968339 (https://phabricator.wikimedia.org/T349526)
[19:10:54] <wikibugs>	 (PS1) Eevans: Decommission restbase1017 [puppet] - https://gerrit.wikimedia.org/r/968340 (https://phabricator.wikimedia.org/T349526)
[19:10:56] <wikibugs>	 (PS1) Eevans: Decommission restbase1018 [puppet] - https://gerrit.wikimedia.org/r/968341 (https://phabricator.wikimedia.org/T349526)
[19:11:42] <wikibugs>	 (PS1) BCornwall: hiera: remove dns5004 from authdns_servers [puppet] - https://gerrit.wikimedia.org/r/968342 (https://phabricator.wikimedia.org/T342154)
[19:12:30] <wikibugs>	 (CR) Majavah: [C: +1] "as discussed earlier I think this is fine. users don't have direct access to the VMs and this is a trove-specific password" [puppet] - https://gerrit.wikimedia.org/r/968299 (https://phabricator.wikimedia.org/T349651) (owner: Andrew Bogott)
[19:13:13] <wikibugs>	 (CR) Ssingh: [C: +1] hiera: remove dns5004 from authdns_servers [puppet] - https://gerrit.wikimedia.org/r/968342 (https://phabricator.wikimedia.org/T342154) (owner: BCornwall)
[19:13:54] <wikibugs>	 (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/968341 (https://phabricator.wikimedia.org/T349526) (owner: Eevans)
[19:14:05] <wikibugs>	 (CR) BCornwall: [C: +2] hiera: remove dns5004 from authdns_servers [puppet] - https://gerrit.wikimedia.org/r/968342 (https://phabricator.wikimedia.org/T342154) (owner: BCornwall)
[19:16:19] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:20:39] <wikibugs>	 (PS1) Ottomata: Enable canary events for all MediaWiki event streams [mediawiki-config] - https://gerrit.wikimedia.org/r/968344 (https://phabricator.wikimedia.org/T266798)
[19:23:28] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host dns5004.wikimedia.org with OS bookworm
[19:23:38] <wikibugs>	 SRE, Traffic, Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host dns5004.wikimedia.org with OS bookworm
[19:27:27] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:28:33] <icinga-wm>	 PROBLEM - BFD status on cr3-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[19:31:45] <wikibugs>	 (03PS1) 10Bking: wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011)
[19:33:39] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:35:33] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) (owner: 10Bking)
[19:36:25] <wikibugs>	 (03PS2) 10Bking: wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011)
[19:40:01] <wikibugs>	 (03PS2) 10Ottomata: Enable canary events for all MediaWiki event streams [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968344 (https://phabricator.wikimedia.org/T266798)
[19:40:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) (owner: 10Bking)
[19:42:12] <wikibugs>	 (03PS3) 10Bking: wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011)
[19:43:03] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[19:43:04] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[19:44:59] <wikibugs>	 (03PS4) 10Bking: wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011)
[19:45:27] <wikibugs>	 (03PS2) 10C. Scott Ananian: Enable Parsoid interal REST API only on Parsoid cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980)
[19:45:27] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[19:45:28] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[19:46:46] <wikibugs>	 (03PS5) 10Bking: wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011)
[19:47:24] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/968119 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi)
[19:47:27] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[19:47:28] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[19:48:37] <wikibugs>	 (03PS3) 10C. Scott Ananian: Disable Parsoid internal REST API everywhere except on Parsoid cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980)
[19:48:51] <wikibugs>	 (03CR) 10Ottomata: [C: 04-1] "-1 until we coordinate with some folks, and send an announcement." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968344 (https://phabricator.wikimedia.org/T266798) (owner: 10Ottomata)
[19:49:08] <wikibugs>	 (03CR) 10C. Scott Ananian: Disable Parsoid internal REST API everywhere except on Parsoid cluster (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980) (owner: 10C. Scott Ananian)
[19:49:20] <wikibugs>	 (03PS6) 10Bking: wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011)
[19:49:30] <logmsgbot>	 !log bking@cumin1001 START - Cookbook sre.wdqs.data-reload
[19:53:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wdqs.data-reload: add logic for graph_split hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/968346 (https://phabricator.wikimedia.org/T349011) (owner: 10Bking)
[19:53:40] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[19:56:36] <logmsgbot>	 !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@c585842]: T346373: Update mjolnir to use python 3.10
[19:56:43] <stashbot>	 T346373: Ensure mjolnir can work on Python 3.9 or later - https://phabricator.wikimedia.org/T346373
[19:57:05] <logmsgbot>	 !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@c585842]: T346373: Update mjolnir to use python 3.10 (duration: 00m 28s)
[19:57:48] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/968232 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231024T2000).
[20:00:05] <jouncebot>	 MatmaRex, jan_drewniak, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:20] <wikibugs>	 (03CR) 10Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse)
[20:00:25] <jan_drewniak>	 O/
[20:00:30] <MatmaRex>	 hi. all of my changes today are no-ops
[20:00:59] <MatmaRex>	 (so, feel free to ship them all at once without testing)
[20:02:17] <MatmaRex>	 jan_drewniak: cscott: yesterday there was no deployer for this window, so if either of you are able to deploy, you might want to get started
[20:03:17] <RhinosF1>	 thcipriani: ^
[20:06:14] <jan_drewniak>	 MatmaRex: cscott: ok, I can do the deploys in that case
[20:07:18] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[20:07:41] <jan_drewniak>	 MatmaRex: I'm doing yours first
[20:08:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967208 (owner: 10Bartosz Dziewoński)
[20:08:41] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza)
[20:08:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967973 (https://phabricator.wikimedia.org/T344757) (owner: 10Bartosz Dziewoński)
[20:08:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967995 (owner: 10Gergő Tisza)
[20:09:50] <jan_drewniak>	 MatmaRex: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/967208 needs a rebase
[20:10:15] <wikibugs>	 (03PS3) 10Bartosz Dziewoński: Update comment about EditAttemptStep instruments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967208
[20:10:37] <MatmaRex>	 well, they all do, but they rebase cleanly, so you can just click the button in gerrit
[20:10:58] <logmsgbot>	 !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on dns5004.wikimedia.org with reason: host reimage
[20:11:03] <wikibugs>	 (03PS7) 10Bartosz Dziewoński: CentralAuth: Clarify why we don't use second-level domain for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza)
[20:11:12] <wikibugs>	 (03PS3) 10Bartosz Dziewoński: Remove unused VisualEditor config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967973 (https://phabricator.wikimedia.org/T344757)
[20:11:20] <wikibugs>	 (03PS2) 10Bartosz Dziewoński: [noop] Explain more thoroughly how the '-' prefix works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967995 (owner: 10Gergő Tisza)
[20:11:30] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967208 (owner: 10Bartosz Dziewoński)
[20:11:31] <MatmaRex>	 jan_drewniak: they should be good to go now
[20:11:32] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza)
[20:11:34] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967973 (https://phabricator.wikimedia.org/T344757) (owner: 10Bartosz Dziewoński)
[20:11:36] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967995 (owner: 10Gergő Tisza)
[20:12:12] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [noop] Explain more thoroughly how the '-' prefix works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967995 (owner: 10Gergő Tisza)
[20:12:22] <wikibugs>	 (03Merged) 10jenkins-bot: Update comment about EditAttemptStep instruments [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967208 (owner: 10Bartosz Dziewoński)
[20:12:24] <wikibugs>	 (03Merged) 10jenkins-bot: CentralAuth: Clarify why we don't use second-level domain for some wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967394 (https://phabricator.wikimedia.org/T257852) (owner: 10Gergő Tisza)
[20:13:07] <wikibugs>	 (03PS3) 10Jdrewniak: [noop] Explain more thoroughly how the '-' prefix works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967995 (owner: 10Gergő Tisza)
[20:13:31] <wikibugs>	 (03Merged) 10jenkins-bot: Remove unused VisualEditor config settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967973 (https://phabricator.wikimedia.org/T344757) (owner: 10Bartosz Dziewoński)
[20:14:09] <jinxer-wm>	 (PHPFPMTooBusy) firing: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 47.22% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:14:11] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dns5004.wikimedia.org with reason: host reimage
[20:14:20] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967973 (https://phabricator.wikimedia.org/T344757) (owner: 10Bartosz Dziewoński)
[20:14:22] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967995 (owner: 10Gergő Tisza)
[20:14:25] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967995 (owner: 10Gergő Tisza)
[20:16:09] <wikibugs>	 (03Merged) 10jenkins-bot: [noop] Explain more thoroughly how the '-' prefix works [mediawiki-config] - 10https://gerrit.wikimedia.org/r/967995 (owner: 10Gergő Tisza)
[20:16:35] <logmsgbot>	 !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:967208|Update comment about EditAttemptStep instruments]], [[gerrit:967394|CentralAuth: Clarify why we don't use second-level domain for some wikis (T257852)]], [[gerrit:967973|Remove unused VisualEditor config settings (T344757 T344759)]], [[gerrit:967995|[noop] Explain more thoroughly how the '-' prefix works]]
[20:16:54] <stashbot>	 T257852: CentralAuth edge login and autologin for some Wikimedia domains broken on mobile - https://phabricator.wikimedia.org/T257852
[20:16:54] <stashbot>	 T344759: Remove VisualEditorTransitionDefault config and AutodisableVisualEditorPref maint script - https://phabricator.wikimedia.org/T344759
[20:16:54] <stashbot>	 T344757: Remove the BetaFeatures integration in VisualEditor - https://phabricator.wikimedia.org/T344757
[20:17:58] <logmsgbot>	 !log jdrewniak@deploy2002 tgr and matmarex and jdrewniak: Backport for [[gerrit:967208|Update comment about EditAttemptStep instruments]], [[gerrit:967394|CentralAuth: Clarify why we don't use second-level domain for some wikis (T257852)]], [[gerrit:967973|Remove unused VisualEditor config settings (T344757 T344759)]], [[gerrit:967995|[noop] Explain more thoroughly how the '-' prefix works]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:18:07] <icinga-wm>	 PROBLEM - Recursive DNS on 103.102.166.8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[20:18:40] <logmsgbot>	 !log jdrewniak@deploy2002 tgr and matmarex and jdrewniak: Continuing with sync
[20:19:09] <jinxer-wm>	 (PHPFPMTooBusy) resolved: (2) Not enough idle PHP-FPM workers for Mediawiki mw-web at codfw: 47.22% idle - https://bit.ly/wmf-fpmsat  - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[20:23:56] <logmsgbot>	 !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:967208|Update comment about EditAttemptStep instruments]], [[gerrit:967394|CentralAuth: Clarify why we don't use second-level domain for some wikis (T257852)]], [[gerrit:967973|Remove unused VisualEditor config settings (T344757 T344759)]], [[gerrit:967995|[noop] Explain more thoroughly how the '-' prefix works]] (duration: 07m 21s)
[20:24:10] <stashbot>	 T257852: CentralAuth edge login and autologin for some Wikimedia domains broken on mobile - https://phabricator.wikimedia.org/T257852
[20:24:11] <stashbot>	 T344759: Remove VisualEditorTransitionDefault config and AutodisableVisualEditorPref maint script - https://phabricator.wikimedia.org/T344759
[20:24:11] <stashbot>	 T344757: Remove the BetaFeatures integration in VisualEditor - https://phabricator.wikimedia.org/T344757
[20:24:25] * jan_drewniak MatmaRex: done!
[20:24:29] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job pdnsrec in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:24:36] <MatmaRex>	 thanks jan_drewniak
[20:25:09] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968328 (https://phabricator.wikimedia.org/T349232) (owner: 10Jdrewniak)
[20:25:14] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/968311 (https://phabricator.wikimedia.org/T349232) (owner: 10Jdrewniak)
[20:25:25] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968312 (https://phabricator.wikimedia.org/T349232) (owner: 10Jdrewniak)
[20:25:59] <wikibugs>	 (03PS2) 10Jdrewniak: Enable Vector readability survey on select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968328 (https://phabricator.wikimedia.org/T349232)
[20:26:51] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968328 (https://phabricator.wikimedia.org/T349232) (owner: 10Jdrewniak)
[20:26:57] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/968311 (https://phabricator.wikimedia.org/T349232) (owner: 10Jdrewniak)
[20:27:05] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [skins/Vector] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968312 (https://phabricator.wikimedia.org/T349232) (owner: 10Jdrewniak)
[20:27:15] <wikibugs>	 (03Merged) 10jenkins-bot: Enable Vector readability survey on select wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968328 (https://phabricator.wikimedia.org/T349232) (owner: 10Jdrewniak)
[20:28:02] <cscott>	 (i'm here, btw)
[20:28:40] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:30:07] <icinga-wm>	 PROBLEM - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is CRITICAL: DNS_QUERY CRITICAL - query timed out https://wikitech.wikimedia.org/wiki/DNS
[20:38:40] <jinxer-wm>	 (JobUnavailable) firing: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:38:47] <wikibugs>	 (03Merged) 10jenkins-bot: Follow-up to 74b5834: Add language prefix to Readability survey [skins/Vector] (wmf/1.42.0-wmf.1) - 10https://gerrit.wikimedia.org/r/968311 (https://phabricator.wikimedia.org/T349232) (owner: 10Jdrewniak)
[20:38:49] <icinga-wm>	 RECOVERY - Recursive DNS on 2001:df2:e500:1:103:102:166:8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[20:38:55] <icinga-wm>	 RECOVERY - Recursive DNS on 103.102.166.8 is OK: DNS_QUERY OK - Success https://wikitech.wikimedia.org/wiki/DNS
[20:39:22] <jan_drewniak>	 hey cscott: merge is going sloowly, 3min eta on my patches...
[20:39:30] <jinxer-wm>	 (JobUnavailable) resolved: (2) Reduced availability for job haproxy in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[20:42:22] <wikibugs>	 (03Merged) 10jenkins-bot: Follow-up to 74b5834: Add language prefix to Readability survey [skins/Vector] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968312 (https://phabricator.wikimedia.org/T349232) (owner: 10Jdrewniak)
[20:42:42] <JD|cloud>	 there seems to be a ton of lag
[20:42:46] <logmsgbot>	 !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:968328|Enable Vector readability survey on select wikis (T349232)]], [[gerrit:968311|Follow-up to 74b5834: Add language prefix to Readability survey (T349232)]], [[gerrit:968312|Follow-up to 74b5834: Add language prefix to Readability survey (T349232)]]
[20:42:51] <JD|cloud>	 are there any known things going on?
[20:42:56] <stashbot>	 T349232: Readability survey should link to language-specific feedback form - https://phabricator.wikimedia.org/T349232
[20:44:01] <cscott>	 jan_drewniak: no worries, i'm patient (and working on other things)
[20:44:08] <logmsgbot>	 !log jdrewniak@deploy2002 jdrewniak: Backport for [[gerrit:968328|Enable Vector readability survey on select wikis (T349232)]], [[gerrit:968311|Follow-up to 74b5834: Add language prefix to Readability survey (T349232)]], [[gerrit:968312|Follow-up to 74b5834: Add language prefix to Readability survey (T349232)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:44:31] <logmsgbot>	 !log jdrewniak@deploy2002 jdrewniak: Continuing with sync
[20:46:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:49:43] <logmsgbot>	 !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:968328|Enable Vector readability survey on select wikis (T349232)]], [[gerrit:968311|Follow-up to 74b5834: Add language prefix to Readability survey (T349232)]], [[gerrit:968312|Follow-up to 74b5834: Add language prefix to Readability survey (T349232)]] (duration: 06m 57s)
[20:49:48] <stashbot>	 T349232: Readability survey should link to language-specific feedback form - https://phabricator.wikimedia.org/T349232
[20:50:12] <jan_drewniak>	 ok cscott: finally your turn
[20:50:17] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980) (owner: 10C. Scott Ananian)
[20:51:02] <wikibugs>	 (03PS4) 10Jdrewniak: Disable Parsoid internal REST API everywhere except on Parsoid cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980) (owner: 10C. Scott Ananian)
[20:51:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[20:51:14] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by jdrewniak@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980) (owner: 10C. Scott Ananian)
[20:51:43] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Disable Parsoid internal REST API everywhere except on Parsoid cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980) (owner: 10C. Scott Ananian)
[20:52:03] <wikibugs>	 (03Merged) 10jenkins-bot: Disable Parsoid internal REST API everywhere except on Parsoid cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980) (owner: 10C. Scott Ananian)
[20:52:09] <wikibugs>	 (03CR) 10Jdrewniak: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/965608 (https://phabricator.wikimedia.org/T334980) (owner: 10C. Scott Ananian)
[20:53:01] <jan_drewniak>	 cscott: shoot looks like I'm getting a CI error...
[20:53:16] <jan_drewniak>	 oh never mind
[20:53:33] <logmsgbot>	 !log jdrewniak@deploy2002 Started scap: Backport for [[gerrit:965608|Disable Parsoid internal REST API everywhere except on Parsoid cluster (T334980)]]
[20:53:40] <stashbot>	 T334980: Run visual diff testing without RL and other hacks to compare Parsoid rendering against legacy parser rendering - https://phabricator.wikimedia.org/T334980
[20:54:54] <logmsgbot>	 !log jdrewniak@deploy2002 jdrewniak and cscott: Backport for [[gerrit:965608|Disable Parsoid internal REST API everywhere except on Parsoid cluster (T334980)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:55:09] <jan_drewniak>	 cscott: does anything need to be checked for this patch?
[20:55:54] <jan_drewniak>	 it's on mwdebug
[20:56:11] <cscott>	 i can quickly verify that the parsoid api isn't present on mwdebug, hang on
[21:00:10] <cscott>	 jan_drewniak: looks good, go ahead
[21:00:46] <logmsgbot>	 !log jdrewniak@deploy2002 jdrewniak and cscott: Continuing with sync
[21:03:47] <wikibugs>	 (03PS1) 10Ssingh: wikimedia.org: add verification for Jamf [dns] - 10https://gerrit.wikimedia.org/r/968354 (https://phabricator.wikimedia.org/T349665)
[21:05:03] <icinga-wm>	 PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:05:04] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[21:05:21] <wikibugs>	 (03PS1) 10Jdlrobson: [Visual change] Normalize small font sizes in Vector 2022 [skins/Vector] (wmf/1.42.0-wmf.2) - 10https://gerrit.wikimedia.org/r/968314 (https://phabricator.wikimedia.org/T346062)
[21:05:49] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:06:12] <logmsgbot>	 !log jdrewniak@deploy2002 Finished scap: Backport for [[gerrit:965608|Disable Parsoid internal REST API everywhere except on Parsoid cluster (T334980)]] (duration: 12m 39s)
[21:06:18] <stashbot>	 T334980: Run visual diff testing without RL and other hacks to compare Parsoid rendering against legacy parser rendering - https://phabricator.wikimedia.org/T334980
[21:06:41] <icinga-wm>	 RECOVERY - BFD status on cr3-eqsin is OK: UP: 12 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[21:06:54] <jinxer-wm>	 (KeyholderUnarmed) firing: 19 unarmed Keyholder key(s) on deploy1002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[21:07:26] <wikibugs>	 (03CR) 10Cwhite: "Change overall looks good.  Inline is an idea for your consideration." [puppet] - 10https://gerrit.wikimedia.org/r/966645 (https://phabricator.wikimedia.org/T332672) (owner: 10Hashar)
[21:07:31] <icinga-wm>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[21:07:45] <wikibugs>	 (03CR) 10BCornwall: [C: 03+1] "Ew" [dns] - 10https://gerrit.wikimedia.org/r/968354 (https://phabricator.wikimedia.org/T349665) (owner: 10Ssingh)
[21:08:28] <logmsgbot>	 !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dns5004.wikimedia.org with OS bookworm
[21:08:37] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host dns5004.wikimedia.org with OS bookworm completed: - dns5004 (**PASS**)   - Downtimed on Icinga/Al...
[21:09:08] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] wikimedia.org: add verification for Jamf [dns] - 10https://gerrit.wikimedia.org/r/968354 (https://phabricator.wikimedia.org/T349665) (owner: 10Ssingh)
[21:09:30] <sukhe>	 !log running authdns-update for CR 968354
[21:09:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:11:25] <wikibugs>	 10SRE, 10DNS, 10Traffic: Update DNS for Jamf account SSO - https://phabricator.wikimedia.org/T349665 (10ssingh) 05Open→03Resolved a:03ssingh To reduce the chances of error and for future requests, please copy-paste the requested record in the task (so that it is text) in addition to the screenshot for...
[21:11:39] <wikibugs>	 (03PS1) 10BCornwall: Revert "hiera: remove dns5004 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/968315
[21:13:10] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] Revert "hiera: remove dns5004 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/968315 (owner: 10BCornwall)
[21:14:16] <wikibugs>	 (03CR) 10BCornwall: [C: 03+2] Revert "hiera: remove dns5004 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/968315 (owner: 10BCornwall)
[21:16:35] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bookworm - https://phabricator.wikimedia.org/T342154 (10BCornwall)
[21:37:56] <wikibugs>	 (03PS2) 10Fabfur: hiera: added new cp hosts in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/968301 (https://phabricator.wikimedia.org/T349244)
[21:38:44] <wikibugs>	 (03CR) 10Fabfur: hiera: added new cp hosts in eqiad (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/968301 (https://phabricator.wikimedia.org/T349244) (owner: 10Fabfur)
[21:45:14] <wikibugs>	 (03PS1) 10Ebernhardson: search updater: Update container image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/968358
[21:48:04] <wikibugs>	 (03CR) 10Cwhite: "Is there a task to go along with this for discussion?" [puppet] - 10https://gerrit.wikimedia.org/r/968250 (owner: 10Jbond)
[21:48:32] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] alertmanager: ship amtool.yml for AM api access [puppet] - 10https://gerrit.wikimedia.org/r/968232 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi)
[21:48:46] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] alertmanager: let karma use apache to access AM [puppet] - 10https://gerrit.wikimedia.org/r/968119 (https://phabricator.wikimedia.org/T321579) (owner: 10Filippo Giunchedi)
[21:56:01] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] search updater: Update container image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/968358 (owner: 10Ebernhardson)
[21:56:51] <wikibugs>	 (03Merged) 10jenkins-bot: search updater: Update container image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/968358 (owner: 10Ebernhardson)
[21:58:35] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[21:58:44] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:00:46] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Q1:Install cp11[00-15] and rotate into production - https://phabricator.wikimedia.org/T349244 (10Fabfur)
[22:02:07] <icinga-wm>	 RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:06:06] <wikibugs>	 (03PS1) 10EoghanGaffney: [systemd/timer] Add optional SuccessExitStatus argument to timer services [puppet] - 10https://gerrit.wikimedia.org/r/968360 (https://phabricator.wikimedia.org/T349166)
[22:06:08] <wikibugs>	 (03PS1) 10EoghanGaffney: [quickdatacopy] Add success_exit_status option to rsync::quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/968361 (https://phabricator.wikimedia.org/T349166)
[22:09:51] <icinga-wm>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:15:24] <wikibugs>	 (03PS3) 10Jforrester: [wikifunctions] Alter site to General Availability [mediawiki-config] - 10https://gerrit.wikimedia.org/r/966570 (https://phabricator.wikimedia.org/T349054)
[22:15:26] <wikibugs>	 (03PS1) 10Jforrester: [wikifunctions] Allow logged-out users to run approved functions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/968362 (https://phabricator.wikimedia.org/T349055)
[22:20:16] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] trove-guestagent: include service credentials [puppet] - 10https://gerrit.wikimedia.org/r/968299 (https://phabricator.wikimedia.org/T349651) (owner: 10Andrew Bogott)
[22:23:33] <wikibugs>	 (03PS1) 10Andrew Bogott: Trove: allow backups in policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/968363 (https://phabricator.wikimedia.org/T349651)
[22:24:19] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Trove: allow backups in policy.yaml [puppet] - 10https://gerrit.wikimedia.org/r/968363 (https://phabricator.wikimedia.org/T349651) (owner: 10Andrew Bogott)
[22:50:23] <wikibugs>	 (03PS13) 10Andrea Denisse: prometheus: Fail Puppet execution for unlisted sites [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448)
[22:51:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] prometheus: Fail Puppet execution for unlisted sites [puppet] - 10https://gerrit.wikimedia.org/r/965561 (https://phabricator.wikimedia.org/T336448) (owner: 10Andrea Denisse)
[22:53:18] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for puppetserver1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[22:54:27] <icinga-wm>	 PROBLEM - Disk space on Hadoop worker on analytics1075 is CRITICAL: DISK CRITICAL - free space: /var/lib/hadoop/data/l 16 GB (0% inode=99%): https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration
[23:11:55] <icinga-wm>	 PROBLEM - Kafka MirrorMaker main-codfw_to_main-eqiad max lag in last 10 minutes on alert1001 is CRITICAL: 1.001e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=eqiad+prometheus/ops&var-lag_datasource=codfw+prometheus/ops&var-mirror_name=main-codfw_to_main-eqiad
[23:40:21] <icinga-wm>	 RECOVERY - Disk space on Hadoop worker on analytics1075 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration