[00:00:02] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1037498 (owner: TrainBranchBot) [00:01:48] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:01:55] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:03:57] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:04:00] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:04:05] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:04:10] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:05:44] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:05:47] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:06:27] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:06:33] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:09:49] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:09:55] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:11:51] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:11:56] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:13:43] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:13:49] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:16:56] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:17:02] !log @deploy1002 helmfile [codfw] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [00:18:43] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:18:58] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:19:04] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:20:30] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:20:34] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:21:01] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:21:07] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:22:28] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:22:31] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:23:52] (PS1) Scott French: eventstreams: adopt base.external-services-networkpolicy [deployment-charts] - https://gerrit.wikimedia.org/r/1037870 (https://phabricator.wikimedia.org/T359423) [00:24:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 851.8ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:24:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T352010)', diff saved to https://phabricator.wikimedia.org/P63804 and previous config saved to 
/var/cache/conftool/dbconfig/20240601-002435-ladsgroup.json [00:24:40] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [00:25:04] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:25:10] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:27:47] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:27:53] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:29:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 824.9ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:30:00] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:30:06] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:30:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 896.7ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:34:30] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid (k8s) 901.4ms - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [00:38:23] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:38:28] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:39:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P63805 and previous config saved to /var/cache/conftool/dbconfig/20240601-003943-ladsgroup.json [00:45:26] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:45:33] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:45:45] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:45:48] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:47:39] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:47:41] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:47:45] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:47:45] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:49:42] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:49:49] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:50:08] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:50:12] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:51:55] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:52:00] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:52:26] !log @deploy1002 helmfile [eqiad] 
START helmfile.d/services/cirrus-streaming-updater: apply [00:52:29] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:53:56] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:54:03] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:54:13] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:54:16] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:54:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P63806 and previous config saved to /var/cache/conftool/dbconfig/20240601-005451-ladsgroup.json [00:55:59] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:56:05] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:56:49] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [00:56:52] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [00:58:02] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [00:58:08] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:00:05] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:00:11] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:02:07] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:02:13] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:04:10] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:04:15] !log @deploy1002 helmfile [codfw] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [01:06:12] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:06:19] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:08:16] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:08:21] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:10:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T352010)', diff saved to https://phabricator.wikimedia.org/P63807 and previous config saved to /var/cache/conftool/dbconfig/20240601-010959-ladsgroup.json [01:10:03] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [01:10:08] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:10:15] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:12:11] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:12:17] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:14:13] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:14:19] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:16:16] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:16:22] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:18:19] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:18:24] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:19:19] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:19:22] !log @deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [01:20:21] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:20:27] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:21:36] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:21:40] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:22:24] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:22:30] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:23:43] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:23:46] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:24:27] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:24:33] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:25:07] SRE, SRE-swift-storage, Data-Persistence, Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9851982 (Ganesha811) Thank you all for putting this on the list for this year and making it a WMF priority! We appreciate your hard work at en-wiki! 
[01:26:40] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:26:45] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:27:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T364299)', diff saved to https://phabricator.wikimedia.org/P63808 and previous config saved to /var/cache/conftool/dbconfig/20240601-012708-marostegui.json [01:27:12] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [01:28:09] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:28:13] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:28:32] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:28:39] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:30:07] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:30:10] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:30:36] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:30:42] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:32:38] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:32:43] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:36:20] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:36:27] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:40:04] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:40:10] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:41:34] !log 
@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [01:41:37] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:42:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P63809 and previous config saved to /var/cache/conftool/dbconfig/20240601-014216-marostegui.json [01:43:06] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:43:11] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:45:08] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:45:15] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:47:01] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:47:07] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:49:04] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:49:09] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:51:06] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:51:13] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:52:59] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:53:04] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:55:10] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:55:16] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:55:44] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:57:13] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:57:19] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [01:57:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228', diff saved to https://phabricator.wikimedia.org/P63810 and previous config saved to /var/cache/conftool/dbconfig/20240601-015725-marostegui.json [01:59:16] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [01:59:22] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:01:08] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:01:14] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:03:11] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:03:18] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:11:01] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:11:04] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:12:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1228 (T364299)', diff saved to https://phabricator.wikimedia.org/P63811 and previous config saved to /var/cache/conftool/dbconfig/20240601-021233-marostegui.json [02:12:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1232.eqiad.wmnet with reason: Maintenance [02:12:36] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [02:12:49] !log marostegui@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 6:00:00 on db1232.eqiad.wmnet with reason: Maintenance [02:12:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1232 (T364299)', diff saved to https://phabricator.wikimedia.org/P63812 and previous config saved to /var/cache/conftool/dbconfig/20240601-021256-marostegui.json [02:13:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [02:13:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [02:14:14] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:14:20] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:16:37] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:16:43] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:18:39] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:18:44] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:20:44] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:21:21] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:21:27] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:23:24] !log @deploy1002 helmfile [codfw] START 
helmfile.d/services/cirrus-streaming-updater: apply [02:23:30] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:25:27] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:25:34] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:27:30] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:27:36] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:28:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [02:28:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [02:29:33] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:29:39] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:31:35] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:31:40] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:33:37] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:33:43] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:33:43] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:35:40] !log @deploy1002 helmfile [codfw] 
START helmfile.d/services/cirrus-streaming-updater: apply [02:35:47] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:37:43] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:37:50] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:38:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:43:47] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:43:52] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:47:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [02:47:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [02:48:09] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:48:16] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:50:13] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:50:19] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:52:16] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [02:52:22] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:52:45] RESOLVED: 
CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [02:52:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [02:53:43] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:57:08] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:57:11] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [02:58:43] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:59:36] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [02:59:40] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:02:29] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:02:34] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:03:43] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:04:32] !log @deploy1002 helmfile [codfw] START 
helmfile.d/services/cirrus-streaming-updater: apply [03:04:34] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:04:37] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:04:38] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:06:35] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:06:41] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:08:37] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:08:43] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:10:40] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:10:45] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:12:52] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:12:59] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:17:36] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:17:42] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:19:39] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:19:44] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:23:41] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:23:47] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:24:41] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:24:44] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply 
[03:25:44] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:25:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [03:25:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [03:25:50] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:27:08] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:27:11] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:27:46] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:27:53] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:29:05] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [03:29:08] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:29:49] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:29:55] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:30:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... 
[03:30:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [03:31:51] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:31:58] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:33:55] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:34:00] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:35:57] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:36:03] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:36:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [03:39:59] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:40:04] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:41:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [03:42:01] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:42:08] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:44:04] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:44:11] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:46:07] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:46:14] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:47:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [03:47:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [03:48:10] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:48:15] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:53:32] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:53:38] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:55:36] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:55:41] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [03:57:27] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:57:33] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply 
[03:59:30] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [03:59:35] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:02:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [04:03:02] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [04:03:08] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:12:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [04:16:25] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [04:16:31] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:18:28] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [04:18:33] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [04:22:45] RESOLVED: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [04:31:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... 
[04:31:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [04:36:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [04:36:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [04:39:22] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [04:39:25] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:01:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [05:03:01] PROBLEM - Host dbstore1009 is DOWN: PING CRITICAL - Packet loss = 100% [05:06:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [05:07:15] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [05:12:00] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [05:13:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... 
[05:13:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [05:14:10] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [05:14:16] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:17:19] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:17:23] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:22:21] RECOVERY - Host dbstore1009 is UP: PING OK - Packet loss = 0%, RTA = 0.19 ms [05:29:17] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:29:20] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:31:13] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:31:17] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:33:00] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:33:03] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:38:43] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:39:57] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:40:00] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:42:44] !log @deploy1002 
helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:42:47] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:44:40] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:44:43] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [05:48:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [05:48:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [05:50:57] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [05:51:01] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:07:23] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:07:27] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues 
[06:31:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T364299)', diff saved to https://phabricator.wikimedia.org/P63813 and previous config saved to /var/cache/conftool/dbconfig/20240601-063135-marostegui.json [06:31:38] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [06:46:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P63814 and previous config saved to /var/cache/conftool/dbconfig/20240601-064643-marostegui.json [06:57:00] (PS1) Alexandros Kosiaris: preseed: Remove kafka-main1010 exception [puppet] - https://gerrit.wikimedia.org/r/1037879 (https://phabricator.wikimedia.org/T363212) [06:57:24] (CR) Alexandros Kosiaris: [V:+2 C:+2] preseed: Remove kafka-main1010 exception [puppet] - https://gerrit.wikimedia.org/r/1037879 (https://phabricator.wikimedia.org/T363212) (owner: Alexandros Kosiaris) [06:57:41] PROBLEM - SSH on dbstore1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [06:58:00] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [06:58:04] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [06:59:43] !log akosiaris@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1010.eqiad.wmnet with OS bullseye [06:59:57] ops-eqiad, SRE, DC-Ops, serviceops, Patch-For-Review: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9852119 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by akosiaris@cumin1002 for host kafka-main1010.eqiad.wmn...
[07:01:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance [07:01:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232', diff saved to https://phabricator.wikimedia.org/P63815 and previous config saved to /var/cache/conftool/dbconfig/20240601-070151-marostegui.json [07:02:03] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2206.codfw.wmnet with reason: Maintenance [07:02:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2206 (T364069)', diff saved to https://phabricator.wikimedia.org/P63816 and previous config saved to /var/cache/conftool/dbconfig/20240601-070211-marostegui.json [07:02:14] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [07:12:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [07:12:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [07:17:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1232 (T364299)', diff saved to https://phabricator.wikimedia.org/P63817 and previous config saved to /var/cache/conftool/dbconfig/20240601-071700-marostegui.json [07:17:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1234.eqiad.wmnet with reason: Maintenance [07:17:04] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [07:17:10] !log akosiaris@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-main1010.eqiad.wmnet with reason: host reimage [07:17:16] !log marostegui@cumin1002 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1234.eqiad.wmnet with reason: Maintenance [07:17:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1234 (T364299)', diff saved to https://phabricator.wikimedia.org/P63818 and previous config saved to /var/cache/conftool/dbconfig/20240601-071723-marostegui.json [07:17:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [07:17:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [07:17:57] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:18:01] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:19:54] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:19:58] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:20:32] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-main1010.eqiad.wmnet with reason: host reimage [07:21:51] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:21:55] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:23:58] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:24:02] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:29:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - 
https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [07:34:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [07:36:06] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:36:09] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:36:44] !log akosiaris@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - akosiaris@cumin1002" [07:36:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... 
[07:36:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [07:38:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [07:41:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [07:41:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [07:43:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [07:46:33] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:46:36] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:47:45] FIRING: CirrusConsumerFetchErrorRate: 
cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [07:48:19] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:48:23] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:50:16] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:50:20] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:52:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [07:52:53] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:52:57] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:56:13] PROBLEM - MariaDB Replica Lag: x1 on dbstore1009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1025.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:56:13] PROBLEM - MariaDB Replica Lag: s8 on dbstore1009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 3412.26 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [07:56:20] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [07:56:23] !log 
@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [07:56:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [07:56:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [08:01:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [08:06:45] FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [08:07:42] SRE, cloud-services-team, Infrastructure-Foundations, netops, Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9852135 (Aklapper) Stalled→Open Subtask resolved thus reopening [08:16:45] FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [08:21:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error (rerenders) rate too high - 
https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [08:26:43] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [08:26:49] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:29:48] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:29:51] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:31:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [08:32:26] SRE, Wikimedia-Mailing-lists: Create an 'Universal Code of Conduct Coordinating Committee (U4C)' mailing list - https://phabricator.wikimedia.org/T366401 (Superpes15) NEW [08:34:45] SRE, Wikimedia-Mailing-lists: Create an 'Universal Code of Conduct Coordinating Committee (U4C)' mailing list - https://phabricator.wikimedia.org/T366401#9852169 (Superpes15) [08:39:45] FIRING: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search-backfill is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [08:41:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [08:51:04] !log @deploy1002 helmfile [eqiad] START 
helmfile.d/services/cirrus-streaming-updater: apply [08:51:08] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:52:51] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:52:55] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:54:58] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:55:02] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:56:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): ... [08:56:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [08:56:45] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:56:49] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [08:58:32] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [08:58:35] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:01:29] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:01:32] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:03:26] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:03:30] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:05:24] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:05:27] !log @deploy1002 
helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:11:30] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:11:33] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:15:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... [09:15:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [09:16:56] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:17:00] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:20:17] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:20:21] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:33:54] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:33:57] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:40:44] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:40:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... 
[09:40:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [09:45:21] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:45:24] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:55:06] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [09:55:20] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [09:55:22] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:55:37] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:55:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T352010)', diff saved to https://phabricator.wikimedia.org/P63819 and previous config saved to /var/cache/conftool/dbconfig/20240601-095545-ladsgroup.json [09:55:48] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [10:02:35] SRE, Wikimedia-Mailing-lists: Create an 'Universal Code of Conduct Coordinating Committee (U4C)' mailing list - https://phabricator.wikimedia.org/T366401#9852198 (Ladsgroup) Open→Resolved a:Ladsgroup https://lists.wikimedia.org/postorius/lists/u4c.lists.wikimedia.org/ [10:05:18] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:05:21] !log @deploy1002 helmfile 
[eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:07:45] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:07:48] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:09:42] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:09:46] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:11:50] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:11:53] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:15:06] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:15:09] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:16:53] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:16:56] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:19:00] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:19:03] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:19:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... 
[10:19:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [10:21:57] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:22:00] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:23:53] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [10:23:56] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [10:32:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [10:37:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [10:42:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - 
https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [10:47:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate [11:00:50] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:00:53] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:04:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_search_eqiad in eqiad (k8s): ... 
[11:04:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [11:04:50] RESOLVED: CirrusStreamingUpdaterRateTooLow: CirrusSearch update rate from flink-app-consumer-search-backfill is critically low - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/jKqki4MSk/cirrus-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterRateTooLow [11:07:57] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:08:00] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:08:25] !log @deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [11:08:31] !log @deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:09:54] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:09:57] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:23:31] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:23:34] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:25:44] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:28:27] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:28:31] !log @deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [11:30:14] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:30:18] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:34:11] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:34:15] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:36:28] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:36:31] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:38:25] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:38:28] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:46:52] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:46:55] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:48:49] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:48:51] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [11:50:45] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [11:50:48] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:00:22] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:00:25] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:02:08] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:02:12] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:04:26] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply 
[12:04:29] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:06:22] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:06:25] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:06:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T364299)', diff saved to https://phabricator.wikimedia.org/P63820 and previous config saved to /var/cache/conftool/dbconfig/20240601-120628-marostegui.json [12:06:31] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [12:08:39] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:08:42] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:10:56] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:10:59] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:12:53] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:12:57] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:18:10] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:18:14] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:19:57] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:20:01] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:21:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P63821 and previous config saved to /var/cache/conftool/dbconfig/20240601-122136-marostegui.json [12:25:54] !log @deploy1002 helmfile [eqiad] START 
helmfile.d/services/cirrus-streaming-updater: apply [12:25:58] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:27:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [12:27:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [12:27:51] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:27:55] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:29:59] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:30:02] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:32:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[12:32:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [12:36:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234', diff saved to https://phabricator.wikimedia.org/P63822 and previous config saved to /var/cache/conftool/dbconfig/20240601-123644-marostegui.json [12:43:30] FIRING: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:44:36] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:44:39] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:48:30] RESOLVED: [2x] ProbeDown: Service wdqs1015:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1015:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:50:23] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:50:26] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:51:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1234 (T364299)', diff saved to https://phabricator.wikimedia.org/P63823 and previous config saved to /var/cache/conftool/dbconfig/20240601-125152-marostegui.json [12:51:55] !log marostegui@cumin1002 START 
- Cookbook sre.hosts.downtime for 6:00:00 on db1235.eqiad.wmnet with reason: Maintenance [12:51:55] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [12:52:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1235.eqiad.wmnet with reason: Maintenance [12:52:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1235 (T364299)', diff saved to https://phabricator.wikimedia.org/P63824 and previous config saved to /var/cache/conftool/dbconfig/20240601-125216-marostegui.json [12:52:19] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [12:52:23] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [12:55:02] (PS1) GergesShamon: [arwiki] add ipblock-exempt to bot group [mediawiki-config] - https://gerrit.wikimedia.org/r/1037887 [12:58:01] (PS2) GergesShamon: [arwiki] add ipblock-exempt to bot group [mediawiki-config] - https://gerrit.wikimedia.org/r/1037887 (https://phabricator.wikimedia.org/T366404) [13:12:13] PROBLEM - MariaDB Replica Lag: s6 on dbstore1009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 635.23 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:14:13] (PS1) David Caro: toolforge,prometheus: renew certificate [puppet] - https://gerrit.wikimedia.org/r/1037888 (https://phabricator.wikimedia.org/T309782) [13:14:40] (CR) David Caro: [C:+2] toolforge,prometheus: renew certificate [puppet] - https://gerrit.wikimedia.org/r/1037888 (https://phabricator.wikimedia.org/T309782) (owner: David Caro) [13:16:07] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [13:16:10] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:39:29] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data:
"Triggered by cookbooks.sre.hosts.reimage: Host reimage - akosiaris@cumin1002" [13:39:35] !log akosiaris@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-main1010.eqiad.wmnet with OS bullseye [13:39:42] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9852270 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by akosiaris@cumin1002 for host kafka-main1010.eqiad.wmnet with OS bullseye comple... [13:40:44] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:42:43] ops-eqiad, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9852271 (akosiaris) Open→Resolved kafka-main1010, after 2 rounds of imaging (1 with the normal recipe and 1 with the reuse recipe) imaged succes...
[14:18:03] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:18:06] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:20:30] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:20:34] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:22:17] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:22:21] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:24:34] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:24:37] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:38:43] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:44:47] PROBLEM - Host an-worker1168 is DOWN: PING CRITICAL - Packet loss = 100% [14:49:11] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [14:49:14] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:50:17] RECOVERY - Host an-worker1168 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [14:55:44] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:01:18] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:01:22] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:03:26] !log 
@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:03:28] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:09:22] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:09:26] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:11:20] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:11:22] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:13:16] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:13:19] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:16:23] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:16:26] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:18:19] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:18:22] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:20:16] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:20:20] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:22:13] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:22:17] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:24:20] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:24:24] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:25:44] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:26:17] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:26:21] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:28:14] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:28:18] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:30:51] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:30:54] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [15:53:05] SRE-swift-storage, Thumbor: Outdated thumbnails for djvu file on Commons cannot be purged and do not update - https://phabricator.wikimedia.org/T206190#9852311 (Soda) @Ankry I see the code snippet is still there in plwiki's common.js, is it still required ?
[15:53:18] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [15:53:22] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:02:31] RECOVERY - SSH on dbstore1009 is OK: SSH OK - OpenSSH_9.2p1 Debian-2+deb12u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:03:15] PROBLEM - MariaDB Replica Lag: s8 on dbstore1009 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 31681.86 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:05:13] RECOVERY - MariaDB Replica Lag: x1 on dbstore1009 is OK: OK slave_sql_lag Replication lag: 0.27 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:07:15] RECOVERY - MariaDB Replica Lag: s6 on dbstore1009 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:09:56] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:09:59] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:12:13] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:12:16] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:14:00] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:14:03] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:15:57] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:15:59] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:16:15] RECOVERY - MariaDB Replica Lag: s8 on dbstore1009 is OK: OK slave_sql_lag Replication lag: 0.15 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:17:53] !log 
@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:17:56] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:39:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T364069)', diff saved to https://phabricator.wikimedia.org/P63825 and previous config saved to /var/cache/conftool/dbconfig/20240601-163907-marostegui.json [16:39:11] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [16:54:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P63826 and previous config saved to /var/cache/conftool/dbconfig/20240601-165416-marostegui.json [16:56:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T364299)', diff saved to https://phabricator.wikimedia.org/P63827 and previous config saved to /var/cache/conftool/dbconfig/20240601-165609-marostegui.json [16:56:12] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [17:09:20] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:09:23] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:09:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206', diff saved to https://phabricator.wikimedia.org/P63828 and previous config saved to /var/cache/conftool/dbconfig/20240601-170924-marostegui.json [17:11:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P63829 and previous config saved to /var/cache/conftool/dbconfig/20240601-171116-marostegui.json [17:13:27] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:13:29] !log @deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/cirrus-streaming-updater: apply [17:23:43] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:24:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2206 (T364069)', diff saved to https://phabricator.wikimedia.org/P63830 and previous config saved to /var/cache/conftool/dbconfig/20240601-172432-marostegui.json [17:24:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance [17:24:35] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [17:24:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2210.codfw.wmnet with reason: Maintenance [17:24:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2210 (T364069)', diff saved to https://phabricator.wikimedia.org/P63831 and previous config saved to /var/cache/conftool/dbconfig/20240601-172455-marostegui.json [17:26:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235', diff saved to https://phabricator.wikimedia.org/P63832 and previous config saved to /var/cache/conftool/dbconfig/20240601-172625-marostegui.json [17:40:44] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:41:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1235 (T364299)', diff saved to https://phabricator.wikimedia.org/P63833 and previous config saved to /var/cache/conftool/dbconfig/20240601-174133-marostegui.json 
[17:41:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance [17:41:36] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [17:41:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1239.eqiad.wmnet with reason: Maintenance [17:41:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance [17:42:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1240.eqiad.wmnet with reason: Maintenance [17:42:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [17:42:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [17:42:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2102.codfw.wmnet with reason: Maintenance [17:42:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2102.codfw.wmnet with reason: Maintenance [18:24:10] (PS1) GergesShamon: [trwiki] Create translator group [mediawiki-config] - https://gerrit.wikimedia.org/r/1037896 [18:24:45] (CR) CI reject: [V:-1] [trwiki] Create translator group [mediawiki-config] - https://gerrit.wikimedia.org/r/1037896 (owner: GergesShamon) [18:27:55] (PS2) GergesShamon: [trwiki] Create translator group [mediawiki-config] - https://gerrit.wikimedia.org/r/1037896 [18:29:52] (PS3) GergesShamon: [trwiki] Create translator group [mediawiki-config] - https://gerrit.wikimedia.org/r/1037896 [18:35:44] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:36:37] (PS4) GergesShamon: [trwiki] Create translator group [mediawiki-config] - https://gerrit.wikimedia.org/r/1037896 (https://phabricator.wikimedia.org/T356440) [18:52:19] (PS1) GergesShamon: [trwiki] Allow translator group to publish translation only in Extension:ContentTranslation [mediawiki-config] - https://gerrit.wikimedia.org/r/1037897 (https://phabricator.wikimedia.org/T356440) [18:57:50] (PS2) GergesShamon: [trwiki] Allow translator group to publish translation only in Extension:ContentTranslation [mediawiki-config] - https://gerrit.wikimedia.org/r/1037897 (https://phabricator.wikimedia.org/T356440) [19:01:03] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:01:06] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:25:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T352010)', diff saved to https://phabricator.wikimedia.org/P63834 and previous config saved to /var/cache/conftool/dbconfig/20240601-192505-ladsgroup.json [19:25:08] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:27:00] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:27:04] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:33:43] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:36:39] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:36:42] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:38:36] !log @deploy1002 helmfile [eqiad] START
helmfile.d/services/cirrus-streaming-updater: apply [19:38:39] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:40:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P63835 and previous config saved to /var/cache/conftool/dbconfig/20240601-194013-ladsgroup.json [19:40:33] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:40:37] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:42:20] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:42:23] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:55:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P63836 and previous config saved to /var/cache/conftool/dbconfig/20240601-195521-ladsgroup.json [19:56:25] SRE, SRE-swift-storage, Traffic, MediaWiki-Platform-Team (Radar), Performance Issue: Automatically clean up unused thumbnails in Swift - https://phabricator.wikimedia.org/T211661#9852446 (WhatamIdoing) This feels like it's gotten stalled. Is there any little thing that we can do to move this...
[20:10:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T352010)', diff saved to https://phabricator.wikimedia.org/P63837 and previous config saved to /var/cache/conftool/dbconfig/20240601-201029-ladsgroup.json [20:10:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [20:10:33] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:10:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1162.eqiad.wmnet with reason: Maintenance [20:10:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1162 (T352010)', diff saved to https://phabricator.wikimedia.org/P63838 and previous config saved to /var/cache/conftool/dbconfig/20240601-201053-ladsgroup.json [20:38:43] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:53:54] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:53:58] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:55:51] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:55:54] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:57:48] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:57:51] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [20:59:45] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [20:59:48] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:01:41] !log 
@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:01:45] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:03:49] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:03:52] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:05:46] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:05:49] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:07:43] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:07:46] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:09:30] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:09:33] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:11:36] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:11:40] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:13:33] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:13:36] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:18:00] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:18:04] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:33:43] RESOLVED: SystemdUnitFailed: generate_vrts_aliases.service on mx1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:39:38] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: 
apply [21:39:41] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:40:44] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:40:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 10:00:00 on db2102.codfw.wmnet with reason: Long schema change [21:40:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 10:00:00 on db2102.codfw.wmnet with reason: Long schema change [21:43:05] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:43:08] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:47:02] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:47:05] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:48:59] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:49:02] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:50:56] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:51:00] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:55:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2116.codfw.wmnet with reason: Maintenance [21:55:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2116.codfw.wmnet with reason: Maintenance [21:55:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2116 (T364299)', diff saved to https://phabricator.wikimedia.org/P63839 and previous config saved to 
/var/cache/conftool/dbconfig/20240601-215534-marostegui.json [21:55:38] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [21:58:34] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:58:37] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:01:00] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:01:04] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:07:27] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:07:30] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:09:24] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:09:28] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:11:22] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:11:25] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:14:18] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:14:22] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:18:35] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:18:39] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:21:43] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:21:46] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:30:26] (PS1) Pppery: Rescue libphutil translations [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1037902 [22:31:50] (PS2) Pppery: Rescue libphutil translations
[phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1037902 (https://phabricator.wikimedia.org/T366377) [22:32:30] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:32:33] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:34:26] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:34:29] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:02:43] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:02:46] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:10:23] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:10:26] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:12:19] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:12:23] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:14:07] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:14:10] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:16:43] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:16:47] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:23:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[23:23:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [23:28:45] FIRING: [2x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [23:37:00] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:37:04] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:38:10] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1037501 [23:38:11] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1037501 (owner: TrainBranchBot) [23:39:47] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:39:51] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:43:45] FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [23:51:04] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:51:08] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:53:01] !log @deploy1002 helmfile [eqiad]
START helmfile.d/services/cirrus-streaming-updater: apply [23:53:05] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:54:59] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:55:02] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [23:56:46] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [23:56:49] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply