[00:00:07] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1039592 (owner: TrainBranchBot) [00:01:30] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on elastic2100 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:03:10] RESOLVED: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:03:28] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on elastic2083 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:06:14] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on elastic2081 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:09:52] the change from https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1038749 isn't working as hoped upon further testing.
I'm going to roll it back [00:10:30] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on elastic2108 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:10:35] (PS1) TrainBranchBot: Revert "wikitech: Replace OSM class in Gerrit blocking hook" [mediawiki-config] - https://gerrit.wikimedia.org/r/1039866 [00:10:35] (CR) TrainBranchBot: "bd808@deploy1002 created a revert of this change as I22313f028a82175f2a3aa1f2432aca27a71ac1df" [mediawiki-config] - https://gerrit.wikimedia.org/r/1038749 (https://phabricator.wikimedia.org/T161553) (owner: Majavah) [00:10:55] FIRING: [18x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:11:13] (CR) TrainBranchBot: [C:+2] "Approved by bd808@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1039866 (owner: TrainBranchBot) [00:12:04] (Merged) jenkins-bot: Revert "wikitech: Replace OSM class in Gerrit blocking hook" [mediawiki-config] - https://gerrit.wikimedia.org/r/1039866 (owner: TrainBranchBot) [00:12:32] !log bd808@deploy1002 Started scap: Backport for [[gerrit:1039866|Revert "wikitech: Replace OSM class in Gerrit blocking hook"]] [00:13:10] FIRING: [18x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2081:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:13:19] (CR) BryanDavis: "Testing via `mwscript shell --wiki=labswiki` before revert showed:" [mediawiki-config] - https://gerrit.wikimedia.org/r/1039866 (owner: TrainBranchBot)
[00:14:31] (CR) BryanDavis: "Similarly, block and unblock actions against gerrit accounts were having no effect" [mediawiki-config] - https://gerrit.wikimedia.org/r/1039866 (owner: TrainBranchBot) [00:14:55] !log bd808@deploy1002 bd808 and trainbranchbot: Backport for [[gerrit:1039866|Revert "wikitech: Replace OSM class in Gerrit blocking hook"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [00:15:02] !log bd808@deploy1002 bd808 and trainbranchbot: Continuing with sync [00:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:16:04] FIRING: [2x] PuppetDisabled: Puppet disabled on mc1049:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=memcached&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [00:16:10] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on elastic2106 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:20:30] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on elastic2108 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:20:55] RESOLVED: [9x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2086:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:23:10] FIRING: [12x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2084:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:23:57] !log bd808@deploy1002 Finished scap: Backport for [[gerrit:1039866|Revert "wikitech: Replace OSM class in Gerrit blocking hook"]] (duration: 11m 24s) [00:25:55] FIRING: [12x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2084:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:25:56] and the gerrit block hook is working again. 
excellent [00:26:10] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on elastic2106 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:29:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T352010)', diff saved to https://phabricator.wikimedia.org/P64215 and previous config saved to /var/cache/conftool/dbconfig/20240607-002915-ladsgroup.json [00:29:19] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [00:33:10] RESOLVED: [5x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2084:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:36:40] (CR) BryanDavis: [C:-1] "The Depends-On change was reverted" [mediawiki-config] - https://gerrit.wikimedia.org/r/1038750 (https://phabricator.wikimedia.org/T161553) (owner: Majavah) [00:38:12] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on elastic2061 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:38:16] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on elastic2073 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:39:10] FIRING: [14x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:40:55] FIRING: [12x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on 
elastic2061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:44:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P64216 and previous config saved to /var/cache/conftool/dbconfig/20240607-004423-ladsgroup.json [00:48:12] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on elastic2061 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:48:16] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on elastic2073 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:49:10] RESOLVED: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:50:55] FIRING: [16x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:53:20] (PS1) Andrea Denisse: traffic: Route logstash.w.o to logstash.discovery.wmnet [puppet] - https://gerrit.wikimedia.org/r/1039887 (https://phabricator.wikimedia.org/T356386) [00:53:30] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on elastic2075 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:54:10] FIRING: [15x] SystemdUnitFailed: 
prometheus-wmf-elasticsearch-exporter-9200.service on elastic2061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:55:07] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_codfw: codfw cluster restart - ryankemper@cumin2002 - T366555 [00:55:44] (PS2) Andrea Denisse: conftool: Integrate logstash with active-passive configuration [puppet] - https://gerrit.wikimedia.org/r/1039406 (https://phabricator.wikimedia.org/T356386) [00:55:44] (CR) Andrea Denisse: "Hi Filippo, thanks for taking a look. I've sent a new patchset implementing your change along with a couple of patches more." [puppet] - https://gerrit.wikimedia.org/r/1039406 (https://phabricator.wikimedia.org/T356386) (owner: Andrea Denisse) [00:55:55] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin2002 - T366555 [00:56:06] (PS1) Andrea Denisse: discovery: Add metafo entry for logstash [dns] - https://gerrit.wikimedia.org/r/1039882 (https://phabricator.wikimedia.org/T356386) [00:59:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130', diff saved to https://phabricator.wikimedia.org/P64217 and previous config saved to /var/cache/conftool/dbconfig/20240607-005930-ladsgroup.json [01:00:55] RESOLVED: [8x] SystemdUnitFailed: push_cross_cluster_settings_9200.service on elastic2061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:03:30] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on 
elastic2075 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:07:30] FIRING: ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:07:55] FIRING: [12x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1058:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:12:30] RESOLVED: [2x] ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [01:12:55] RESOLVED: [10x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1058:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:14:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2130 (T352010)', diff saved to https://phabricator.wikimedia.org/P64218 and previous config saved to /var/cache/conftool/dbconfig/20240607-011438-ladsgroup.json [01:14:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [01:14:42] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [01:14:54] !log ladsgroup@cumin1002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2141.codfw.wmnet with reason: Maintenance [01:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:22:55] FIRING: [11x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1058:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:25:10] RESOLVED: [11x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1058:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:28:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T352010)', diff saved to https://phabricator.wikimedia.org/P64219 and previous config saved to /var/cache/conftool/dbconfig/20240607-012855-ladsgroup.json [01:28:59] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [01:34:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [01:42:10] FIRING: [11x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on 
elastic1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:44:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P64220 and previous config saved to /var/cache/conftool/dbconfig/20240607-014403-ladsgroup.json [01:44:55] RESOLVED: [6x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:45:44] PROBLEM - Host ps1-22-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [01:45:44] PROBLEM - Host mr1-ulsfo.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:45:46] PROBLEM - Host asw2-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [01:45:48] PROBLEM - Host ps1-23-ulsfo is DOWN: PING CRITICAL - Packet loss = 100% [01:46:28] PROBLEM - OSPF status on cr3-ulsfo is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:46:42] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:49:52] 
PROBLEM - Host mr1-ulsfo IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [01:50:45] FIRING: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:52:28] RECOVERY - Host ps1-23-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 75.27 ms [01:52:28] RECOVERY - Host ps1-22-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 73.16 ms [01:52:30] RECOVERY - OSPF status on cr3-ulsfo is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:52:42] RECOVERY - Host asw2-ulsfo is UP: PING OK - Packet loss = 0%, RTA = 71.82 ms [01:52:42] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [01:54:54] RECOVERY - Host mr1-ulsfo IPv6 is UP: PING OK - Packet loss = 0%, RTA = 71.56 ms [01:55:45] RESOLVED: JobUnavailable: Reduced availability for job pdu_sentry4 in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:55:58] RECOVERY - Host mr1-ulsfo.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 75.06 ms [01:56:19] SRE, Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9869716 (Bodhisattwa) >>! In T365915#9868908, @Ladsgroup wrote: > I talked to some people at community resources and they said this should be first approved by affcom. Sorry. Let me cont... 
[01:56:55] FIRING: [12x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:59:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186', diff saved to https://phabricator.wikimedia.org/P64221 and previous config saved to /var/cache/conftool/dbconfig/20240607-015910-ladsgroup.json [02:01:55] RESOLVED: [12x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1089:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:11:55] FIRING: [12x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:14:10] FIRING: [13x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1090:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:14:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1186 (T352010)', diff saved to https://phabricator.wikimedia.org/P64222 and previous config saved to /var/cache/conftool/dbconfig/20240607-021418-ladsgroup.json [02:14:22] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [02:14:23] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [02:14:35] !log ladsgroup@cumin1002 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1196.eqiad.wmnet with reason: Maintenance [02:14:37] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [02:14:53] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [02:15:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1196 (T352010)', diff saved to https://phabricator.wikimedia.org/P64223 and previous config saved to /var/cache/conftool/dbconfig/20240607-021501-ladsgroup.json [02:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:29:10] FIRING: [13x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1096:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:31:55] FIRING: [7x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1096:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:38:44] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:41:55] FIRING: [13x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:42:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T364069)', diff saved to https://phabricator.wikimedia.org/P64224 and previous config saved to /var/cache/conftool/dbconfig/20240607-024221-marostegui.json [02:42:25] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [02:44:10] FIRING: [13x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1069:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:45:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T352010)', diff saved to https://phabricator.wikimedia.org/P64225 and previous config saved to /var/cache/conftool/dbconfig/20240607-024554-ladsgroup.json [02:46:00] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [02:55:45] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:56:55] FIRING: [13x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1053:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:57:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P64226 and previous config saved to /var/cache/conftool/dbconfig/20240607-025729-marostegui.json [03:01:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P64227 and previous config saved to /var/cache/conftool/dbconfig/20240607-030102-ladsgroup.json [03:09:10] FIRING: [13x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1053:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:11:55] FIRING: [14x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1053:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:12:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P64228 and previous config saved to /var/cache/conftool/dbconfig/20240607-031238-marostegui.json [03:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:16:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P64229 and previous config saved to /var/cache/conftool/dbconfig/20240607-031610-ladsgroup.json 
[03:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:27:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T364069)', diff saved to https://phabricator.wikimedia.org/P64230 and previous config saved to /var/cache/conftool/dbconfig/20240607-032746-marostegui.json [03:27:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1248.eqiad.wmnet with reason: Maintenance [03:27:52] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [03:28:02] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1248.eqiad.wmnet with reason: Maintenance [03:28:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1248 (T364069)', diff saved to https://phabricator.wikimedia.org/P64231 and previous config saved to /var/cache/conftool/dbconfig/20240607-032809-marostegui.json [03:31:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T352010)', diff saved to https://phabricator.wikimedia.org/P64232 and previous config saved to /var/cache/conftool/dbconfig/20240607-033118-ladsgroup.json [03:31:20] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance [03:31:22] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [03:31:33] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1209.eqiad.wmnet with reason: Maintenance [03:31:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1209 (T352010)', diff saved to https://phabricator.wikimedia.org/P64233 and previous config saved to 
/var/cache/conftool/dbconfig/20240607-033141-ladsgroup.json [03:31:55] FIRING: [8x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1055:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:34:10] FIRING: [8x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1055:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:35:07] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin2002 - T366555 [03:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:46:11] ops-ulsfo, DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T366863 (phaultfinder) NEW [03:47:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T352010)', diff saved to https://phabricator.wikimedia.org/P64234 and previous config saved to /var/cache/conftool/dbconfig/20240607-034755-ladsgroup.json [03:47:59] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [03:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:00:58] !log ryankemper@cumin2002 
START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin2002 - T366555 [04:02:43] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin2002 - T366555 [04:03:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P64235 and previous config saved to /var/cache/conftool/dbconfig/20240607-040302-ladsgroup.json [04:03:44] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:15:45] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:16:04] FIRING: [2x] PuppetDisabled: Puppet disabled on mc1049:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=memcached&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [04:18:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P64236 and previous config saved to /var/cache/conftool/dbconfig/20240607-041812-ladsgroup.json [04:18:44] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:23:14] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin2002 - T366555 [04:33:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T352010)', diff saved to https://phabricator.wikimedia.org/P64237 and previous config saved to /var/cache/conftool/dbconfig/20240607-043320-ladsgroup.json [04:33:23] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [04:33:24] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [04:33:36] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance [04:33:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2163 (T352010)', diff saved to https://phabricator.wikimedia.org/P64238 and previous config saved to /var/cache/conftool/dbconfig/20240607-043343-ladsgroup.json [04:35:09] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin2002 - T366555 [04:44:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [04:44:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1150.eqiad.wmnet with reason: Maintenance [04:53:45] 10ops-eqdfw, 06DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864 (10ayounsi) 03NEW p:05Triage→03High [04:56:01] 10ops-eqdfw, 06DC-Ops: cr2-eqdfw: PEM 0 Input Voltage Out Of Range - https://phabricator.wikimedia.org/T366864#9869836 (10Papaul) @ayounsi i 
think this is the same as what we had on https://phabricator.wikimedia.org/T294009 [05:00:45] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:10:07] (03PS1) 10Marostegui: db1152: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1039941 [05:15:35] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin2002 - T366555 [05:15:45] RESOLVED: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:19:44] (03CR) 10Marostegui: [C:03+2] db1152: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1039941 (owner: 10Marostegui) [05:29:10] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on elastic1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:31:55] FIRING: [4x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on elastic1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:34:28] (03PS1) 10Fabfur: depool text@drmrs before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1039944 [05:34:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [05:36:55] FIRING: [8x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:37:37] (03PS2) 10Fabfur: depool text@drmrs before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1039944 (https://phabricator.wikimedia.org/T366466) [05:40:47] (03PS1) 10Fabfur: hiera: enable IPIP for high-traffic1@drmrs for text services [puppet] - 10https://gerrit.wikimedia.org/r/1039947 (https://phabricator.wikimedia.org/T366466) [05:43:54] (03PS1) 10Fabfur: cache:hiera: enable IPIP on text@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1039948 (https://phabricator.wikimedia.org/T366466) [05:44:10] FIRING: [14x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:44:13] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039947 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur) [05:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:46:55] FIRING: [14x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1061:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:46:59] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039948 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur) [05:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:56:55] FIRING: [14x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1060:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:59:10] FIRING: [14x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1060:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240607T0600) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues 
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:10:00] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on elastic1081 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:11:55] FIRING: [17x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1057:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:16:52] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on elastic1057 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:19:10] FIRING: [11x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1057:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:20:00] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on elastic1081 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:22:05] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2121 to s7 master [puppet] - 10https://gerrit.wikimedia.org/r/1039593 (https://phabricator.wikimedia.org/T366875) [06:24:10] FIRING: [20x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1057:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:25:43] (03PS1) 10Marostegui: s7-pager.sql: Remove [software] - 10https://gerrit.wikimedia.org/r/1039960 [06:26:04] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on elastic1093 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:26:18] (03CR) 10Marostegui: [C:03+2] s7-pager.sql: Remove [software] - 10https://gerrit.wikimedia.org/r/1039960 (owner: 10Marostegui) [06:26:44] (03Merged) 10jenkins-bot: s7-pager.sql: Remove [software] - 10https://gerrit.wikimedia.org/r/1039960 (owner: 10Marostegui) [06:26:52] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on elastic1057 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:26:55] FIRING: [20x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1057:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:31:42] !log ryankemper@cumin2002 END (FAIL) - Cookbook 
sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin2002 - T366555 [06:33:31] (03PS69) 10Arnaudb: mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) [06:34:10] FIRING: [13x] SystemdUnitFailed: push_cross_cluster_settings_9400.service on elastic1057:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:34:39] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin2002 - T366555 [06:36:04] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on elastic1093 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:36:51] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin2002 - T366555 [06:38:29] (03CR) 10JMeybohm: [C:03+2] "Yeah...me too 😄" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039788 (https://phabricator.wikimedia.org/T350846) (owner: 10JMeybohm) [06:40:29] (03PS5) 10JMeybohm: flink-app: Update various modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039727 (https://phabricator.wikimedia.org/T362978) [06:49:10] 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9870050 (10Tarunno) Bangla Wikimoitree is a cross border initiative between Wikimedia Bangladesh Chapter and West Bengal Wikimedians User Group, 
India. Both are approved affiliates. Bangla... [06:50:30] (03Merged) 10jenkins-bot: Fix fixture generation for upstream splits [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039788 (https://phabricator.wikimedia.org/T350846) (owner: 10JMeybohm) [06:51:39] !log reimaging bast1003 to bookworm [06:51:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host bast1003.wikimedia.org with OS bookworm [06:53:02] (03CR) 10JMeybohm: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1039741 (https://phabricator.wikimedia.org/T366801) (owner: 10Ilias Sarantopoulos) [06:54:49] (03CR) 10DCausse: [C:03+1] flink-app: Update various modules (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039727 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [06:55:35] (03CR) 10Giuseppe Lavagetto: [C:03+1] "Indeed. the difference is that port 2380 is more important, but communication is protected by using client certificates. So maybe it's ind" [puppet] - 10https://gerrit.wikimedia.org/r/1038283 (https://phabricator.wikimedia.org/T366465) (owner: 10JMeybohm) [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240607T0700) [07:07:32] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on bast1003.wikimedia.org with reason: host reimage [07:09:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on bast1003.wikimedia.org with reason: host reimage [07:12:23] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin2002 - T366555 [07:12:24] (03PS1) 10Jcrespo: dbbackups: Prepare db2098 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/1039973 (https://phabricator.wikimedia.org/T360751) [07:14:05] (03CR) 10Marostegui: [C:03+1] "I am sad to see this host go, classic one" [puppet] - 10https://gerrit.wikimedia.org/r/1039973 (https://phabricator.wikimedia.org/T360751) (owner: 10Jcrespo) [07:16:09] (03PS1) 10Filippo Giunchedi: webperf: don't hardcode php version [puppet] - 10https://gerrit.wikimedia.org/r/1039974 (https://phabricator.wikimedia.org/T353912) [07:16:34] (03PS2) 10Jcrespo: dbbackups: Prepare db2098 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/1039973 (https://phabricator.wikimedia.org/T360751) [07:17:27] (03PS4) 10Aklapper: Rescue libphutil translations (languages below old export threshold) [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039341 (https://phabricator.wikimedia.org/T366377) (owner: 10Pppery) [07:17:56] (03CR) 10Filippo Giunchedi: "I used webperf to test statsd-exporter upgrade in https://phabricator.wikimedia.org/T302373 and it was easy enough to make the role work o" [puppet] - 10https://gerrit.wikimedia.org/r/1039974 (https://phabricator.wikimedia.org/T353912) (owner: 10Filippo Giunchedi) [07:19:32] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2098.codfw.wmnet with reason: about to decommission [07:19:45] !log 
jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2098.codfw.wmnet with reason: about to decommission [07:19:53] (03CR) 10Jcrespo: [C:03+2] dbbackups: Prepare db2098 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/1039973 (https://phabricator.wikimedia.org/T360751) (owner: 10Jcrespo) [07:26:15] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on elastic1074 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:26:38] (03PS1) 10Ayounsi: Disable Accounting report alerting until bug is fixed [puppet] - 10https://gerrit.wikimedia.org/r/1039978 (https://phabricator.wikimedia.org/T366874) [07:28:01] (03PS2) 10Ayounsi: Disable Accounting report alerting until bug is fixed [puppet] - 10https://gerrit.wikimedia.org/r/1039978 (https://phabricator.wikimedia.org/T366874) [07:29:43] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039978 (https://phabricator.wikimedia.org/T366874) (owner: 10Ayounsi) [07:30:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host bast1003.wikimedia.org with OS bookworm [07:30:49] (03PS3) 10Ayounsi: Disable Accounting report alerting until bug is fixed [puppet] - 10https://gerrit.wikimedia.org/r/1039978 (https://phabricator.wikimedia.org/T366874) [07:30:57] (03PS1) 10Jcrespo: dbbackups: Prepare for db2097 decommission [puppet] - 10https://gerrit.wikimedia.org/r/1039979 (https://phabricator.wikimedia.org/T362802) [07:31:39] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039978 (https://phabricator.wikimedia.org/T366874) (owner: 10Ayounsi) [07:31:55] FIRING: [11x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1074:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:32:13] (03PS3) 10JMeybohm: etcd::v3: Allow all nodes of an etcd cluster to connect to each other [puppet] - 10https://gerrit.wikimedia.org/r/1038283 (https://phabricator.wikimedia.org/T366465) [07:34:10] FIRING: [11x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1074:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:34:14] (03CR) 10Muehlenhoff: etcd::v3: Allow all nodes of an etcd cluster to connect to each other (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1038283 (https://phabricator.wikimedia.org/T366465) (owner: 10JMeybohm) [07:34:25] (03CR) 10JMeybohm: "Indeed. I double checked and we're creating peer certificates unconditionally for etcd v3. That should be sufficiently secure so that we d" [puppet] - 10https://gerrit.wikimedia.org/r/1038283 (https://phabricator.wikimedia.org/T366465) (owner: 10JMeybohm) [07:34:25] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2792/co" [puppet] - 10https://gerrit.wikimedia.org/r/1038283 (https://phabricator.wikimedia.org/T366465) (owner: 10JMeybohm) [07:36:15] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on elastic1074 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:36:50] (03PS4) 10JMeybohm: etcd::v3: Allow all nodes of an etcd cluster to connect to each other [puppet] - 10https://gerrit.wikimedia.org/r/1038283 (https://phabricator.wikimedia.org/T366465) [07:37:09] (03CR) 10JMeybohm: etcd::v3: Allow all nodes of an etcd cluster to connect to each other (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1038283 
(https://phabricator.wikimedia.org/T366465) (owner: 10JMeybohm) [07:37:15] (03PS1) 10Ilias Sarantopoulos: ml-services: decrease maxReplicas in viwiki-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039980 (https://phabricator.wikimedia.org/T349274) [07:38:57] (03PS2) 10Ilias Sarantopoulos: ml-services: decrease maxReplicas in viwiki-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039980 (https://phabricator.wikimedia.org/T349274) [07:39:04] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2793/co" [puppet] - 10https://gerrit.wikimedia.org/r/1038283 (https://phabricator.wikimedia.org/T366465) (owner: 10JMeybohm) [07:41:20] (03CR) 10Ayounsi: [V:04-1] Disable Accounting report alerting until bug is fixed [puppet] - 10https://gerrit.wikimedia.org/r/1039978 (https://phabricator.wikimedia.org/T366874) (owner: 10Ayounsi) [07:41:55] FIRING: [20x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:44:10] FIRING: [20x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:44:25] (03PS1) 10Jcrespo: dbbackups: Prepare db2099 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/1039999 (https://phabricator.wikimedia.org/T362883) [07:44:50] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2099.codfw.wmnet with reason: about to decommission [07:45:03] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 
0:00:00 on db2099.codfw.wmnet with reason: about to decommission [07:45:17] !log jynus@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2097.codfw.wmnet with reason: about to decommission [07:45:20] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1038283 (https://phabricator.wikimedia.org/T366465) (owner: 10JMeybohm) [07:45:31] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2097.codfw.wmnet with reason: about to decommission [07:45:39] (03CR) 10Jcrespo: [C:03+2] dbbackups: Prepare for db2097 decommission [puppet] - 10https://gerrit.wikimedia.org/r/1039979 (https://phabricator.wikimedia.org/T362802) (owner: 10Jcrespo) [07:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:46:03] (03CR) 10Jcrespo: [C:03+2] dbbackups: Prepare db2099 for decommission [puppet] - 10https://gerrit.wikimedia.org/r/1039999 (https://phabricator.wikimedia.org/T362883) (owner: 10Jcrespo) [07:46:25] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on elastic1068 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:46:33] (03PS4) 10Ayounsi: Disable Accounting report alerting until bug is fixed [puppet] - 10https://gerrit.wikimedia.org/r/1039978 (https://phabricator.wikimedia.org/T366874) [07:47:17] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039978 (https://phabricator.wikimedia.org/T366874) (owner: 10Ayounsi) [07:48:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host seaborgium.wikimedia.org [07:48:23] (03PS1) 10JMeybohm: Add ratelimit user to the wikikube/main clusters 
[puppet] - 10https://gerrit.wikimedia.org/r/1040060 (https://phabricator.wikimedia.org/T362310) [07:48:44] (03CR) 10Aklapper: [C:03+2] Rescue libphutil translations (languages below old export threshold) [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039341 (https://phabricator.wikimedia.org/T366377) (owner: 10Pppery) [07:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:48:59] (03CR) 10Aklapper: [V:03+2 C:03+2] "Thanks! Applies cleanly locally thus +2" [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039341 (https://phabricator.wikimedia.org/T366377) (owner: 10Pppery) [07:51:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host seaborgium.wikimedia.org [07:51:55] FIRING: [11x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:52:25] (03CR) 10Kevin Bazira: [C:03+1] ml-services: enable multiprocessing for eswiki-damaging and viwiki-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039776 (https://phabricator.wikimedia.org/T349274) (owner: 10Ilias Sarantopoulos) [07:56:21] PROBLEM - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [07:56:25] 
RECOVERY - Check unit status of push_cross_cluster_settings_9400 on elastic1068 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:56:46] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin2002 - T366555 [07:56:55] FIRING: [21x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:57:11] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki-root1002.eqiad.wmnet [07:57:20] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1190.eqiad.wmnet with reason: Maintenance [07:57:27] PROBLEM - Check unit status of push_cross_cluster_settings_9200 on elastic1100 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [07:57:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1190.eqiad.wmnet with reason: Maintenance [07:57:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1190 (T364299)', diff saved to https://phabricator.wikimedia.org/P64239 and previous config saved to /var/cache/conftool/dbconfig/20240607-075742-marostegui.json [07:57:46] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [07:59:10] FIRING: [21x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic1054:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:01:21] RECOVERY - CirrusSearch full_text eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=38 [08:03:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki-root1002.eqiad.wmnet [08:03:41] (03Abandoned) 10Aklapper: $wmgThrottlingExceptions for idwiki and enwiki 2024-04-25 to 2024-08-25 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031176 (https://phabricator.wikimedia.org/T363291) (owner: 10Wargo) [08:04:49] PROBLEM - Check unit status of push_cross_cluster_settings_9600 on elastic1102 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:04:55] PROBLEM - Check unit status of push_cross_cluster_settings_9400 on elastic1098 is CRITICAL: CRITICAL: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:06:55] FIRING: [13x] SystemdUnitFailed: push_cross_cluster_settings_9400.service on elastic1068:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:07:27] RECOVERY - Check unit status of push_cross_cluster_settings_9200 on elastic1100 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9200 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:08:17] (03PS5) 10Ayounsi: Disable Accounting report alerting until bug is fixed [puppet] - 10https://gerrit.wikimedia.org/r/1039978 
(https://phabricator.wikimedia.org/T366874) [08:09:20] (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039978 (https://phabricator.wikimedia.org/T366874) (owner: 10Ayounsi) [08:09:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2025.codfw.wmnet [08:12:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2025.codfw.wmnet [08:12:47] kubestagemaster2003 will briefly go down for a Ganeti reboot [08:14:22] (03PS1) 10JMeybohm: Create ratelimit namespace and releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040086 (https://phabricator.wikimedia.org/T362310) [08:14:23] PROBLEM - Host kubestagemaster2003 is DOWN: PING CRITICAL - Packet loss = 100% [08:14:49] RECOVERY - Check unit status of push_cross_cluster_settings_9600 on elastic1102 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9600 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:14:55] RECOVERY - Check unit status of push_cross_cluster_settings_9400 on elastic1098 is OK: OK: Status of the systemd unit push_cross_cluster_settings_9400 https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:15:32] !log deleted from zarcillo db2097, db2098, db2099 T362802 T366877 T362883 [08:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:38] T362802: decommission db2097.codfw.wmnet - https://phabricator.wikimedia.org/T362802 [08:15:39] T366877: decommission db2098 - https://phabricator.wikimedia.org/T366877 [08:15:39] T362883: decommission db2099.codfw.wmnet - https://phabricator.wikimedia.org/T362883 [08:16:04] FIRING: [2x] PuppetDisabled: Puppet disabled on mc1049:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=memcached&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [08:18:37] !log jmm@cumin2002 END 
(PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2025.codfw.wmnet [08:18:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2025.codfw.wmnet [08:18:44] FIRING: [3x] ProbeDown: Service ganeti2025:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:18:57] FIRING: KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:19:09] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4049.ulsfo.wmnet [08:19:10] (03PS1) 10Jcrespo: dbbackups: Remove all puppet references for db2097, db2098, db2099 [puppet] - 10https://gerrit.wikimedia.org/r/1040087 (https://phabricator.wikimedia.org/T358741) [08:19:25] !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp4049.ulsfo.wmnet [08:20:05] (03CR) 10AikoChou: [C:03+1] ml-services: enable multiprocessing for eswiki-damaging and viwiki-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039776 (https://phabricator.wikimedia.org/T349274) (owner: 10Ilias Sarantopoulos) [08:20:10] (03CR) 10AikoChou: [C:03+1] ml-services: decrease maxReplicas in viwiki-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039980 (https://phabricator.wikimedia.org/T349274) (owner: 10Ilias Sarantopoulos) [08:20:31] (03CR) 10Jcrespo: "To be deployed around the same time as the decom script." 
[puppet] - 10https://gerrit.wikimedia.org/r/1040087 (https://phabricator.wikimedia.org/T358741) (owner: 10Jcrespo) [08:20:39] RECOVERY - Host kubestagemaster2003 is UP: PING OK - Packet loss = 0%, RTA = 30.65 ms [08:22:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2026.codfw.wmnet [08:22:04] (03CR) 10Jelto: [C:03+2] aptrepo::staging: use gitlab client to download file, fix get_all [puppet] - 10https://gerrit.wikimedia.org/r/1039217 (https://phabricator.wikimedia.org/T347004) (owner: 10Jelto) [08:23:44] RESOLVED: [3x] ProbeDown: Service ganeti2025:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:23:57] RESOLVED: KubernetesCalicoDown: kubestagemaster2003.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2003.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [08:24:37] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1039735 (owner: 10EoghanGaffney) [08:28:22] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2026.codfw.wmnet [08:28:56] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4049.ulsfo.wmnet [08:30:15] (03CR) 10Giuseppe Lavagetto: [C:03+1] Add ratelimit user to the wikikube/main clusters [puppet] - 10https://gerrit.wikimedia.org/r/1040060 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [08:30:44] !log jynus@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2097.codfw.wmnet [08:31:50] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1039978 (https://phabricator.wikimedia.org/T366874) (owner: 10Ayounsi) 
[08:32:12] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete virt-star stub secret [labs/private] - 10https://gerrit.wikimedia.org/r/1039227 (owner: 10Muehlenhoff) [08:32:45] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove tendril stub cert [labs/private] - 10https://gerrit.wikimedia.org/r/1039189 (owner: 10Muehlenhoff) [08:32:53] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Remove obsolete LDAP stub secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1039184 (owner: 10Muehlenhoff) [08:33:07] (03CR) 10Jcrespo: [C:03+2] dbbackups: Remove all puppet references for db2097, db2098, db2099 [puppet] - 10https://gerrit.wikimedia.org/r/1040087 (https://phabricator.wikimedia.org/T358741) (owner: 10Jcrespo) [08:33:12] (03CR) 10Giuseppe Lavagetto: [C:03+1] Create ratelimit namespace and releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040086 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [08:33:59] (03CR) 10Vgutierrez: [V:03+1 C:04-1] lvs::realserver::ipip: Provide ferm MSS clamping support (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1035724 (https://phabricator.wikimedia.org/T365689) (owner: 10Vgutierrez) [08:34:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2026.codfw.wmnet [08:34:45] !log jynus@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2098.codfw.wmnet [08:34:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2026.codfw.wmnet [08:35:15] !log jynus@cumin1002 START - Cookbook sre.dns.netbox [08:37:32] !log jynus@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2097.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1002" [08:38:26] (03CR) 10JMeybohm: [C:03+2] Add ratelimit user to the wikikube/main clusters [puppet] - 10https://gerrit.wikimedia.org/r/1040060 
(https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [08:39:04] !log jynus@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2097.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1002" [08:39:04] !log jynus@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:39:05] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2097.codfw.wmnet [08:39:14] !log jynus@cumin1002 START - Cookbook sre.dns.netbox [08:39:15] 06SRE, 06collaboration-services, 06Release-Engineering-Team, 06Traffic, 05WMF-NDA: Move GitLab behind the CDN - https://phabricator.wikimedia.org/T366882 (10Jelto) 03NEW [08:39:19] (03CR) 10JMeybohm: [C:03+2] Create ratelimit namespace and releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040086 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [08:40:05] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1039677 (https://phabricator.wikimedia.org/T365696) (owner: 10Majavah) [08:40:41] !log jynus@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:40:42] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2098.codfw.wmnet [08:41:05] !log jynus@cumin1002 START - Cookbook sre.hosts.decommission for hosts db2099.codfw.wmnet [08:42:19] (03Merged) 10jenkins-bot: Create ratelimit namespace and releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040086 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [08:43:39] (03PS2) 10Jcrespo: dbbackups: Remove all puppet references for db2097, db2098, db2099 [puppet] - 10https://gerrit.wikimedia.org/r/1040087 (https://phabricator.wikimedia.org/T358741) [08:44:32] (03CR) 10Majavah: [C:03+2] aptrepo: Add mirror for OpenTofu packages [puppet] - 10https://gerrit.wikimedia.org/r/1039677 
(https://phabricator.wikimedia.org/T365696) (owner: 10Majavah) [08:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:46:55] !log jynus@cumin1002 START - Cookbook sre.dns.netbox [08:48:19] !log reboot dbprov1001,1002,2001,2002 [08:48:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:49:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2027.codfw.wmnet [08:49:22] !log import opentofu 1.7.2 to apt.wikimedia.org T365696 [08:49:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:26] T365696: Investigate how to run OpenTofu to manage Cloud VPS admin-only resources - https://phabricator.wikimedia.org/T365696 [08:50:22] !log jynus@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2099.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1002" [08:51:46] !log jynus@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: db2099.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jynus@cumin1002" [08:51:46] !log jynus@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:51:47] !log jynus@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts db2099.codfw.wmnet [08:52:39] (03PS1) 10Jelto: conftool-data: add 
gitlab and replicas [puppet] - 10https://gerrit.wikimedia.org/r/1040094 (https://phabricator.wikimedia.org/T366882) [08:52:48] (03PS3) 10Jcrespo: dbbackups: Remove all puppet references for db2097, db2098, db2099 [puppet] - 10https://gerrit.wikimedia.org/r/1040087 (https://phabricator.wikimedia.org/T358741) [08:52:53] (03CR) 10Jcrespo: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1040087 (https://phabricator.wikimedia.org/T358741) (owner: 10Jcrespo) [08:53:01] (03CR) 10CI reject: [V:04-1] conftool-data: add gitlab and replicas [puppet] - 10https://gerrit.wikimedia.org/r/1040094 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [08:54:10] (03PS2) 10Jelto: conftool-data: add gitlab and replicas [puppet] - 10https://gerrit.wikimedia.org/r/1040094 (https://phabricator.wikimedia.org/T366882) [08:56:30] (03CR) 10JMeybohm: "I can see why it feels more explicit to have the variables set in the image. But I do also think it's the wrong approach to have applicati" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1039642 (owner: 10Giuseppe Lavagetto) [08:56:38] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2097.codfw.wmnet - https://phabricator.wikimedia.org/T362802#9870309 (10jcrespo) This is ready for dc ops. [08:56:45] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2097.codfw.wmnet - https://phabricator.wikimedia.org/T362802#9870304 (10jcrespo) a:05jcrespo→03None [08:57:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2027.codfw.wmnet [08:58:07] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2098.codfw.wmnet - https://phabricator.wikimedia.org/T366877#9870310 (10jcrespo) a:05jcrespo→03None [08:58:11] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2098.codfw.wmnet - https://phabricator.wikimedia.org/T366877#9870316 (10jcrespo) This is ready for dc ops. 
[08:59:57] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2099.codfw.wmnet - https://phabricator.wikimedia.org/T362883#9870317 (10jcrespo) a:05jcrespo→03None [08:59:59] 10ops-codfw, 06DC-Ops, 10decommission-hardware: decommission db2099.codfw.wmnet - https://phabricator.wikimedia.org/T362883#9870323 (10jcrespo) This is ready for dc-ops. [09:00:10] 06SRE, 06collaboration-services, 06Release-Engineering-Team, 06Traffic, and 2 others: Move GitLab behind the CDN - https://phabricator.wikimedia.org/T366882#9870324 (10Jelto) [09:03:20] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:03:47] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:04:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2027.codfw.wmnet [09:04:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2027.codfw.wmnet [09:06:16] (03PS1) 10JMeybohm: Don't deploy istio certificate for the ratelimit service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040096 (https://phabricator.wikimedia.org/T362310) [09:06:28] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2028.codfw.wmnet [09:11:52] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4049.ulsfo.wmnet [09:13:58] (03CR) 10Clément Goubert: [C:03+1] mw-debug: change mail_host [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039737 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [09:13:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2028.codfw.wmnet [09:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:16:16] 
(03CR) 10JMeybohm: [C:03+2] Don't deploy istio certificate for the ratelimit service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040096 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [09:16:21] (03CR) 10Muehlenhoff: [C:03+1] "Looks good to me. Eventually cross checks will be covered by Bitu, but useful in the interim as well." [puppet] - 10https://gerrit.wikimedia.org/r/999103 (owner: 10Majavah) [09:16:56] 10ops-codfw, 06SRE, 06DC-Ops, 10Observability-Metrics: Memory upgrade request for prometheus200[56] - https://phabricator.wikimedia.org/T360895#9870356 (10fgiunchedi) I didn't realize it at the time, though codfw in T354685 got 192GB per host, whereas a week later in T354684 eqiad got 384GB per host. If w... [09:17:06] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: enable multiprocessing for eswiki-damaging and viwiki-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039776 (https://phabricator.wikimedia.org/T349274) (owner: 10Ilias Sarantopoulos) [09:18:29] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [09:18:42] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2145.codfw.wmnet with reason: Maintenance [09:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:18:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2145 (T352010)', diff saved to https://phabricator.wikimedia.org/P64241 and previous config saved to /var/cache/conftool/dbconfig/20240607-091849-ladsgroup.json [09:18:53] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [09:19:46] (03Merged) 10jenkins-bot: Don't deploy istio certificate for 
the ratelimit service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040096 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [09:20:13] (03Merged) 10jenkins-bot: ml-services: enable multiprocessing for eswiki-damaging and viwiki-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039776 (https://phabricator.wikimedia.org/T349274) (owner: 10Ilias Sarantopoulos) [09:20:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2028.codfw.wmnet [09:20:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2028.codfw.wmnet [09:22:35] PROBLEM - Disk space on backup1011 is CRITICAL: DISK CRITICAL - free space: /srv/objectstorage 4173857MiB (3% inode=95%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup1011&var-datasource=eqiad+prometheus/ops [09:22:45] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [09:23:00] that's me doing maintenance [09:23:32] wait no, backup1011 is something else [09:24:20] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [09:24:26] (03CR) 10Filippo Giunchedi: "This is one required half, the other half is in templates/wmnet with an entry for discovery.wmnet zone similar to e.g." [dns] - 10https://gerrit.wikimedia.org/r/1039882 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [09:24:54] so right alert, should be fixed soon [09:25:17] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM, let's merge early next week. 
This is the first step that needs to be merged" [puppet] - 10https://gerrit.wikimedia.org/r/1039406 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [09:25:36] (03CR) 10Filippo Giunchedi: "can't be merged yet but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1039887 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [09:25:38] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [09:26:31] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [09:26:48] (03PS2) 10EoghanGaffney: lists: Remove `server_uses_stunnel` option [puppet] - 10https://gerrit.wikimedia.org/r/1039735 [09:28:42] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [09:29:33] (03CR) 10Muehlenhoff: [C:03+1] "The error happens because you are syncing between a host in the Puppet 5 and a host in the Puppet 7 environments. Removing it for the migr" [puppet] - 10https://gerrit.wikimedia.org/r/1039735 (owner: 10EoghanGaffney) [09:30:19] (03CR) 10CI reject: [V:04-1] lists: Remove `server_uses_stunnel` option [puppet] - 10https://gerrit.wikimedia.org/r/1039735 (owner: 10EoghanGaffney) [09:30:39] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . 
[09:30:51] (03PS3) 10EoghanGaffney: lists: Remove `server_uses_stunnel` option [puppet] - 10https://gerrit.wikimedia.org/r/1039735 [09:30:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc1002.wikimedia.org [09:34:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [09:34:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc1002.wikimedia.org [09:35:18] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [09:35:22] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . [09:36:14] !log cgoubert@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-codfw [09:36:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host irc2002.wikimedia.org [09:37:55] 06SRE, 06collaboration-services, 06Release-Engineering-Team, 06Traffic, 13Patch-For-Review: Move GitLab behind the CDN - https://phabricator.wikimedia.org/T366882#9870408 (10Peachey88) [09:38:42] 06SRE, 06Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515#9870409 (10MoritzMuehlenhoff) [09:38:55] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-damaging' for release 'main' . 
[09:39:47] 06SRE, 06Infrastructure-Foundations: Migrate bastions to Bookworm - https://phabricator.wikimedia.org/T343515#9870410 (10MoritzMuehlenhoff) 05Open→03Resolved All bastions are on bookworm now [09:40:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host irc2002.wikimedia.org [09:40:38] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host pki1002.eqiad.wmnet [09:42:31] (03PS1) 10Majavah: Add fake opentofu admin passwords and tokens [labs/private] - 10https://gerrit.wikimedia.org/r/1040099 (https://phabricator.wikimedia.org/T365696) [09:42:35] RECOVERY - Disk space on backup1011 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=backup1011&var-datasource=eqiad+prometheus/ops [09:43:00] !log upgrading and restarting db1239 T360751 [09:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:03] T360751: Upgrade backup sources to MariaDB 10.6 - https://phabricator.wikimedia.org/T360751 [09:44:46] (03CR) 10Majavah: [V:03+2 C:03+2] Add fake opentofu admin passwords and tokens [labs/private] - 10https://gerrit.wikimedia.org/r/1040099 (https://phabricator.wikimedia.org/T365696) (owner: 10Majavah) [09:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:47:17] (03PS1) 10Jcrespo: dbbackups: Upgrade db1239 to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1040102 (https://phabricator.wikimedia.org/T360751) [09:47:31] (03PS2) 10Jcrespo: dbbackups: Upgrade db1239 to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1040102 (https://phabricator.wikimedia.org/T360751) [09:48:00] (03CR) 10Jcrespo: [C:03+2] dbbackups: Upgrade db1239 to MariaDB 10.6 
[puppet] - 10https://gerrit.wikimedia.org/r/1040102 (https://phabricator.wikimedia.org/T360751) (owner: 10Jcrespo) [09:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:52:23] !log powercycle pki1002 [09:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:52:42] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [09:53:24] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [09:53:34] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [09:54:21] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [09:54:33] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-codfw [09:54:45] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [09:54:53] !log cgoubert@cumin1002 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:wikikube-staging-worker-eqiad [09:56:02] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/ratelimit: apply [09:56:09] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[09:56:12] (03CR) 10Jelto: "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1039735 (owner: 10EoghanGaffney) [09:56:24] (03CR) 10EoghanGaffney: [C:03+2] lists: Remove `server_uses_stunnel` option [puppet] - 10https://gerrit.wikimedia.org/r/1039735 (owner: 10EoghanGaffney) [09:56:36] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/ratelimit: apply [09:58:41] (03PS1) 10JMeybohm: ratelimit: Don't deploy latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040105 (https://phabricator.wikimedia.org/T362310) [09:58:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host pki1002.eqiad.wmnet [10:00:14] (03PS3) 10Giuseppe Lavagetto: Improve ability to override php-fpm configuration in kubernetes [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1039642 [10:00:28] (03CR) 10Giuseppe Lavagetto: "I'll add to this that we've not used environment variables extensively in part because they're hard to add." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1039642 (owner: 10Giuseppe Lavagetto) [10:13:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:wikikube-staging-worker-eqiad [10:14:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T352010)', diff saved to https://phabricator.wikimedia.org/P64242 and previous config saved to /var/cache/conftool/dbconfig/20240607-101436-ladsgroup.json [10:14:40] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [10:14:58] (03PS1) 10Muehlenhoff: Remove profile::base::use_linux510_on_buster [puppet] - 10https://gerrit.wikimedia.org/r/1040109 [10:15:31] (03PS2) 10JMeybohm: ratelimit: Don't deploy latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040105 (https://phabricator.wikimedia.org/T362310) [10:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:19:50] (03CR) 10JMeybohm: [C:03+1] "Cool, LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1039642 (owner: 10Giuseppe Lavagetto) [10:20:15] (03CR) 10JMeybohm: [C:03+2] ratelimit: Don't deploy latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040105 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [10:21:11] (03Merged) 10jenkins-bot: ratelimit: Don't deploy latest [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040105 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [10:22:05] (03CR) 10Hnowlan: thumbor: make tmp-dir configurable, default disabled (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/908501 (https://phabricator.wikimedia.org/T334488) (owner: 10Hnowlan) [10:22:56] (03PS2) 10Muehlenhoff: profile::maps::tlsproxy: Unconditionally use PKI [puppet] - 10https://gerrit.wikimedia.org/r/1039188 (https://phabricator.wikimedia.org/T360778) [10:23:45] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/ratelimit: apply [10:27:11] (03PS1) 10JMeybohm: ratelimit: Ensure LOCAL_CACHE_SIZE_IN_BYTES is not converted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040114 (https://phabricator.wikimedia.org/T362310) [10:28:23] (03PS1) 10Majavah: P:openstack: New profile for running OpenTofu for Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1040115 (https://phabricator.wikimedia.org/T365696) [10:28:34] (03CR) 10Clément Goubert: [C:03+1] ratelimit: Ensure LOCAL_CACHE_SIZE_IN_BYTES 
is not converted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040114 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [10:28:37] (03PS2) 10Majavah: P:openstack: New profile for running OpenTofu for Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1040115 (https://phabricator.wikimedia.org/T365696) [10:28:42] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039188 (https://phabricator.wikimedia.org/T360778) (owner: 10Muehlenhoff) [10:28:58] (03CR) 10CI reject: [V:04-1] P:openstack: New profile for running OpenTofu for Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1040115 (https://phabricator.wikimedia.org/T365696) (owner: 10Majavah) [10:29:18] (03CR) 10JMeybohm: [C:03+2] ratelimit: Ensure LOCAL_CACHE_SIZE_IN_BYTES is not converted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040114 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [10:29:44] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P64243 and previous config saved to /var/cache/conftool/dbconfig/20240607-102944-ladsgroup.json [10:31:37] (03Merged) 10jenkins-bot: ratelimit: Ensure LOCAL_CACHE_SIZE_IN_BYTES is not converted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040114 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [10:31:49] (03PS1) 10Santiago Faci: MPIC chart: Adding missing properties and bumping MPIC version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040116 (https://phabricator.wikimedia.org/T362642) [10:32:19] (03PS3) 10Majavah: P:openstack: New profile for running OpenTofu for Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1040115 (https://phabricator.wikimedia.org/T365696) [10:32:41] (03CR) 10CI reject: [V:04-1] P:openstack: New profile for running OpenTofu for Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1040115 (https://phabricator.wikimedia.org/T365696) 
(owner: 10Majavah) [10:32:48] (03PS4) 10Majavah: P:openstack: New profile for running OpenTofu for Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1040115 (https://phabricator.wikimedia.org/T365696) [10:33:10] (03CR) 10CI reject: [V:04-1] P:openstack: New profile for running OpenTofu for Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1040115 (https://phabricator.wikimedia.org/T365696) (owner: 10Majavah) [10:33:59] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/ratelimit: apply [10:34:15] (03PS5) 10Majavah: P:openstack: New profile for running OpenTofu for Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1040115 (https://phabricator.wikimedia.org/T365696) [10:35:45] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1040115 (https://phabricator.wikimedia.org/T365696) (owner: 10Majavah) [10:36:38] (03CR) 10Pmiazga: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039822 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [10:43:48] (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: New profile for running OpenTofu for Cloud VPS [puppet] - 10https://gerrit.wikimedia.org/r/1040115 (https://phabricator.wikimedia.org/T365696) (owner: 10Majavah) [10:44:50] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1040115 (https://phabricator.wikimedia.org/T365696) (owner: 10Majavah) [10:44:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196', diff saved to https://phabricator.wikimedia.org/P64244 and previous config saved to /var/cache/conftool/dbconfig/20240607-104452-ladsgroup.json [10:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:47:20] (03PS1) 10Jcrespo: dbbackups: Remove all production references to db2102 [puppet] - 10https://gerrit.wikimedia.org/r/1040117 (https://phabricator.wikimedia.org/T366892) [10:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:50:17] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host rdb2010.codfw.wmnet [10:50:31] !log cgoubert@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host rdb2010.codfw.wmnet [10:50:46] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host rdb2010.codfw.wmnet [10:54:53] (03PS1) 10Majavah: P:openstack: opentofu: Fix config file location [puppet] - 10https://gerrit.wikimedia.org/r/1040121 [10:55:11] (03PS2) 10Majavah: P:openstack: opentofu: Fix config file location [puppet] - 10https://gerrit.wikimedia.org/r/1040121 [10:55:40] (03PS1) 10Muehlenhoff: Deprecate system::role for Cloud VPS-specific roles [puppet] - 10https://gerrit.wikimedia.org/r/1040123 [10:56:42] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2010.codfw.wmnet [10:57:02] !log 
cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host rdb2009.codfw.wmnet [11:00:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1196 (T352010)', diff saved to https://phabricator.wikimedia.org/P64245 and previous config saved to /var/cache/conftool/dbconfig/20240607-110000-ladsgroup.json [11:00:04] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240607T0700) [11:00:05] eoghan, jelto, arnoldokoth, and mutante: That opportune time for a GitLab version upgrades deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240607T1100). [11:00:10] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:00:17] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1206.eqiad.wmnet with reason: Maintenance [11:00:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1206 (T352010)', diff saved to https://phabricator.wikimedia.org/P64246 and previous config saved to /var/cache/conftool/dbconfig/20240607-110025-ladsgroup.json [11:00:29] !log jelto@cumin2002 START - Cookbook sre.hosts.reboot-single for host gitlab2002.wikimedia.org [11:00:45] FIRING: ProbeDown: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:01:26] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/ratelimit: apply [11:01:46] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/ratelimit: apply 
[11:02:04] (03PS1) 10Muehlenhoff: Deprecate system::role for wikikube roles [puppet] - 10https://gerrit.wikimedia.org/r/1040124 [11:02:47] PROBLEM - Host gitlab.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [11:03:35] RECOVERY - Host gitlab.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 30.47 ms [11:03:39] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2009.codfw.wmnet [11:03:44] RESOLVED: ProbeDown: Service docker-registry:443 has failed probes (http_docker-registry_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:04:30] (03CR) 10Majavah: [C:03+2] P:openstack: opentofu: Fix config file location [puppet] - 10https://gerrit.wikimedia.org/r/1040121 (owner: 10Majavah) [11:05:15] !log jelto@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gitlab2002.wikimedia.org [11:05:45] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host rdb2008.codfw.wmnet [11:07:20] (03PS3) 10Slyngshede: Fix bug where SSH keys are imported incorrectly. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1038778 (https://phabricator.wikimedia.org/T366525) [11:11:04] (03PS1) 10Muehlenhoff: Deprecate system::role for search roles [puppet] - 10https://gerrit.wikimedia.org/r/1040125 [11:12:21] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2008.codfw.wmnet [11:12:54] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host rdb2007.codfw.wmnet [11:14:19] (03PS2) 10Santiago Faci: MPIC chart: Adding missing properties and bumping MPIC version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040116 (https://phabricator.wikimedia.org/T362642) [11:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:15:52] (03CR) 10Brouberol: [C:03+1] MPIC chart: Adding missing properties and bumping MPIC version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040116 (https://phabricator.wikimedia.org/T362642) (owner: 10Santiago Faci) [11:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:20:43] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb2007.codfw.wmnet [11:22:19] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host rdb1014.eqiad.wmnet [11:23:16] (03CR) 10Santiago Faci: [C:03+2] MPIC chart: Adding missing properties and bumping MPIC version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040116 (https://phabricator.wikimedia.org/T362642) (owner: 10Santiago Faci) [11:23:44] FIRING: [2x] SystemdUnitFailed: 
gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:24:07] (03Merged) 10jenkins-bot: MPIC chart: Adding missing properties and bumping MPIC version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040116 (https://phabricator.wikimedia.org/T362642) (owner: 10Santiago Faci) [11:25:45] RESOLVED: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:26:27] (03PS70) 10Arnaudb: mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) [11:28:02] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [11:28:16] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1014.eqiad.wmnet [11:28:19] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [11:28:44] FIRING: [2x] SystemdUnitFailed: gitlab-package-puller.service on apt-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:29:23] (03PS71) 10Arnaudb: mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) [11:29:30] (03PS11) 10Giuseppe Lavagetto: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) [11:29:30] (03PS7) 10Giuseppe 
Lavagetto: statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) [11:29:30] (03PS7) 10Giuseppe Lavagetto: mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265) [11:29:31] (03PS2) 10Giuseppe Lavagetto: mediawiki: allow passing variables to php-fpm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039724 [11:29:32] (03PS2) 10Giuseppe Lavagetto: mw-debug: start using php.envvars, expose statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039779 (https://phabricator.wikimedia.org/T365265) [11:29:33] (03PS2) 10Giuseppe Lavagetto: mw-debug: remove vintage setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039780 [11:29:50] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host rdb1012.eqiad.wmnet [11:30:07] (03PS1) 10Pmiazga: beta: introduce pl.wikivoyage.beta.wmcloud.org wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039809 (https://phabricator.wikimedia.org/T355281) [11:30:20] (03CR) 10CI reject: [V:04-1] mediawiki: allow passing variables to php-fpm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039724 (owner: 10Giuseppe Lavagetto) [11:30:37] (03CR) 10CI reject: [V:04-1] mw-debug: start using php.envvars, expose statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039779 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [11:30:43] (03CR) 10CI reject: [V:04-1] mw-debug: remove vintage setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039780 (owner: 10Giuseppe Lavagetto) [11:30:54] (03CR) 10Pmiazga: "Tested with I477c4b72297d8e740461a029e0fd1c7bca818c2f on deployment-prep. 
The Polish Wikivoyage is working as expected" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039809 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [11:33:13] (03PS2) 10Pmiazga: beta: introduce pl.wikivoyage.beta.wmcloud.org wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039809 (https://phabricator.wikimedia.org/T355281) [11:35:32] (03CR) 10CI reject: [V:04-1] mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [11:35:56] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1012.eqiad.wmnet [11:36:16] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host rdb1013.eqiad.wmnet [11:38:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T364299)', diff saved to https://phabricator.wikimedia.org/P64248 and previous config saved to /var/cache/conftool/dbconfig/20240607-113824-marostegui.json [11:38:28] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [11:42:09] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1013.eqiad.wmnet [11:42:47] !log cgoubert@cumin1002 START - Cookbook sre.hosts.reboot-single for host rdb1011.eqiad.wmnet [11:43:44] FIRING: [3x] SystemdUnitFailed: netbox_ganeti_esams01_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:45:09] PROBLEM - Docker registry HTTPS interface on registry1003 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - string schemaVersion not found on https://registry1003.eqiad.wmnet:443/v2/bullseye/manifests/latest - 362 bytes in 0.027 second response time https://wikitech.wikimedia.org/wiki/Docker [11:45:45] FIRING: 
[3x] SystemdUnitFailed: netbox_ganeti_esams01_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:45:57] FIRING: ProbeDown: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:46:20] that's me rebooting redis I think [11:46:30] it should recover in a bit [11:47:11] RECOVERY - Docker registry HTTPS interface on registry1003 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Docker [11:47:20] (03CR) 10Giuseppe Lavagetto: "Actually, the reason to do this is that we should use environment variables more to pass information to php-fpm, and having a good mechani" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1039642 (owner: 10Giuseppe Lavagetto) [11:48:41] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host rdb1011.eqiad.wmnet [11:48:44] FIRING: [3x] SystemdUnitFailed: netbox_ganeti_esams01_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:50:57] RESOLVED: ProbeDown: Service docker-registry:443 has failed probes (http_docker-registry_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#docker-registry:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:51:06] there we go [11:51:19] sorry for the noise [11:51:33] (03PS1) 
10JMeybohm: ratelimit: Allow ingress on the HTTP port for easier testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040138 (https://phabricator.wikimedia.org/T362310) [11:51:47] (03PS1) 10Dreamy Jazz: Blank translation of 'log-name-tag' in az [core] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1040139 (https://phabricator.wikimedia.org/T361695) [11:52:02] jouncebot: nowandnext [11:52:02] For the next 19 hour(s) and 7 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240607T0700) [11:52:02] In 19 hour(s) and 7 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240608T0700) [11:53:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P64249 and previous config saved to /var/cache/conftool/dbconfig/20240607-115333-marostegui.json [11:53:44] FIRING: [3x] SystemdUnitFailed: netbox_ganeti_esams01_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:54:57] dduvall: Can I deploy the backport https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1040139 for https://phabricator.wikimedia.org/T361695 - Based on https://logstash.wikimedia.org/goto/8ec4f6ad2049384f6225911c1f552d8d there were 300 errors in the last hour. 
[11:55:45] FIRING: [3x] SystemdUnitFailed: netbox_ganeti_esams01_sync.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:56:06] (03PS3) 10Giuseppe Lavagetto: mediawiki: allow passing variables to php-fpm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039724 [11:56:06] (03PS3) 10Giuseppe Lavagetto: mw-debug: start using php.envvars, expose statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039779 (https://phabricator.wikimedia.org/T365265) [11:56:06] (03PS3) 10Giuseppe Lavagetto: mw-debug: remove vintage setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039780 [11:56:27] It doesn't seem to necessarily fit into the criteria for an emergency deploy, but it could mean 21,000 errors in total until Monday when this fix could be backported. [11:57:41] (03CR) 10Pmiazga: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1039822 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [11:57:44] The error itself doesn't stop Special:Log working, but it does cause a loss in functionality (a dropdown option will not appear in Special:Log for users using the az language) [11:58:19] Also pinging thcipriani: [11:58:22] (03CR) 10Pmiazga: "I wanted to see results and misclicked on calling "Run puppet compiler" again. 
I only hope it's not going to trigger it for second time" [puppet] - 10https://gerrit.wikimedia.org/r/1039822 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [11:58:47] (03CR) 10JMeybohm: [C:03+2] flink-app: Update various modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039727 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [11:59:45] (03Merged) 10jenkins-bot: flink-app: Update various modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039727 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [12:00:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T352010)', diff saved to https://phabricator.wikimedia.org/P64250 and previous config saved to /var/cache/conftool/dbconfig/20240607-120051-ladsgroup.json [12:00:57] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:01:33] !log aokoth@cumin1002 START - Cookbook sre.hosts.reboot-single for host phab1004.eqiad.wmnet [12:03:08] (03CR) 10JMeybohm: [C:03+2] ratelimit: Allow ingress on the HTTP port for easier testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040138 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [12:04:09] Error: 502, Broken pipe at 2024-06-07 12:03:04 GMT [12:04:21] (03Merged) 10jenkins-bot: ratelimit: Allow ingress on the HTTP port for easier testing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040138 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [12:04:23] Phabricator is unreachable from esams, at least [12:05:26] (03CR) 10Clément Goubert: [C:03+2] beta: Add server_alias for wikivoyage.beta.wmcloud.org [puppet] - 10https://gerrit.wikimedia.org/r/1039822 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [12:05:42] Phab works for me (via esams as well) [12:05:44] Southparkfan, being in central Europe I am quite sure I'm going via esams and it loads here [12:05:55] looks like it's back 
online [12:06:02] (03PS1) 10Zabe: create u4c.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1040144 (https://phabricator.wikimedia.org/T366649) [12:06:12] I did encounter multiple 502s for a few minutes, though [12:06:33] yeah, I came across a 502 tab as well [12:06:51] (03PS1) 10Majavah: P:openstack: opentofu: fix configuration [puppet] - 10https://gerrit.wikimedia.org/r/1040145 [12:07:04] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/ratelimit: apply [12:07:05] southparkfan: I just rebooted it (https://phabricator.wikimedia.org/T366555). Should be back now. :) [12:07:20] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/ratelimit: apply [12:07:32] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/ratelimit: apply [12:07:48] !log aokoth@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host phab1004.eqiad.wmnet [12:07:57] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/ratelimit: apply [12:07:57] (03PS3) 10Ilias Sarantopoulos: ml-services: decrease maxReplicas in viwiki-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039980 (https://phabricator.wikimedia.org/T349274) [12:08:07] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/ratelimit: apply [12:08:09] arnoldokoth: oh true, I remember now... I never know what to make of "12:00PM" and whether that's noon or midnight. 
24h scheme is very welcome in the phuture :) [12:08:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P64251 and previous config saved to /var/cache/conftool/dbconfig/20240607-120841-marostegui.json [12:08:55] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/ratelimit: apply [12:08:57] that task is protected, but at least it matches the hypothesis it's an issue between Phab and ATS comms ;) [12:09:10] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on elastic1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:09:57] Southparkfan: https://lists.wikimedia.org/hyperkitty/list/wikitech-l@lists.wikimedia.org/thread/YMXZBHXWIZ76V3EVZPIRGTFQAWQO3WWO/ [12:10:33] right, thanks andre! [12:10:40] * andre needs to reboot too [12:12:22] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: decrease maxReplicas in viwiki-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039980 (https://phabricator.wikimedia.org/T349274) (owner: 10Ilias Sarantopoulos) [12:12:36] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Test Puppet 8 readiness - https://phabricator.wikimedia.org/T366900 (10MoritzMuehlenhoff) 03NEW [12:13:14] (03Merged) 10jenkins-bot: ml-services: decrease maxReplicas in viwiki-reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039980 (https://phabricator.wikimedia.org/T349274) (owner: 10Ilias Sarantopoulos) [12:14:01] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Test Puppet 8 readiness - https://phabricator.wikimedia.org/T366900#9870774 (10MoritzMuehlenhoff) p:05Triage→03Medium [12:14:28] (03PS3) 10Hnowlan: Upgrade base OS to Debian bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1039778 
(https://phabricator.wikimedia.org/T355020) [12:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:16:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P64252 and previous config saved to /var/cache/conftool/dbconfig/20240607-121559-ladsgroup.json [12:16:04] FIRING: [2x] PuppetDisabled: Puppet disabled on mc1049:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=memcached&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [12:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:23:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T364299)', diff saved to https://phabricator.wikimedia.org/P64253 and previous config saved to /var/cache/conftool/dbconfig/20240607-122349-marostegui.json [12:23:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1199.eqiad.wmnet with reason: Maintenance [12:23:53] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [12:24:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1199.eqiad.wmnet with reason: Maintenance [12:24:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T364299)', diff saved to https://phabricator.wikimedia.org/P64254 and previous config saved to 
/var/cache/conftool/dbconfig/20240607-122413-marostegui.json [12:24:58] (03CR) 10CI reject: [V:04-1] Upgrade base OS to Debian bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1039778 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan) [12:29:55] thcipriani: dduvall: Just checking if you saw my message about wanting to do an emergency deploy for https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1040139? [12:31:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206', diff saved to https://phabricator.wikimedia.org/P64255 and previous config saved to /var/cache/conftool/dbconfig/20240607-123108-ladsgroup.json [12:31:51] !log isaranto@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [12:33:48] !log isaranto@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality-reverted' for release 'main' . [12:37:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T364069)', diff saved to https://phabricator.wikimedia.org/P64256 and previous config saved to /var/cache/conftool/dbconfig/20240607-123754-marostegui.json [12:37:59] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [12:38:44] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [12:38:51] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [12:40:43] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [12:41:00] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [12:44:05] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [12:44:40] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [12:45:45] RESOLVED: 
SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:46:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1206 (T352010)', diff saved to https://phabricator.wikimedia.org/P64257 and previous config saved to /var/cache/conftool/dbconfig/20240607-124616-ladsgroup.json [12:46:20] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [12:46:22] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:46:33] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1207.eqiad.wmnet with reason: Maintenance [12:46:34] (03PS1) 10Awight: Revert "Temporary monitoring for scraper" [puppet] - 10https://gerrit.wikimedia.org/r/1040075 [12:46:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1207 (T352010)', diff saved to https://phabricator.wikimedia.org/P64258 and previous config saved to /var/cache/conftool/dbconfig/20240607-124641-ladsgroup.json [12:48:21] (03CR) 10Elukey: [C:03+2] profile::services_proxy::envoy: add inference-staging to listeners [puppet] - 10https://gerrit.wikimedia.org/r/1039741 (https://phabricator.wikimedia.org/T366801) (owner: 10Ilias Sarantopoulos) [12:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:49:58] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [12:50:02] (03CR) 10Gergő Tisza: [C:03+2] beta: introduce pl.wikivoyage.beta.wmcloud.org wiki 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039809 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [12:50:41] (03Merged) 10jenkins-bot: beta: introduce pl.wikivoyage.beta.wmcloud.org wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039809 (https://phabricator.wikimedia.org/T355281) (owner: 10Pmiazga) [12:53:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P64259 and previous config saved to /var/cache/conftool/dbconfig/20240607-125303-marostegui.json [12:56:26] (03PS1) 10Kosta Harlan: [WIP] mediamoderation: Add one-off job for processing the Commons backlog [puppet] - 10https://gerrit.wikimedia.org/r/1040150 [12:57:17] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:57:55] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:59:42] (03CR) 10CI reject: [V:04-1] [WIP] mediamoderation: Add one-off job for processing the Commons backlog [puppet] - 10https://gerrit.wikimedia.org/r/1040150 (owner: 10Kosta Harlan) [13:00:28] (03PS1) 10DCausse: cirrus-streaming-updater: disable sanitizer on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040151 [13:01:48] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:01:56] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [13:02:20] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:04:19] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: 
apply [13:04:42] !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [13:05:15] (03CR) 10Majavah: [C:03+2] openldap: cross-validate-accounts: Note shell users disabled in LDAP [puppet] - 10https://gerrit.wikimedia.org/r/999103 (owner: 10Majavah) [13:05:45] !log uploaded wmf-laptop 1.0.0 to component/wmf-laptop for bookworm-wikimedia [13:05:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:37] (03PS1) 10Elukey: admin_ng: update Bookworm-based Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040153 (https://phabricator.wikimedia.org/T356252) [13:07:10] (03CR) 10Ottomata: "@swfrench@wikimedia.org what's the urgency on doing the redeployment? Does it need to happen or can it just wait until the next time we d" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039727 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [13:08:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248', diff saved to https://phabricator.wikimedia.org/P64260 and previous config saved to /var/cache/conftool/dbconfig/20240607-130811-marostegui.json [13:10:49] FIRING: [2x] PuppetDisabled: Puppet disabled on mc1049:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=memcached&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [13:12:42] (03CR) 10Elukey: "Hello folks! 
I thought that this was was the easiest to avoid a production-images version bump + another version bump in deployment-charts" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040153 (https://phabricator.wikimedia.org/T356252) (owner: 10Elukey) [13:15:19] (03PS1) 10Cathal Mooney: Validate IRB interface names correspond to vlan and refactor [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1040154 (https://phabricator.wikimedia.org/T366348) [13:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:15:49] RESOLVED: [2x] PuppetDisabled: Puppet disabled on mc1049:9100 - https://wikitech.wikimedia.org/wiki/Puppet/Runbooks#Puppet_Disabled - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?var-cluster=memcached&viewPanel=14 - https://alerts.wikimedia.org/?q=alertname%3DPuppetDisabled [13:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:23:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1248 (T364069)', diff saved to https://phabricator.wikimedia.org/P64261 and previous config saved to /var/cache/conftool/dbconfig/20240607-132319-marostegui.json [13:23:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1249.eqiad.wmnet with reason: Maintenance [13:23:23] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [13:23:34] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1249.eqiad.wmnet with reason: Maintenance [13:23:42] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1249 (T364069)', diff saved to https://phabricator.wikimedia.org/P64262 and previous config saved to /var/cache/conftool/dbconfig/20240607-132342-marostegui.json [13:27:37] (03PS1) 10Vgutierrez: hiera: Add dummy IPIP config for cloud environments [puppet] - 10https://gerrit.wikimedia.org/r/1040158 (https://phabricator.wikimedia.org/T366466) [13:27:59] (03PS3) 10Ssingh: geo-maps: define initial mapping for South America (magru) [dns] - 10https://gerrit.wikimedia.org/r/1025366 (https://phabricator.wikimedia.org/T346722) [13:28:46] (03PS1) 10Bking: Revert "dse-k8s: replace 'airflow-analytics-test' ns with 'airflow'" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040077 [13:28:57] (03CR) 10Bking: [V:03+2 C:03+2] Revert "dse-k8s: replace 'airflow-analytics-test' ns with 'airflow'" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040077 (owner: 10Bking) [13:29:27] (03CR) 10Vgutierrez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1040158 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [13:30:42] (03PS4) 10Ssingh: geo-maps: define initial mapping for South America (magru) [dns] - 10https://gerrit.wikimedia.org/r/1025366 (https://phabricator.wikimedia.org/T346722) [13:33:26] (03CR) 10Alexandros Kosiaris: mw-mcrouter: Switch helmfile.d to use the newer cache module (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038860 (owner: 10Alexandros Kosiaris) [13:33:40] (03PS1) 10Elukey: helmfile.d: update oauth2-proxy in aux's Jaeger config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040159 (https://phabricator.wikimedia.org/T356252) [13:34:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:35:50] (03CR) 10Alexandros Kosiaris: sextant 
cache: Allow defining mcrouter's clusterIP (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038858 (owner: 10Alexandros Kosiaris) [13:39:40] (03PS72) 10Arnaudb: mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) [13:39:48] (03PS2) 10Alexandros Kosiaris: sextant cache: Add new service major version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038857 [13:39:49] (03PS2) 10Alexandros Kosiaris: sextant cache: Allow defining mcrouter's clusterIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038858 [13:39:49] (03PS3) 10Alexandros Kosiaris: mcrouter: Bump chart modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038859 [13:39:49] (03PS3) 10Alexandros Kosiaris: mw-mcrouter: Switch helmfile.d to use the newer cache module [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038860 [13:43:21] (03CR) 10Vgutierrez: [C:03+2] hiera: Add dummy IPIP config for cloud environments [puppet] - 10https://gerrit.wikimedia.org/r/1040158 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [13:44:34] (03CR) 10Alexandros Kosiaris: [C:03+2] sextant cache: Add new service major version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038857 (owner: 10Alexandros Kosiaris) [13:45:24] (03Merged) 10jenkins-bot: sextant cache: Add new service major version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038857 (owner: 10Alexandros Kosiaris) [13:45:24] (03PS73) 10Arnaudb: mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) [13:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:46:04] (03CR) 10Alexandros Kosiaris: [C:03+2] sextant cache: Allow defining mcrouter's clusterIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038858 (owner: 10Alexandros Kosiaris) [13:46:53] (03Merged) 10jenkins-bot: sextant cache: Allow defining mcrouter's clusterIP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038858 (owner: 10Alexandros Kosiaris) [13:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:51:04] (03CR) 10CI reject: [V:04-1] mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [13:51:07] (03PS1) 10Eevans: Upgrade data-gateway (staging) to v1.0.6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040168 [13:53:03] (03PS74) 10Arnaudb: mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) [13:54:27] (03PS1) 10Filippo Giunchedi: k8s: send logs to per-cluster kafka topics [puppet] - 10https://gerrit.wikimedia.org/r/1040170 (https://phabricator.wikimedia.org/T366710) [13:54:49] (03CR) 10CI reject: [V:04-1] k8s: send logs to per-cluster kafka topics [puppet] - 10https://gerrit.wikimedia.org/r/1040170 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi) [13:57:13] (03PS75) 10Arnaudb: mariadb: add some logic to allow instance conversion [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) [13:58:45] (03PS2) 10Filippo Giunchedi: k8s: send logs to per-cluster kafka topics [puppet] - 10https://gerrit.wikimedia.org/r/1040170 
(https://phabricator.wikimedia.org/T366710) [14:00:06] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure: Test Puppet 8 readiness - https://phabricator.wikimedia.org/T366900#9871062 (10jhathaway) Thanks for opening this @MoritzMuehlenhoff. I expect one of the larger tasks to prepare for puppet 8 will be making our code compatible with the [strict sett... [14:00:11] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2797/co" [puppet] - 10https://gerrit.wikimedia.org/r/1040170 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi) [14:01:04] (03CR) 10JHathaway: [C:03+1] Remove profile::base::use_linux510_on_buster [puppet] - 10https://gerrit.wikimedia.org/r/1040109 (owner: 10Muehlenhoff) [14:01:46] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2798/co" [puppet] - 10https://gerrit.wikimedia.org/r/1040170 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi) [14:02:35] !log restart swift-proxy on ms-fe1009 ms-fe1011 ms-fe1012 ms-fe1014 T360913 [14:02:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:41] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) 
- https://phabricator.wikimedia.org/T360913 [14:05:53] (03CR) 10Alexandros Kosiaris: [C:03+1] Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [14:05:54] (03CR) 10Arnaudb: "Hope this one is a bit closer than the previous, I've removed the faulty testing as we agreed upon a later implementation and I don't want" [software/spicerack] - 10https://gerrit.wikimedia.org/r/1005531 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [14:11:14] (03CR) 10BBlack: [C:03+1] geo-maps: define initial mapping for South America (magru) [dns] - 10https://gerrit.wikimedia.org/r/1025366 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [14:11:52] (03CR) 10Ssingh: "Do not merge before Tuesday." [dns] - 10https://gerrit.wikimedia.org/r/1025366 (https://phabricator.wikimedia.org/T346722) (owner: 10Ssingh) [14:13:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T352010)', diff saved to https://phabricator.wikimedia.org/P64263 and previous config saved to /var/cache/conftool/dbconfig/20240607-141349-ladsgroup.json [14:13:54] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:16:12] (03PS1) 10Elukey: docker::reporter: update k8s_rules.ini [puppet] - 10https://gerrit.wikimedia.org/r/1040179 (https://phabricator.wikimedia.org/T356252) [14:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:18:47] (03PS4) 10Cathal Mooney: Change EVPN BGP YAML to group into clusters and add codfw switches [homer/public] - 10https://gerrit.wikimedia.org/r/1034889 (https://phabricator.wikimedia.org/T365169) [14:19:16] (03CR) 10CI reject: [V:04-1] Change EVPN BGP YAML to group into clusters and add codfw switches [homer/public] - 10https://gerrit.wikimedia.org/r/1034889 (https://phabricator.wikimedia.org/T365169) (owner: 10Cathal Mooney) [14:24:15] (03PS5) 10Cathal Mooney: Change EVPN BGP YAML to group into clusters and add codfw switches [homer/public] - 10https://gerrit.wikimedia.org/r/1034889 (https://phabricator.wikimedia.org/T365169) [14:25:18] (03CR) 10CI reject: [V:04-1] Change EVPN BGP YAML to group into clusters and add codfw switches [homer/public] - 10https://gerrit.wikimedia.org/r/1034889 (https://phabricator.wikimedia.org/T365169) (owner: 10Cathal Mooney) [14:28:04] (03PS6) 10Cathal Mooney: Change EVPN BGP YAML to group into clusters and add codfw switches [homer/public] - 10https://gerrit.wikimedia.org/r/1034889 (https://phabricator.wikimedia.org/T365169) [14:28:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', diff saved to https://phabricator.wikimedia.org/P64264 and previous config saved to /var/cache/conftool/dbconfig/20240607-142856-ladsgroup.json [14:29:20] (03PS2) 10Elukey: services: update the rec-api's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018717 (https://phabricator.wikimedia.org/T205870) [14:29:21] (03CR) 10JHathaway: [C:03+2] mw-debug: change mail_host [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039737 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [14:29:35] (03CR) 10CI reject: [V:04-1] Change EVPN BGP YAML to group into clusters and add codfw switches [homer/public] - 10https://gerrit.wikimedia.org/r/1034889 (https://phabricator.wikimedia.org/T365169) (owner: 10Cathal Mooney) 
[14:29:44] (03CR) 10Clément Goubert: [C:03+1] admin_ng: update Bookworm-based Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040153 (https://phabricator.wikimedia.org/T356252) (owner: 10Elukey) [14:29:48] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1039974 (https://phabricator.wikimedia.org/T353912) (owner: 10Filippo Giunchedi) [14:30:36] (03PS1) 10Eevans: cassandra-dev: upgrade to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/1040188 (https://phabricator.wikimedia.org/T350567) [14:31:03] (03PS7) 10Cathal Mooney: Change EVPN BGP YAML to group into clusters and add codfw switches [homer/public] - 10https://gerrit.wikimedia.org/r/1034889 (https://phabricator.wikimedia.org/T365169) [14:31:16] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1040188 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans) [14:31:27] urandom: wow java11, finally :D [14:31:31] (03CR) 10CI reject: [V:04-1] Change EVPN BGP YAML to group into clusters and add codfw switches [homer/public] - 10https://gerrit.wikimedia.org/r/1034889 (https://phabricator.wikimedia.org/T365169) (owner: 10Cathal Mooney) [14:33:24] (03PS8) 10Cathal Mooney: Change EVPN BGP YAML to group into clusters and add codfw switches [homer/public] - 10https://gerrit.wikimedia.org/r/1034889 (https://phabricator.wikimedia.org/T365169) [14:34:32] (03CR) 10Cathal Mooney: [C:03+2] Change EVPN BGP YAML to group into clusters and add codfw switches [homer/public] - 10https://gerrit.wikimedia.org/r/1034889 (https://phabricator.wikimedia.org/T365169) (owner: 10Cathal Mooney) [14:35:22] (03Merged) 10jenkins-bot: Change EVPN BGP YAML to group into clusters and add codfw switches [homer/public] - 10https://gerrit.wikimedia.org/r/1034889 (https://phabricator.wikimedia.org/T365169) (owner: 10Cathal Mooney) [14:36:14] (03CR) 10DCausse: [C:03+2] Search update pipeline: enable saneitizer explicitly for eqiad and codfw 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1039736 (owner: 10Peter Fischer) [14:36:52] (03Abandoned) 10DCausse: cirrus-streaming-updater: disable sanitizer on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040151 (owner: 10DCausse) [14:37:13] (03Merged) 10jenkins-bot: Search update pipeline: enable saneitizer explicitly for eqiad and codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039736 (owner: 10Peter Fischer) [14:37:40] !log jhathaway@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:37:59] !log jhathaway@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [14:38:00] !log jhathaway@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [14:38:10] !log jhathaway@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [14:38:44] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:10] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [14:39:45] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [14:40:41] (03PS2) 10Eevans: cassandra-dev: upgrade to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/1040188 (https://phabricator.wikimedia.org/T350567) [14:41:22] (03CR) 10Scott French: [C:03+1] Upgrade data-gateway (staging) to v1.0.6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040168 (owner: 10Eevans) [14:42:37] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1040188 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans) [14:43:18] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [14:44:05] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209', 
diff saved to https://phabricator.wikimedia.org/P64265 and previous config saved to /var/cache/conftool/dbconfig/20240607-144404-ladsgroup.json [14:45:24] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new entries for cr2-codfw peering to ssw1-d8-codfw - cmooney@cumin1002" [14:45:33] (PS1) KartikMistry: Update Apertium to 2024-06-07-143238-production [deployment-charts] - https://gerrit.wikimedia.org/r/1040195 (https://phabricator.wikimedia.org/T356252) [14:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:46:16] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new entries for cr2-codfw peering to ssw1-d8-codfw - cmooney@cumin1002" [14:46:16] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:47:13] (CR) Scott French: [C:+1] "For the securityContext changes, as long as regular deployments are relatively frequent (e.g., likely to happen in the next few weeks), I " [deployment-charts] - https://gerrit.wikimedia.org/r/1039727 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm) [14:47:46] (CR) Elukey: [C:+1] "LGTM thanks!" 
[deployment-charts] - https://gerrit.wikimedia.org/r/1040195 (https://phabricator.wikimedia.org/T356252) (owner: KartikMistry) [14:48:03] (CR) Eevans: [C:+2] cassandra-dev: upgrade to Java 11 [puppet] - https://gerrit.wikimedia.org/r/1040188 (https://phabricator.wikimedia.org/T350567) (owner: Eevans) [14:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:49:15] (CR) Bandrqaid654: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1039697 (owner: Ayounsi) [14:53:16] (PS1) Ilias Sarantopoulos: ml-services: use envoy proxy in ores-legacy staging [deployment-charts] - https://gerrit.wikimedia.org/r/1040196 (https://phabricator.wikimedia.org/T366801) [14:53:53] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:cassandra-dev: Apply update to Java 11 - eevans@cumin1002 [14:54:38] (PS2) Ilias Sarantopoulos: ml-services: use envoy proxy in ores-legacy staging [deployment-charts] - https://gerrit.wikimedia.org/r/1040196 (https://phabricator.wikimedia.org/T366801) [14:55:05] !log enabling port et-1/0/2 for 100G mode on cr2-codfw T364095 [14:55:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:08] T364095: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095 [14:55:45] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:54] ops-codfw, SRE, DC-Ops, Observability-Metrics: Memory upgrade request for prometheus200[56] - 
https://phabricator.wikimedia.org/T360895#9871175 (Jhancock.wm) I don't have any new on hands. But I can pull some of the extra dimm from the decommissioned servers. I'll leave enough to make sure the se... [14:56:10] (CR) Elukey: [C:+1] ml-services: use envoy proxy in ores-legacy staging [deployment-charts] - https://gerrit.wikimedia.org/r/1040196 (https://phabricator.wikimedia.org/T366801) (owner: Ilias Sarantopoulos) [14:59:05] please hold off on any netbox changes. cc topranks [14:59:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1209 (T352010)', diff saved to https://phabricator.wikimedia.org/P64266 and previous config saved to /var/cache/conftool/dbconfig/20240607-145913-ladsgroup.json [14:59:15] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [14:59:20] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:59:29] sukhe: ty [14:59:29] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [14:59:30] FIRING: [2x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1019:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:59:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T352010)', diff saved to https://phabricator.wikimedia.org/P64267 and previous config saved to /var/cache/conftool/dbconfig/20240607-145937-ladsgroup.json [15:01:08] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on netbox1002.eqiad.wmnet with reason: Restoring DB from backup on netboxdb1002 [15:01:22] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
0:30:00 on netbox1002.eqiad.wmnet with reason: Restoring DB from backup on netboxdb1002 [15:04:30] FIRING: [3x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:09:30] RESOLVED: [3x] ProbeDown: Service wdqs1018:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:10:26] !log disabling netbox service on primary netbox server netbox1001 to restore db from backup [15:10:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:23] (CR) Ladsgroup: [C:+1] "I'll deploy it on Monday" [dns] - https://gerrit.wikimedia.org/r/1040144 (https://phabricator.wikimedia.org/T366649) (owner: Zabe) [15:14:08] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:cassandra-dev: Apply update to Java 11 - eevans@cumin1002 [15:15:54] SRE, Cassandra, Data-Persistence, Patch-For-Review: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567#9871246 (Eevans) >>! In T350567#9868944, @Eevans wrote: > cassandra-dev2001-{a,b} have been upgraded to Java 11 (canaries). And now all of cassandra-dev. [15:17:32] (CR) JMeybohm: [C:+2] "What Scott said. 
It would be nice to have it deployed within the next week as it's a pre-depencency for the preparations for the next k8s " [deployment-charts] - https://gerrit.wikimedia.org/r/1039727 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm) [15:18:26] (CR) Clément Goubert: [C:+2] miscweb: Update various modules [deployment-charts] - https://gerrit.wikimedia.org/r/1032525 (https://phabricator.wikimedia.org/T362978) (owner: Clément Goubert) [15:18:42] (PS1) Andrew Bogott: openstack clouds.yaml: specify 'interface: admin' when needed [puppet] - https://gerrit.wikimedia.org/r/1040204 [15:19:33] (Merged) jenkins-bot: miscweb: Update various modules [deployment-charts] - https://gerrit.wikimedia.org/r/1032525 (https://phabricator.wikimedia.org/T362978) (owner: Clément Goubert) [15:19:42] (CR) Andrew Bogott: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1040204 (owner: Andrew Bogott) [15:20:52] (CR) Clément Goubert: [C:+2] miscweb: Use a random miscweb image for default value [deployment-charts] - https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518) (owner: Clément Goubert) [15:21:06] SRE, Wikimedia-Mailing-lists: Fix lists.wmcloud.org - https://phabricator.wikimedia.org/T290110#9871260 (Ladsgroup) Open→Declined It was a test setup we built. Any further work should basically rebuild it from scratch. So let's decline this. 
[15:22:00] (Merged) jenkins-bot: miscweb: Use a random miscweb image for default value [deployment-charts] - https://gerrit.wikimedia.org/r/1038251 (https://phabricator.wikimedia.org/T362518) (owner: Clément Goubert) [15:22:14] (CR) Andrew Bogott: [C:+2] openstack clouds.yaml: specify 'interface: admin' when needed [puppet] - https://gerrit.wikimedia.org/r/1040204 (owner: Andrew Bogott) [15:23:02] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [15:24:05] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [15:24:44] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [15:24:45] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [15:24:58] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [15:25:26] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [15:29:52] (CR) JMeybohm: [C:+1] "Sweet, thanks!" [deployment-charts] - https://gerrit.wikimedia.org/r/1040153 (https://phabricator.wikimedia.org/T356252) (owner: Elukey) [15:30:24] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [15:30:52] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [15:30:56] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission db2098.codfw.wmnet - https://phabricator.wikimedia.org/T366877#9871289 (Jhancock.wm) Open→Resolved a:Jhancock.wm server and drives have been removed, but keeping them separate until recycling is picked up on 13th. 
[15:31:16] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [15:31:24] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission db2097.codfw.wmnet - https://phabricator.wikimedia.org/T362802#9871295 (Jhancock.wm) [15:31:48] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission db2098.codfw.wmnet - https://phabricator.wikimedia.org/T366877#9871298 (Jhancock.wm) [15:32:35] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission db2099.codfw.wmnet - https://phabricator.wikimedia.org/T362883#9871310 (Jhancock.wm) [15:33:14] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission db2099.codfw.wmnet - https://phabricator.wikimedia.org/T362883#9871312 (Jhancock.wm) Open→Resolved a:Jhancock.wm server and drives have been removed, but keeping them separate until recycling is picked up on 13th. [15:33:16] ops-codfw, SRE, DC-Ops, decommission-hardware: decommission db2097.codfw.wmnet - https://phabricator.wikimedia.org/T362802#9871317 (Jhancock.wm) Open→Resolved a:Jhancock.wm server and drives have been removed, but keeping them separate until recycling is picked up on 13th. 
[15:33:44] FIRING: [7x] SystemdUnitFailed: check_netbox_uncommitted_dns_changes.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:33:59] (CR) JMeybohm: [C:+1] docker::reporter: update k8s_rules.ini [puppet] - https://gerrit.wikimedia.org/r/1040179 (https://phabricator.wikimedia.org/T356252) (owner: Elukey) [15:34:29] (Abandoned) JMeybohm: WIP: Allow for differnt staging values per DC [deployment-charts] - https://gerrit.wikimedia.org/r/885791 (owner: JMeybohm) [15:34:31] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:34:54] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [15:35:04] ops-codfw, SRE, DC-Ops, Observability-Metrics: Memory upgrade request for prometheus200[56] - https://phabricator.wikimedia.org/T360895#9871322 (fgiunchedi) >>! In T360895#9871175, @Jhancock.wm wrote: > I don't have any new on hands. But I can pull some of the extra dimm from the decommissioned s... [15:35:06] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [15:36:03] topranks: netbox changes [15:36:07] -12 1H IN PTR et-1-0-2-103.cr2-codfw.wikimedia.org. [15:36:11] 13 1H IN PTR irb-103.ssw1-d8-codfw.codfw.wmnet. 
[15:36:15] diff --git a/6.3.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa b/6.3.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa [15:36:19] index e41bbe0..56b654d 100644 [15:36:23] --- a/6.3.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa [15:36:27] +++ b/6.3.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa [15:36:27] SRE, Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9871325 (Ladsgroup) I asked affcom and CR and my understanding from their points is that the number of hubs should be low, around a dozen in total. This is something for global council t... [15:36:31] @@ -1,2 +1 @@ [15:36:35] -1.0.0.0.0.0.0.0.0.0.0.0.0.0.0.0 1H IN PTR et-1-0-2-103.cr2-codfw.wikimedia.org. [15:36:39] er wow sorry [15:37:47] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [15:43:22] (PS2) Clément Goubert: Deprecate system::role for wikikube roles [puppet] - https://gerrit.wikimedia.org/r/1040124 (owner: Muehlenhoff) [15:44:35] (PS3) Clément Goubert: Deprecate system::role for wikikube roles [puppet] - https://gerrit.wikimedia.org/r/1040124 (owner: Muehlenhoff) [15:44:48] sukhe: feel free to go ahead, I'll add them back later [15:45:12] ok, I will clear it up then. 
thanks [15:45:21] (just so to not block other DNS changes) [15:45:47] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [15:46:20] (CR) Elukey: [C:+2] docker::reporter: update k8s_rules.ini [puppet] - https://gerrit.wikimedia.org/r/1040179 (https://phabricator.wikimedia.org/T356252) (owner: Elukey) [15:52:38] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: merging pending cr2-codfw changes - sukhe@cumin1002" [15:53:30] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: merging pending cr2-codfw changes - sukhe@cumin1002" [15:53:30] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:54:17] (PS1) Peter Fischer: Search update pipeline: enable rate limiting [deployment-charts] - https://gerrit.wikimedia.org/r/1040211 (https://phabricator.wikimedia.org/T362310) [15:55:38] (CR) Scott French: [C:+2] chromium-render: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1037196 (https://phabricator.wikimedia.org/T362978) (owner: Scott French) [15:56:29] (Merged) jenkins-bot: chromium-render: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1037196 (https://phabricator.wikimedia.org/T362978) (owner: Scott French) [15:58:49] (CR) Majavah: [C:-2] "This Prometheus instance is used to monitor Toolforge infrastructure itself and for a few reasons[0] I don't want any non-infrastucture wo" [puppet] - https://gerrit.wikimedia.org/r/1039850 (https://phabricator.wikimedia.org/T363371) (owner: DErenrich) [15:58:58] (CR) JMeybohm: [C:+1] "The mcrouter exporter now gets resources set. They look reasonable, but I wanted to point it out." 
[deployment-charts] - https://gerrit.wikimedia.org/r/1037166 (https://phabricator.wikimedia.org/T362978) (owner: Scott French) [15:59:03] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [15:59:51] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [16:03:02] (CR) JMeybohm: [C:-1] "No diff in CI, let me check" [deployment-charts] - https://gerrit.wikimedia.org/r/1040211 (https://phabricator.wikimedia.org/T362310) (owner: Peter Fischer) [16:03:29] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [16:05:41] (PS1) Majavah: P:toolforge::prometheus: Add clarification comment [puppet] - https://gerrit.wikimedia.org/r/1040213 [16:05:44] (PS4) Hnowlan: Upgrade base OS to Debian bookworm [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1039778 (https://phabricator.wikimedia.org/T355020) [16:05:54] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [16:06:19] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for moved telxius transpoort eqiad drmrs - cmooney@cumin1002" [16:06:29] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/proton: apply [16:07:48] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for moved telxius transpoort eqiad drmrs - cmooney@cumin1002" [16:07:48] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:08:07] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply [16:08:39] (PS2) JMeybohm: Search update pipeline: enable rate limiting [deployment-charts] - https://gerrit.wikimedia.org/r/1040211 (https://phabricator.wikimedia.org/T362310) (owner: Peter Fischer) [16:09:10] FIRING: [2x] SystemdUnitFailed: 
wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on elastic1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:09:15] (03CR) 10Majavah: [C:03+2] P:toolforge::prometheus: Add clarification comment [puppet] - 10https://gerrit.wikimedia.org/r/1040213 (owner: 10Majavah) [16:10:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T364299)', diff saved to https://phabricator.wikimedia.org/P64268 and previous config saved to /var/cache/conftool/dbconfig/20240607-161007-marostegui.json [16:10:13] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [16:12:04] (03CR) 10JMeybohm: [C:03+1] "Somebody did mess up when writing the docs, sorry. 😇" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040211 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer) [16:12:31] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:16:20] ^^ this is me, ignore for now [16:17:31] RECOVERY - BFD status on cr1-eqiad is OK: UP: 21 AdminDown: 2 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [16:18:45] FIRING: Primary inbound port utilisation over 80% #page: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:19:05] ^ related? 
[16:19:15] topranks: I guess [16:19:31] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [16:19:51] bblack: sry for the noise, yep I was over-confident in thinking this would not cause any alerts [16:19:52] all ok [16:19:54] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply [16:20:17] ok [16:20:17] ack [16:20:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T352010)', diff saved to https://phabricator.wikimedia.org/P64269 and previous config saved to /var/cache/conftool/dbconfig/20240607-162031-ladsgroup.json [16:20:35] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:20:56] FIRING: Primary outbound port utilisation over 80% #page: Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:21:09] guessing that's the other side of the same link [16:21:24] yeah it is [16:21:30] :) [16:22:54] hmm that is not ideal, but related I'm sure let me check [16:23:45] RESOLVED: Primary inbound port utilisation over 80% #page: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [16:25:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P64270 and previous config saved to /var/cache/conftool/dbconfig/20240607-162516-marostegui.json [16:25:56] RESOLVED: Primary outbound port utilisation over 80% #page: Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80% #page - 
https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:27:49] oh, maybe that was me? was running iperf at 10G in eqiad for a while (testing a fishy NIC) [16:28:54] sorry, I ran it twice back to back, so it was longer than I'd intended '^^ [16:29:57] kamila_: hey, I was gonna say I'm struggling to explain what happened :) [16:30:18] what we do know is your NIC is working quite well and generating gigabits per second of traffic [16:30:30] let's try not to do that again though :P [16:30:58] sorry '^^ [16:31:21] np... tbh between certain parts of our network (with newer/faster network gear) that would probably be fine [16:31:40] anything on the old virtual-chassis rows that is routing between rows (thus across our core routers), should be avoided [16:31:53] we can probably advise on src/dest hosts if you ever do need to do such a test [16:32:40] !log enabling new transport circuit from cr1-drmrs to cr2-eqiad T343385 [16:32:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:33:19] topranks: for now I'll just do a long-running low-bandwidth thing now that I've checked that this works, and I'll try to ask next time '^^ thanks [16:34:04] kamila_: np, where are you testing from/to? 
[16:35:27] PROBLEM - Check unit status of statograph_post on alert1001 is CRITICAL: CRITICAL: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:35:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P64271 and previous config saved to /var/cache/conftool/dbconfig/20240607-163539-ladsgroup.json [16:36:55] !log cdobbins@cumin1002 conftool action : set/pooled=no; selector: name=cp4048.ulsfo.wmnet [16:38:00] !log cdobbins@cumin1002 conftool action : set/pooled=yes; selector: name=4048.ulsfo.wmnet [16:39:33] topranks: the maybe fishy host is wikikube1001 (lost connection several times during a reimage and it just got a new NIC), I don't care where from (was using sretest1003) [16:39:44] *wikikube-ctrl1001 [16:40:22] ok give me a moment [16:40:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P64272 and previous config saved to /var/cache/conftool/dbconfig/20240607-164025-marostegui.json [16:41:54] kamila_: the ideal scenario is to keep the traffic within a single rack, but I'm not sure how realistic that is here none of the hosts there look like they are good for testing [16:41:54] https://netbox.wikimedia.org/dcim/racks/41/ [16:42:18] (03PS1) 10Giuseppe Lavagetto: Allow running CI in a container when using rootless podman [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040218 [16:42:49] that said, when you say it "lost connection" do you know what happened? 
[16:43:08] a throughput test may not be the best way to test the NIC for that kind of thing [16:43:30] topranks: I do not know what happened, so I'm not really sure what to test [16:43:45] observed behaviour was intermittent failures to pxe boot and dropped ssh connections [16:44:12] but I haven't seen anything weird since the install [16:44:30] The pxe boot thing could well be up the stack somewhere [16:44:34] SSH connections hard to say [16:44:40] I'd probably just test it with ping [16:44:44] yeah, and ssh connections could be anywhere too [16:44:52] ping I've already tried and it looked good [16:45:05] i.e. ping it constantly from sretest1003 (or bast1003 or cumin1002 maybe) [16:45:19] see if it has dropped any pings by the time you return Monday [16:45:24] I ran ping for about an hour and lost a single-digit number of packets [16:45:27] RECOVERY - Check unit status of statograph_post on alert1001 is OK: OK: Status of the systemd unit statograph_post https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [16:45:28] I'll do a longer one, ok [16:45:38] the other thing on the host is to issue "sudo ethtool -S enp59s0f1np1" [16:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:45:47] and check for any errors (CRC or otherwise) [16:46:41] ok, I'll do that, thank you! [16:47:50] it's showing zero so that's a good sign [16:48:35] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9871559 (10Dzahn) @lbowmaker Just making sure - this ticket needs an action from Data-Engineering to mov... 
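Editorial sketch of the `ethtool -S` check suggested above ("check for any errors (CRC or otherwise)"). It only parses text that has already been captured from the command. The counter names in `SAMPLE_OUTPUT`, the `ERROR_HINTS` substrings, and the function name `nonzero_error_counters` are assumptions for illustration; real counter names differ between NIC drivers.

```python
# Hypothetical helper: list the error-like counters that are non-zero
# in captured `ethtool -S <iface>` output.

ERROR_HINTS = ("err", "crc", "drop", "discard")  # assumed substrings, driver-dependent


def nonzero_error_counters(ethtool_output: str) -> dict[str, int]:
    """Return {counter_name: value} for error-like counters with value > 0."""
    found = {}
    for line in ethtool_output.splitlines():
        # Each statistic line looks like "     name: value".
        name, sep, value = line.strip().rpartition(":")
        value = value.strip()
        if not sep or not value.isdigit():
            continue  # skips the "NIC statistics:" header and blank lines
        if int(value) > 0 and any(hint in name.lower() for hint in ERROR_HINTS):
            found[name.strip()] = int(value)
    return found


# Made-up sample, not taken from wikikube-ctrl1001.
SAMPLE_OUTPUT = """\
NIC statistics:
     rx_packets: 1234567
     tx_packets: 7654321
     rx_crc_errors: 0
     rx_errors: 0
     tx_dropped: 3
"""

result = nonzero_error_counters(SAMPLE_OUTPUT)
print(result)
```

An empty result corresponds to the "it's showing zero" outcome in the conversation; any entry returned is a counter worth looking at before and after a long-running ping.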
[16:48:36] yeah, I really haven't seen anything weird since the reimage but I spent a day trying to actually finish reimaging it, so I'm hesitant to call it fine [16:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:50:27] I think it's fairly unlikely there is any network problem, given the amount of traffic through it in the iperf test and that it's showing no errors for any of those packets [16:50:39] yeah, true [16:50:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P64273 and previous config saved to /var/cache/conftool/dbconfig/20240607-165047-ladsgroup.json [16:50:48] that's why I wanted to do a high-bandwith one [16:51:02] (but I'll be a bit more patient next time :D) [16:54:07] (03PS1) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040219 (https://phabricator.wikimedia.org/T354718) [16:54:30] (03CR) 10CI reject: [V:04-1] varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040219 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [16:55:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T364299)', diff saved to https://phabricator.wikimedia.org/P64274 and previous config saved to /var/cache/conftool/dbconfig/20240607-165533-marostegui.json [16:55:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1221.eqiad.wmnet with reason: Maintenance [16:55:38] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [16:55:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1221.eqiad.wmnet with reason: Maintenance [16:55:52] !log marostegui@cumin1002 START - Cookbook 
sre.hosts.downtime for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:56:08] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [16:56:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T364299)', diff saved to https://phabricator.wikimedia.org/P64275 and previous config saved to /var/cache/conftool/dbconfig/20240607-165616-marostegui.json [16:59:50] kamila_: I get the logic, but in general reach out to us if you want to do any throughput testing across the network as depending on where the nodes are they may be in places (like here) where we don't have a huge amount of ample bandwidth between the two endpoints [17:00:12] particularly here the link from asw2-a-eqiad to the core routers is a bunch of 10G links in a bundle [17:00:23] there are multiples, so usually traffic is load-balanced across them [17:00:31] yeah, I'm sorry [17:00:46] but this is done on a SRC+DST IP basis, so a steady flow of 10G between two endpoints will end up on one of the links in a bundle [17:00:50] ah it's fine - no harm done [17:02:52] makes sense that the switch would put the flow on one port, I should have thought of that [17:03:42] another good place to look are the switch stats in LibreNMS, to see if there are any errors / problems on that side of the link [17:03:44] https://librenms.wikimedia.org/device/device=149/tab=port/port=30404/ [17:03:49] ^^ looks fairly clear in this case [17:04:29] ooh, that was what I wanted all along! 
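Editorial sketch of the SRC+DST IP load-balancing point made above: a single flow between two fixed endpoints always hashes to the same member link of a bundle, so one 10G iperf stream can fill one 10G link even when the bundle has spare capacity overall. This is not the switch's real hash function, which is vendor-specific. The SHA-256 hash, the function name `pick_member_link`, the bundle size of 4, and the IP addresses are all assumptions for illustration.

```python
import hashlib
import ipaddress


def pick_member_link(src_ip: str, dst_ip: str, n_links: int) -> int:
    """Map a (source IP, destination IP) pair to one member link index."""
    src = ipaddress.ip_address(src_ip).packed
    dst = ipaddress.ip_address(dst_ip).packed
    digest = hashlib.sha256(src + dst).digest()
    return int.from_bytes(digest[:4], "big") % n_links


N_LINKS = 4  # assumed number of 10G links in the bundle

# One iperf-style flow: every packet carries the same address pair,
# so every packet is placed on the same member link.
flow = ("10.64.0.10", "10.64.16.20")  # hypothetical addresses
links_used_by_flow = {pick_member_link(*flow, N_LINKS) for _ in range(1000)}

# Many different host pairs: traffic spreads over several member links.
pairs = [
    (f"10.64.0.{i}", f"10.64.16.{j}")
    for i in range(1, 21)
    for j in range(1, 21)
]
links_used_by_many = {pick_member_link(s, d, N_LINKS) for s, d in pairs}

print("single flow uses links:", links_used_by_flow)
print("many flows use links:", sorted(links_used_by_many))
```

The single flow reports exactly one link, which matches the observation that the port utilisation alerts fired on one side of one link while ordinary mixed traffic stays balanced.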
[17:04:47] okay, I'll try to not feel like i have to do everything on my own and ask next time '^^ [17:05:27] (03PS1) 10Scott French: admin_ng: bump CPU resourcequota for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040220 (https://phabricator.wikimedia.org/T362978) [17:05:29] (03PS1) 10Scott French: proton: drop replicas from 12 to 10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040221 (https://phabricator.wikimedia.org/T362978) [17:05:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T352010)', diff saved to https://phabricator.wikimedia.org/P64276 and previous config saved to /var/cache/conftool/dbconfig/20240607-170555-ladsgroup.json [17:05:59] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [17:05:59] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:06:12] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance [17:06:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:06:27] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [17:06:35] kamila_: ha no probs - it's a very good instinct to have! 
[17:06:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2164 (T352010)', diff saved to https://phabricator.wikimedia.org/P64277 and previous config saved to /var/cache/conftool/dbconfig/20240607-170634-ladsgroup.json [17:06:44] but we are always around to help so feel free to reach out :) [17:06:58] thanks :-) [17:11:17] (03PS5) 10Majavah: wikitech: Stop loading OpenStackManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038750 (https://phabricator.wikimedia.org/T161553) [17:11:17] (03PS1) 10Majavah: Reapply "wikitech: Replace OSM class in Gerrit blocking hook" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1040222 [17:12:15] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on cr2-codfw,cr2-codfw IPv6,re0.cr2-codfw.mgmt with reason: bouncing fpc 1 pic 0 on cr2-codfw [17:12:31] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on cr2-codfw,cr2-codfw IPv6,re0.cr2-codfw.mgmt with reason: bouncing fpc 1 pic 0 on cr2-codfw [17:12:36] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9871635 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4f6d5735-139a-4c58-be67-6179e7c2ab71) set by cmooney@cumin1002 fo... 
[17:13:00] (03PS2) 10Scott French: proton: drop replicas from 12 to 10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040221 (https://phabricator.wikimedia.org/T362978) [17:13:00] (03PS2) 10Scott French: admin_ng: bump CPU resourcequota for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040220 (https://phabricator.wikimedia.org/T362978) [17:13:04] (03CR) 10Majavah: "This works much better than the original patch:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1040222 (owner: 10Majavah) [17:15:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9871642 (10cmooney) @Jhancock.wm can you make sure to set the links to status "connected" in Netbox as they are added? And add the circuit I... [17:17:49] (03CR) 10Scott French: "Opted for a "why not both" approach to dealing with the quota issue: lock in at 10 replicas (what we've been using for months) and bump th" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040221 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [17:22:33] (03CR) 10Andrea Denisse: [C:03+1] "I just added a couple of comments regarding style issues, other than that this patch LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1040170 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi) [17:23:54] !log disable IP transit to Lumen AS3356 from cr2-eqiad to allow line card reset T364095 [17:23:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:59] T364095: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095 [17:24:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T352010)', diff saved to https://phabricator.wikimedia.org/P64278 and previous config saved to /var/cache/conftool/dbconfig/20240607-172432-ladsgroup.json [17:24:54] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:24:54] !log re-route traffic from cr2-eqord away from circuit to cr2-codfw to allow for line card reset T364095 [17:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:30] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:20:00 on cloudsw1-b1-codfw.mgmt,cr2-eqord,pfw3-codfw with reason: bouncing fpc 1 pic 0 on cr2-codfw [17:28:47] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:20:00 on cloudsw1-b1-codfw.mgmt,cr2-eqord,pfw3-codfw with reason: bouncing fpc 1 pic 0 on cr2-codfw [17:28:54] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9871683 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=ef0faea2-9357-469a-a5c3-aa5b1c50c748) set by cmooney@cumin1002 fo... 
[17:31:33] !log resetting line card 1/0 on cr2-codfw to enable new 100G link to ssw1-d8-codfw T364095 [17:31:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:36] T364095: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095 [17:34:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:39:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P64279 and previous config saved to /var/cache/conftool/dbconfig/20240607-173942-ladsgroup.json [17:40:09] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: codfw row C/D upgrade racking task - https://phabricator.wikimedia.org/T360789#9871721 (10cmooney) I tried to bring the //ssw1-d8-codfw// link to //cr2-codfw// up, but it doesn't look ready? Ticked above, but checking on cr2-codfw the... [17:44:01] 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9871726 (10Tarunno) That means this process will get hung up indefinitely since historically it is proven that policy making takes years to come up to any usable form. However, we cannot c... 
[17:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [17:54:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145', diff saved to https://phabricator.wikimedia.org/P64280 and previous config saved to /var/cache/conftool/dbconfig/20240607-175450-ladsgroup.json [17:56:12] (03CR) 10Ebernhardson: [C:03+2] cirrus: Remove cirrus_index.py script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037616 (owner: 10Ebernhardson) [17:56:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T364299)', diff saved to https://phabricator.wikimedia.org/P64281 and previous config saved to /var/cache/conftool/dbconfig/20240607-175643-marostegui.json [17:56:46] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [17:57:04] (03Merged) 10jenkins-bot: cirrus: Remove cirrus_index.py script [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037616 (owner: 10Ebernhardson) [17:57:25] (03Abandoned) 10Ebernhardson: cirrus: Report container log output on backfilling failure [deployment-charts] - 10https://gerrit.wikimedia.org/r/1016859 (owner: 10Ebernhardson) [17:57:51] (03Abandoned) 10Ebernhardson: cirrus: Add ability to backfill all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006567 (owner: 10Ebernhardson) [18:01:51] (03CR) 10Dzahn: [WIP] mediamoderation: Add one-off job for processing the Commons backlog (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/1040150 (owner: 10Kosta Harlan) [18:09:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2145 (T352010)', diff saved to https://phabricator.wikimedia.org/P64282 and previous config saved to /var/cache/conftool/dbconfig/20240607-180958-ladsgroup.json [18:10:01] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [18:10:03] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [18:10:14] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2146.codfw.wmnet with reason: Maintenance [18:10:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2146 (T352010)', diff saved to https://phabricator.wikimedia.org/P64283 and previous config saved to /var/cache/conftool/dbconfig/20240607-181021-ladsgroup.json [18:11:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P64284 and previous config saved to /var/cache/conftool/dbconfig/20240607-181151-marostegui.json [18:13:31] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [18:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:21:28] (03PS2) 10CDobbins: varnish: 
WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040219 (https://phabricator.wikimedia.org/T354718) [18:21:50] (03CR) 10CI reject: [V:04-1] varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040219 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [18:22:36] (03PS1) 10Dduvall: mediawiki.diff: Fix color regression and also use one more token [core] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1040081 (https://phabricator.wikimedia.org/T366845) [18:25:29] 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9871795 (10Bodhisattwa) >>! In T365915#9871325, @Ladsgroup wrote: > I asked affcom and CR and my understanding from their points is that the number of hubs should be low, around a in tota... [18:27:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221', diff saved to https://phabricator.wikimedia.org/P64285 and previous config saved to /var/cache/conftool/dbconfig/20240607-182700-marostegui.json [18:38:21] (03PS1) 10Majavah: P:exim::smarthost: log outbound mail with unsupported domains [puppet] - 10https://gerrit.wikimedia.org/r/1040238 (https://phabricator.wikimedia.org/T366935) [18:38:34] (03PS2) 10Majavah: P:exim::smarthost: log outbound mail with unsupported domains [puppet] - 10https://gerrit.wikimedia.org/r/1040238 (https://phabricator.wikimedia.org/T366935) [18:42:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1221 (T364299)', diff saved to https://phabricator.wikimedia.org/P64286 and previous config saved to /var/cache/conftool/dbconfig/20240607-184208-marostegui.json [18:42:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1238.eqiad.wmnet with reason: Maintenance [18:42:12] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [18:42:23] (03CR) 10Majavah: [C:03+2] P:exim::smarthost: log outbound mail with unsupported domains [puppet] 
- 10https://gerrit.wikimedia.org/r/1040238 (https://phabricator.wikimedia.org/T366935) (owner: 10Majavah) [18:42:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1238.eqiad.wmnet with reason: Maintenance [18:42:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1238 (T364299)', diff saved to https://phabricator.wikimedia.org/P64287 and previous config saved to /var/cache/conftool/dbconfig/20240607-184232-marostegui.json [18:44:48] (03PS1) 10Majavah: P:exim::smarthost: fix dkim domain listing [puppet] - 10https://gerrit.wikimedia.org/r/1040240 [18:45:12] (03CR) 10CI reject: [V:04-1] P:exim::smarthost: fix dkim domain listing [puppet] - 10https://gerrit.wikimedia.org/r/1040240 (owner: 10Majavah) [18:45:12] (03PS3) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040219 (https://phabricator.wikimedia.org/T354718) [18:45:35] (03CR) 10CI reject: [V:04-1] varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040219 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [18:46:52] (03PS2) 10Majavah: P:exim::smarthost: fix dkim domain listing [puppet] - 10https://gerrit.wikimedia.org/r/1040240 [18:47:13] (03CR) 10CI reject: [V:04-1] P:exim::smarthost: fix dkim domain listing [puppet] - 10https://gerrit.wikimedia.org/r/1040240 (owner: 10Majavah) [18:47:34] (03PS3) 10Majavah: P:exim::smarthost: fix dkim domain listing [puppet] - 10https://gerrit.wikimedia.org/r/1040240 [18:48:27] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2804/co" [puppet] - 10https://gerrit.wikimedia.org/r/1040240 (owner: 10Majavah) [18:51:11] (03CR) 10Majavah: [V:03+1 C:03+2] P:exim::smarthost: fix dkim domain listing [puppet] - 10https://gerrit.wikimedia.org/r/1040240 (owner: 10Majavah) [18:51:55] (03PS4) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040219 
(https://phabricator.wikimedia.org/T354718) [18:52:18] (03CR) 10CI reject: [V:04-1] varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040219 (https://phabricator.wikimedia.org/T354718) (owner: 10CDobbins) [18:55:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by dduvall@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1040081 (https://phabricator.wikimedia.org/T366845) (owner: 10Dduvall) [18:57:44] (03PS5) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040219 (https://phabricator.wikimedia.org/T354718) [18:57:52] (03PS1) 10Majavah: P:exim::smarthost: set log_message instead of message [puppet] - 10https://gerrit.wikimedia.org/r/1040241 [19:01:22] (03CR) 10Majavah: [C:03+2] P:exim::smarthost: set log_message instead of message [puppet] - 10https://gerrit.wikimedia.org/r/1040241 (owner: 10Majavah) [19:01:37] (03CR) 10CDanis: [C:03+2] helmfile.d: update oauth2-proxy in aux's Jaeger config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040159 (https://phabricator.wikimedia.org/T356252) (owner: 10Elukey) [19:02:33] (03PS2) 10Elukey: helmfile.d: update oauth2-proxy in aux's Jaeger config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040159 (https://phabricator.wikimedia.org/T356252) [19:03:16] (03CR) 10CDanis: [C:03+1] helmfile.d: update oauth2-proxy in aux's Jaeger config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040159 (https://phabricator.wikimedia.org/T356252) (owner: 10Elukey) [19:03:19] (03CR) 10CDanis: [C:03+2] helmfile.d: update oauth2-proxy in aux's Jaeger config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040159 (https://phabricator.wikimedia.org/T356252) (owner: 10Elukey) [19:03:27] (03PS1) 10Scott French: kubernetes: alert on persistent unavailable replicas [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) [19:03:28] (03CR) 10CDanis: [C:03+2] "recheck" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1040159 (https://phabricator.wikimedia.org/T356252) (owner: 10Elukey) [19:05:38] (03Merged) 10jenkins-bot: helmfile.d: update oauth2-proxy in aux's Jaeger config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040159 (https://phabricator.wikimedia.org/T356252) (owner: 10Elukey) [19:05:50] (03PS1) 10Andrew Bogott: Add 'keystoneify' admin script [puppet] - 10https://gerrit.wikimedia.org/r/1040243 (https://phabricator.wikimedia.org/T358496) [19:06:39] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [19:07:08] (03PS6) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040219 (https://phabricator.wikimedia.org/T354718) [19:07:20] !log cdanis@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [19:09:11] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9871917 (10BCornwall) A problem that we may potentially run into is predictable device naming being different than in previous deployments. We have different... 
[19:12:25] (03PS7) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040219 (https://phabricator.wikimedia.org/T354718) [19:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:21:00] (03CR) 10Eevans: [C:03+2] Upgrade data-gateway (staging) to v1.0.6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040168 (owner: 10Eevans) [19:24:03] (03Merged) 10jenkins-bot: Upgrade data-gateway (staging) to v1.0.6 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040168 (owner: 10Eevans) [19:25:06] !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/data-gateway: apply [19:25:26] !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [19:25:47] (03Merged) 10jenkins-bot: mediawiki.diff: Fix color regression and also use one more token [core] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1040081 (https://phabricator.wikimedia.org/T366845) (owner: 10Dduvall) [19:26:03] !log dduvall@deploy1002 Started scap: Backport for [[gerrit:1040081|mediawiki.diff: Fix color regression and also use one more token (T366845)]] [19:26:06] T366845: Black dots display on moved lines in diffs - https://phabricator.wikimedia.org/T366845 [19:28:35] !log dduvall@deploy1002 dduvall: Backport for [[gerrit:1040081|mediawiki.diff: Fix color regression and also use one more token (T366845)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:33:21] !log dduvall@deploy1002 dduvall: 
Continuing with sync [19:40:04] (03PS8) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040219 (https://phabricator.wikimedia.org/T354718) [19:42:13] !log dduvall@deploy1002 Finished scap: Backport for [[gerrit:1040081|mediawiki.diff: Fix color regression and also use one more token (T366845)]] (duration: 16m 10s) [19:42:18] T366845: Black dots display on moved lines in diffs - https://phabricator.wikimedia.org/T366845 [19:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:49:52] 06SRE, 06Infrastructure-Foundations, 10netops: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941 (10cmooney) 03NEW p:05Triage→03Medium [19:50:26] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9872013 (10cmooney) [19:50:26] 06SRE, 06Infrastructure-Foundations, 10netops: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9872014 (10cmooney) [19:52:32] 06SRE, 06Infrastructure-Foundations, 10netops: Switch BGP (EVPN) topology between rows/spines at core sites - https://phabricator.wikimedia.org/T365169#9872018 (10cmooney) 05Open→03Resolved Folk seem happy enough with this approach so I'll close this, automation has been added and working. 
[19:53:03] 06SRE, 06Infrastructure-Foundations, 10netops: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9872015 (10cmooney) [19:53:22] (03PS1) 10Func: CommonSettings: Restore the original behaviour of Reference Previews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039597 (https://phabricator.wikimedia.org/T366419) [19:59:11] 06SRE, 06Infrastructure-Foundations, 10netops: Move asw-c-codfw and asw-d-codfw CR uplinks to Spine switches - https://phabricator.wikimedia.org/T366941#9872044 (10cmooney) [20:00:48] (03PS9) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040219 (https://phabricator.wikimedia.org/T354718) [20:08:58] (03CR) 10Scott French: "Wanted to get your thoughts on this after our discussion earlier today." [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [20:09:10] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on elastic1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:10:45] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9872091 (10odimitrijevic) Approved [20:11:23] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for HNordeen - https://phabricator.wikimedia.org/T364801#9872092 (10odimitrijevic) Approved [20:11:57] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for HNordeen - https://phabricator.wikimedia.org/T364801#9872093 (10odimitrijevic) 05Stalled→03In progress [20:12:08] 06SRE, 10SRE-Access-Requests: Requesting access to 
analytics-privatedata-users (no kerberos, no ssh) for HNordeen - https://phabricator.wikimedia.org/T364801#9872094 (10odimitrijevic) a:05Ahoelzl→03Dzahn [20:12:41] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Tchanders - https://phabricator.wikimedia.org/T366351#9872095 (10odimitrijevic) Approved [20:13:11] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Rae Adimer - https://phabricator.wikimedia.org/T365832#9872096 (10odimitrijevic) approved [20:13:43] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9872097 (10odimitrijevic) Approved [20:14:07] (03PS1) 10CDobbins: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040253 [20:14:35] (03CR) 10BryanDavis: [V:03+1 C:03+1] Reapply "wikitech: Replace OSM class in Gerrit blocking hook" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1040222 (owner: 10Majavah) [20:14:41] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE - https://phabricator.wikimedia.org/T366558#9872103 (10odimitrijevic) Approved [20:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:18:41] (03PS1) 10Dzahn: move linkrecommendation service IP in place, fix outdated comments [dns] - 10https://gerrit.wikimedia.org/r/1040260 [20:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:23:49] (03PS1) 10Dzahn: add LVS service IPs 
for gitlab and gitlab-ssh [dns] - 10https://gerrit.wikimedia.org/r/1040261 (https://phabricator.wikimedia.org/T366882) [20:25:03] (03PS1) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [20:25:19] PROBLEM - Disk space on thanos-be2003 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 215685 MB (5% inode=91%): /srv/swift-storage/sde1 194464 MB (5% inode=92%): /srv/swift-storage/sdh1 197761 MB (5% inode=92%): /srv/swift-storage/sdc1 207985 MB (5% inode=91%): /srv/swift-storage/sdd1 183630 MB (4% inode=92%): /srv/swift-storage/sdg1 199704 MB (5% inode=91%): /srv/swift-storage/sdi1 194246 MB (5% inode=92%): /srv/swift-s [20:25:19] j1 195246 MB (5% inode=91%): /srv/swift-storage/sdk1 149579 MB (3% inode=90%): /srv/swift-storage/sdl1 204640 MB (5% inode=92%): /srv/swift-storage/sdm1 193632 MB (5% inode=91%): /srv/swift-storage/sdn1 187205 MB (4% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2003&var-datasource=codfw+prometheus/ops [20:26:07] (03CR) 10Dzahn: "confusing because https://gerrit.wikimedia.org/r/c/operations/dns/+/656430 already added those the right way" [dns] - 10https://gerrit.wikimedia.org/r/1040260 (owner: 10Dzahn) [20:27:59] (03CR) 10Dzahn: [C:03+2] "Thanks! This has approval now from Olja." 
[puppet] - 10https://gerrit.wikimedia.org/r/1035545 (https://phabricator.wikimedia.org/T364715) (owner: 10Dzahn) [20:29:45] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9872144 (10Dzahn) 05Stalled→03In progress [20:31:13] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9872152 (10Dzahn) [20:32:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T352010)', diff saved to https://phabricator.wikimedia.org/P64288 and previous config saved to /var/cache/conftool/dbconfig/20240607-203253-ladsgroup.json [20:32:57] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:33:47] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9872170 (10Dzahn) Kerberos principal created. [20:34:03] (03CR) 10Dzahn: [C:03+2] "kerberos principal created" [puppet] - 10https://gerrit.wikimedia.org/r/1035545 (https://phabricator.wikimedia.org/T364715) (owner: 10Dzahn) [20:37:26] (03CR) 10Ssingh: [C:03+1] "Nice catch!" [dns] - 10https://gerrit.wikimedia.org/r/1040260 (owner: 10Dzahn) [20:39:24] (03PS2) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [20:41:32] (03CR) 10Ssingh: "Looks good." 
[dns] - 10https://gerrit.wikimedia.org/r/1040261 (https://phabricator.wikimedia.org/T366882) (owner: 10Dzahn) [20:44:19] (03PS3) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [20:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:48:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P64289 and previous config saved to /var/cache/conftool/dbconfig/20240607-204801-ladsgroup.json [20:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:54:10] (03PS4) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [20:56:53] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9872223 (10Dzahn) a:03Dzahn Mareike, you should have received 2 emails, one about changing the password for your Kerberos us... 
[20:58:49] (03PS5) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [21:03:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207', diff saved to https://phabricator.wikimedia.org/P64290 and previous config saved to /var/cache/conftool/dbconfig/20240607-210310-ladsgroup.json [21:18:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1207 (T352010)', diff saved to https://phabricator.wikimedia.org/P64291 and previous config saved to /var/cache/conftool/dbconfig/20240607-211818-ladsgroup.json [21:18:21] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [21:18:22] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [21:18:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1218.eqiad.wmnet with reason: Maintenance [21:18:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1218 (T352010)', diff saved to https://phabricator.wikimedia.org/P64292 and previous config saved to /var/cache/conftool/dbconfig/20240607-211842-ladsgroup.json [21:28:47] (03PS6) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [21:34:41] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:36:05] PROBLEM - Disk space on thanos-be1001 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdf1 201280 MB (5% inode=92%): /srv/swift-storage/sdg1 201314 MB (5% inode=91%): /srv/swift-storage/sdc1 187114 MB (4% inode=92%): /srv/swift-storage/sdi1 190856 MB (5% inode=92%): /srv/swift-storage/sde1 199455 MB (5% 
inode=92%): /srv/swift-storage/sdh1 192157 MB (5% inode=91%): /srv/swift-storage/sdj1 204148 MB (5% inode=91%): /srv/swift-s [21:36:05] k1 176300 MB (4% inode=91%): /srv/swift-storage/sdd1 152147 MB (3% inode=90%): /srv/swift-storage/sdm1 200051 MB (5% inode=91%): /srv/swift-storage/sdl1 194309 MB (5% inode=92%): /srv/swift-storage/sdn1 184065 MB (4% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1001&var-datasource=eqiad+prometheus/ops [21:36:34] (03PS7) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [21:42:02] (03PS8) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [21:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:47:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T364069)', diff saved to https://phabricator.wikimedia.org/P64293 and previous config saved to /var/cache/conftool/dbconfig/20240607-214736-marostegui.json [21:47:40] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [21:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:56:25] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 213, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:57:17] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T364299)', diff saved to https://phabricator.wikimedia.org/P64294 and previous config saved to /var/cache/conftool/dbconfig/20240607-215716-marostegui.json [21:57:21] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [21:58:45] (03CR) 10Jdlrobson: CommonSettings: Restore the original behaviour of Reference Previews (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039597 (https://phabricator.wikimedia.org/T366419) (owner: 10Func) [21:59:19] (03CR) 10Jdlrobson: CommonSettings: Restore the original behaviour of Reference Previews (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039597 (https://phabricator.wikimedia.org/T366419) (owner: 10Func) [22:02:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to https://phabricator.wikimedia.org/P64295 and previous config saved to /var/cache/conftool/dbconfig/20240607-220244-marostegui.json [22:09:44] (03CR) 10Func: CommonSettings: Restore the original behaviour of Reference Previews (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1039597 (https://phabricator.wikimedia.org/T366419) (owner: 10Func) [22:12:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P64296 and previous config saved to /var/cache/conftool/dbconfig/20240607-221224-marostegui.json [22:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:17:25] PROBLEM - Host logstash2036 is DOWN: PING CRITICAL - Packet loss = 100% [22:17:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249', diff saved to 
https://phabricator.wikimedia.org/P64297 and previous config saved to /var/cache/conftool/dbconfig/20240607-221752-marostegui.json [22:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:19:21] RECOVERY - Host logstash2036 is UP: PING OK - Packet loss = 0%, RTA = 30.23 ms [22:26:24] (03PS9) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [22:27:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P64298 and previous config saved to /var/cache/conftool/dbconfig/20240607-222734-marostegui.json [22:31:02] (03PS10) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [22:33:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1249 (T364069)', diff saved to https://phabricator.wikimedia.org/P64299 and previous config saved to /var/cache/conftool/dbconfig/20240607-223300-marostegui.json [22:33:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:33:04] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [22:33:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:35:29] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:42:34] (03PS1) 10Ncmonitor: Automated MarkMonitor domain sync [dns] - 
10https://gerrit.wikimedia.org/r/1040277 [22:42:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T364299)', diff saved to https://phabricator.wikimedia.org/P64300 and previous config saved to /var/cache/conftool/dbconfig/20240607-224242-marostegui.json [22:42:45] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1241.eqiad.wmnet with reason: Maintenance [22:42:46] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [22:42:54] (03Abandoned) 10BCornwall: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1040277 (owner: 10Ncmonitor) [22:42:58] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1241.eqiad.wmnet with reason: Maintenance [22:43:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1241 (T364299)', diff saved to https://phabricator.wikimedia.org/P64301 and previous config saved to /var/cache/conftool/dbconfig/20240607-224306-marostegui.json [22:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:46:31] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 213, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:11:17] PROBLEM - Host logging-hd2001 is DOWN: PING CRITICAL - Packet loss = 100% [23:14:35] RECOVERY - Host logging-hd2001 is UP: PING OK - Packet loss = 
0%, RTA = 30.35 ms [23:15:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:18:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:22:15] (03Abandoned) 10BCornwall: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1039847 (owner: 10Ncmonitor) [23:24:33] (03PS1) 10Ncmonitor: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1040278 [23:26:23] PROBLEM - Host logging-hd2002 is DOWN: PING CRITICAL - Packet loss = 100% [23:28:51] RECOVERY - Host logging-hd2002 is UP: PING OK - Packet loss = 0%, RTA = 30.35 ms [23:30:37] PROBLEM - OpenSearch health check for shards on 9200 on logging-hd2002 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fdf9e7c6e10: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech [23:30:37] a.org/wiki/Search%23Administration [23:31:37] RECOVERY - OpenSearch health check for shards on 9200 on logging-hd2002 is OK: OK - elasticsearch status production-elk7-codfw: cluster_name: production-elk7-codfw, status: yellow, timed_out: False, number_of_nodes: 21, number_of_data_nodes: 15, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 680, active_shards: 1313, relocating_shards: 0, initializing_shards: 1, unassigned_shards: 237, delayed_unassigne [23:31:37] 0, 
number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 84.65506125080593 https://wikitech.wikimedia.org/wiki/Search%23Administration [23:31:56] (03Abandoned) 10BCornwall: Automated MarkMonitor domain sync [dns] - 10https://gerrit.wikimedia.org/r/1040278 (owner: 10Ncmonitor) [23:35:59] PROBLEM - Host logging-hd2003 is DOWN: PING CRITICAL - Packet loss = 100% [23:37:53] RECOVERY - Host logging-hd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.51 ms [23:38:36] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1039599 [23:38:36] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1039599 (owner: 10TrainBranchBot) [23:45:45] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:48:44] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed