[00:05:25] FIRING: SystemdUnitFailed: rsyslog-imfile-remedy.service on wikikube-worker1148:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:13] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167976
[00:08:13] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167976 (owner: TrainBranchBot)
[00:18:38] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167258 (owner: Krinkle)
[00:19:29] (Merged) jenkins-bot: beta: Remove beta-specific 'http' entry for wgGraphAllowedDomains [mediawiki-config] - https://gerrit.wikimedia.org/r/1167258 (owner: Krinkle)
[00:21:09] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167259 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[00:21:52] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1037.eqiad.wmnet with reason: host reimage
[00:22:17] (Merged) jenkins-bot: beta: Move beta wikipedia canonical to beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1167259 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[00:22:31] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1167259|beta: Move beta wikipedia canonical to beta.wmcloud.org (T289318)]]
[00:22:35] T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[00:24:31] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1167259|beta: Move beta wikipedia canonical to beta.wmcloud.org (T289318)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:26:24] (PS1) Krinkle: beta: Change FileRepo zone URL to upload.wikimedia.beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1167985 (https://phabricator.wikimedia.org/T289318)
[00:27:59] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1037.eqiad.wmnet with reason: host reimage
[00:28:28] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1167976 (owner: TrainBranchBot)
[00:29:20] !log krinkle@deploy1003 krinkle: Continuing with sync
[00:34:45] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167259|beta: Move beta wikipedia canonical to beta.wmcloud.org (T289318)]] (duration: 12m 13s)
[00:34:49] T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[00:39:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:42:43] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1167985 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[00:43:30] (Merged) jenkins-bot: beta: Change FileRepo zone URL to upload.wikimedia.beta.wmcloud.org [mediawiki-config] - https://gerrit.wikimedia.org/r/1167985 (https://phabricator.wikimedia.org/T289318) (owner: Krinkle)
[00:43:46] !log krinkle@deploy1003 Started scap sync-world: Backport for [[gerrit:1167985|beta: Change FileRepo zone URL to upload.wikimedia.beta.wmcloud.org (T289318)]]
[00:43:53] T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[00:45:41] !log krinkle@deploy1003 krinkle: Backport for [[gerrit:1167985|beta: Change FileRepo zone URL to upload.wikimedia.beta.wmcloud.org (T289318)]] synced to the testservers (see https://wikitech.wikimedia.org/wiki/Mwdebug). Changes can now be verified there.
[00:46:40] PROBLEM - Disk space on releases1003 is CRITICAL: DISK CRITICAL - /srv/docker/overlay2/c0c858105d9a6d6edb9405fa560c5bfba6e11a5808e356f3fc849e196f5c4227/merged is not accessible: Permission denied https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[00:47:23] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1037.eqiad.wmnet with OS bookworm
[00:47:34] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993884 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by andrew@cumin2002 for host cloudcephosd1037.eqiad.wm...
[00:50:01] !log krinkle@deploy1003 krinkle: Continuing with sync
[00:53:58] (Abandoned) Andrew Bogott: cloudcephosd1037: update nic names for Bookworm. [puppet] - https://gerrit.wikimedia.org/r/1167916 (https://phabricator.wikimedia.org/T396651) (owner: Andrew Bogott)
[00:55:16] !log krinkle@deploy1003 Finished scap sync-world: Backport for [[gerrit:1167985|beta: Change FileRepo zone URL to upload.wikimedia.beta.wmcloud.org (T289318)]] (duration: 11m 30s)
[00:55:20] T289318: Move *.beta.wmflabs.org to *.beta.wmcloud.org - https://phabricator.wikimedia.org/T289318
[00:57:34] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1038.eqiad.wmnet
[01:03:43] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1038.eqiad.wmnet
[01:06:40] RECOVERY - Disk space on releases1003 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=releases1003&var-datasource=eqiad+prometheus/ops
[01:17:28] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1038.eqiad.wmnet
[01:17:31] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cloudcephosd1038.eqiad.wmnet
[01:50:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[01:55:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[02:01:42] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1039.eqiad.wmnet
[02:09:34] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1039.eqiad.wmnet
[02:10:05] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1039.eqiad.wmnet
[02:10:43] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cloudcephosd1039.eqiad.wmnet
[02:11:29] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1039.eqiad.wmnet
[02:11:35] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hardware.upgrade-firmware (exit_code=97) upgrade firmware for hosts cloudcephosd1039.eqiad.wmnet
[02:13:52] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1040.eqiad.wmnet
[02:17:03] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1040.eqiad.wmnet
[02:30:25] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1040.eqiad.wmnet
[02:30:28] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cloudcephosd1040.eqiad.wmnet
[03:05:25] RESOLVED: SystemdUnitFailed: rsyslog-imfile-remedy.service on wikikube-worker1148:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:13:38] !log andrew@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts cloudcephosd1041.eqiad.wmnet
[03:15:12] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[03:21:23] !log andrew@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudcephosd1041.eqiad.wmnet
[03:35:02] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcephosd1041.eqiad.wmnet
[03:35:05] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts cloudcephosd1041.eqiad.wmnet
[03:54:00] SRE, serviceops, Wikimedia-Site-requests, Patch-For-Review: Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#10993982 (cscott) My current position is still outlined in T275319#9826396 above, and I'd love to help get some traction on tho...
[03:54:40] ops-eqiad, SRE, DC-Ops, cloud-services-team (Hardware), Patch-For-Review: SSD firmware update for cloudcephosd10[35-41] - https://phabricator.wikimedia.org/T396651#10993983 (Andrew) Open→Resolved I upgraded the firmware on all of these. My attempts to get them to bookworm at the s...
[05:03:43] FIRING: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:03:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:03:58] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:04:08] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:20:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:25:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:35:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:38:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[05:39:26] (CR) Ayounsi: [C:+1] "nice! to be fully tested but the approach lgtm!" [puppet] - https://gerrit.wikimedia.org/r/1167883 (owner: JHathaway)
[05:40:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:48:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[05:51:00] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[05:55:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[05:57:59] (PS2) Giuseppe Lavagetto: cache-text: remove static rate limiting [puppet] - https://gerrit.wikimedia.org/r/1166798 (https://phabricator.wikimedia.org/T398668)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250711T0600)
[06:00:56] PROBLEM - Host rpki2003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:01:29] that should come back up ^
[06:04:10] FIRING: GanetiBGPDown: BGP session down between ganeti2034 and lsw1-a4-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=lsw1-a4-codfw:9804&var-bgp_group=Ganeti4&var-bgp_neighbor=ganeti2034 - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown
[06:05:42] FIRING: JobUnavailable: Reduced availability for job jmx_idp in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:05:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[06:06:04] RECOVERY - Host rpki2003 is UP: PING WARNING - Packet loss = 75%, RTA = 33.79 ms
[06:09:10] RESOLVED: GanetiBGPDown: BGP session down between ganeti2034 and lsw1-a4-codfw - group Ganeti4 - https://wikitech.wikimedia.org/wiki/Ganeti#GanetiBGPDown - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=codfw&var-device=lsw1-a4-codfw:9804&var-bgp_group=Ganeti4&var-bgp_neighbor=ganeti2034 - https://alerts.wikimedia.org/?q=alertname%3DGanetiBGPDown
[06:10:42] RESOLVED: JobUnavailable: Reduced availability for job jmx_idp in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:11:16] PROBLEM - Host rpki2003 is DOWN: PING CRITICAL - Packet loss = 100%
[06:11:30] RECOVERY - Host rpki2003 is UP: PING OK - Packet loss = 0%, RTA = 33.94 ms
[06:25:25] (PS1) Marostegui: mariadb: Productionize es1048 [puppet] - https://gerrit.wikimedia.org/r/1168036 (https://phabricator.wikimedia.org/T395771)
[06:26:31] (CR) Marostegui: [C:+2] mariadb: Productionize es1048 [puppet] - https://gerrit.wikimedia.org/r/1168036 (https://phabricator.wikimedia.org/T395771) (owner: Marostegui)
[06:30:10] (PS1) Marostegui: db2213: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1168037 (https://phabricator.wikimedia.org/T398928)
[06:30:47] (CR) Marostegui: [C:+2] db2213: Migrate to MariaDB 10.11 [puppet] - https://gerrit.wikimedia.org/r/1168037 (https://phabricator.wikimedia.org/T398928) (owner: Marostegui)
[06:31:53] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2213.codfw.wmnet with reason: Maintenance
[06:31:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2213 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78892 and previous config saved to /var/cache/conftool/dbconfig/20250711-063156-marostegui.json
[06:39:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78893 and previous config saved to /var/cache/conftool/dbconfig/20250711-063922-root.json
[06:54:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78894 and previous config saved to /var/cache/conftool/dbconfig/20250711-065428-root.json
[06:55:40] (PS1) Krinkle: varnish: Improve GeoIP to use cookie domain similar to prod [puppet] - https://gerrit.wikimedia.org/r/1168038 (https://phabricator.wikimedia.org/T99226)
[06:56:52] (PS2) Krinkle: varnish: Improve GeoIP to use cookie domain similar to prod [puppet] - https://gerrit.wikimedia.org/r/1168038 (https://phabricator.wikimedia.org/T99226)
[06:56:54] (CR) Krinkle: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1168038 (https://phabricator.wikimedia.org/T99226) (owner: Krinkle)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250711T0700)
[07:08:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:08:48] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:09:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78895 and previous config saved to /var/cache/conftool/dbconfig/20250711-070933-root.json
[07:18:42] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired
[07:18:43] RESOLVED: CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:18:43] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:23:31] ops-codfw, SRE, DC-Ops: Inbound errors on interface cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://phabricator.wikimedia.org/T399097#10994182 (cmooney) The link remains down, Arelion are awaiting a replacement card for an optical system in Atlanta it seems:...
[07:23:43] FIRING: [5x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:23:48] FIRING: [19x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:24:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2213 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78896 and previous config saved to /var/cache/conftool/dbconfig/20250711-072439-root.json
[07:28:43] FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:28:48] FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:28:52] ops-codfw, SRE, DC-Ops: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025 - https://phabricator.wikimedia.org/T399097#10994194 (cmooney)
[07:33:43] FIRING: [18x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:34:44] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for addshore - https://phabricator.wikimedia.org/T399152#10994199 (MoritzMuehlenhoff)
[07:34:51] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for addshore - https://phabricator.wikimedia.org/T399152#10994201 (MoritzMuehlenhoff) @Milimetric @Ahoelzl @Ottomata This needs your approval
[07:36:48] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS trixie
[07:36:58] SRE, Infrastructure-Foundations: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10994206 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host sretest1003.eqiad.wmnet with OS trixie
[07:37:10] (PS1) Jgiannelos: changeprop: Ignore more commons NS on pcs rules [deployment-charts] - https://gerrit.wikimedia.org/r/1168041
[07:38:43] FIRING: [17x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:43:43] FIRING: [16x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:43:48] FIRING: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:43:58] FIRING: [4x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:48:43] FIRING: [10x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:48:43] RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:48:53] RESOLVED: [3x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:50:09] (PS3) Krinkle: varnish: Improve GeoIP to use cookie domain similar to prod [puppet] - https://gerrit.wikimedia.org/r/1168038 (https://phabricator.wikimedia.org/T99226)
[07:50:58] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#10994228 (cmooney)
[07:53:43] RESOLVED: [9x] CategoriesQueryServiceUpdateLagTooHigh: Categories Query service lag is above 2 days - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DCategoriesQueryServiceUpdateLagTooHigh
[07:56:22] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[08:02:21] (CR) Krinkle: "I'm looking at `./modules/varnish/files/tests/docker_run.sh cp1110.eqiad.wmnet 1168038` (after simulating a nearby failure on PS2) to look" [puppet] - https://gerrit.wikimedia.org/r/1168038 (https://phabricator.wikimedia.org/T99226) (owner: Krinkle)
[08:02:34] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage
[08:06:45] SRE, decommission-hardware, Infrastructure-Foundations: decommission puppetserver2003 - https://phabricator.wikimedia.org/T398607#10994286 (MoritzMuehlenhoff)
[08:18:05] !log jmm@cumin1003 START - Cookbook sre.hosts.decommission for hosts puppetserver2003.codfw.wmnet
[08:19:50] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2223.codfw.wmnet with reason: Maintenance
[08:19:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2223 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78897 and previous config saved to /var/cache/conftool/dbconfig/20250711-081953-marostegui.json
[08:20:47] jmm@cumin1003 decommission (PID 1245901) is awaiting input
[08:26:45] (PS2) Jgiannelos: changeprop: Ignore more namespace on pcs transclusion rules [deployment-charts] - https://gerrit.wikimedia.org/r/1168041
[08:27:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78898 and previous config saved to /var/cache/conftool/dbconfig/20250711-082725-root.json
[08:27:36] (PS3) Jgiannelos: changeprop: Ignore more namespace on pcs transclusion rules [deployment-charts] - https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072)
[08:30:46] !log jmm@cumin1003 START - Cookbook sre.dns.netbox
[08:33:13] SRE, collaboration-services, Infrastructure-Foundations, Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10994459 (MoritzMuehlenhoff) >>! In T378028#10993697, @Dzahn wrote: > But another question comes to mind.. and that is.. do VRTS machi...
[08:33:58] !log jmm@cumin1003 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetserver2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[08:34:17] !log jmm@cumin1003 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: puppetserver2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - jmm@cumin1003"
[08:34:17] !log jmm@cumin1003 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:34:18] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts puppetserver2003.codfw.wmnet
[08:34:27] SRE, decommission-hardware, Infrastructure-Foundations: decommission puppetserver2003 - https://phabricator.wikimedia.org/T398607#10994460 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by jmm@cumin1003 for hosts: `puppetserver2003.codfw.wmnet` - puppetserver2003.codfw.wmnet (**PASS**...
[08:36:08] (PS1) Muehlenhoff: Remove puppetserver2003 [puppet] - https://gerrit.wikimedia.org/r/1168121 (https://phabricator.wikimedia.org/T398607)
[08:37:36] (CR) Arnaudb: [C:+1] "lgtm! thanks for the amend! lets merge and move on to the next thing!" [puppet] - https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: Dzahn)
[08:41:28] (PS1) Ayounsi: magru: add Ufinet transit [homer/public] - https://gerrit.wikimedia.org/r/1168122
[08:41:45] (PS2) Ayounsi: magru: add Ufinet transit [homer/public] - https://gerrit.wikimedia.org/r/1168122 (https://phabricator.wikimedia.org/T389767)
[08:41:47] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: sync
[08:42:06] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: sync
[08:42:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78899 and previous config saved to /var/cache/conftool/dbconfig/20250711-084230-root.json
[08:45:37] (CR) Elukey: [C:+2] profile::docker::reporter: add Wikikube and ML serve prod clusters [puppet] - https://gerrit.wikimedia.org/r/1167885 (https://phabricator.wikimedia.org/T397696) (owner: Elukey)
[08:51:28] !log elukey@deploy1003 helmfile [eqiad] START helmfile.d/admin 'sync'.
[08:51:29] !log elukey@deploy1003 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[08:51:43] (Abandoned) Slyngshede: data.yaml offboarding trokhymovych [puppet] - https://gerrit.wikimedia.org/r/1164707 (owner: Slyngshede)
[08:51:51] !log elukey@deploy1003 helmfile [codfw] START helmfile.d/admin 'sync'.
[08:51:55] !log elukey@deploy1003 helmfile [codfw] DONE helmfile.d/admin 'sync'.
[08:54:25] (03PS4) 10Jgiannelos: changeprop: Ignore more namespace on pcs transclusion rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072) [08:57:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78900 and previous config saved to /var/cache/conftool/dbconfig/20250711-085736-root.json [09:00:14] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1168124 (https://phabricator.wikimedia.org/T399280) [09:01:09] (03CR) 10Filippo Giunchedi: "Thank you for the patch -- LGTM, I'm thinking we can add raises=False to task_comment() since I don't think it is fatal if we don't commen" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167887 (owner: 10Volans) [09:02:25] (03CR) 10Elukey: [C:03+2] "I think it is worth to try it, let's see how it goes and if we have to follow up or not!" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156315 (owner: 10Hashar) [09:03:29] (03CR) 10Elukey: [C:03+2] "Done" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1156733 (owner: 10Hashar) [09:03:39] (03CR) 10Muehlenhoff: [C:03+2] Remove puppetserver2003 [puppet] - 10https://gerrit.wikimedia.org/r/1168121 (https://phabricator.wikimedia.org/T398607) (owner: 10Muehlenhoff) [09:03:55] (03Abandoned) 10Elukey: TEST - fix http_boot_once for reimage [cookbooks] - 10https://gerrit.wikimedia.org/r/1166378 (owner: 10Elukey) [09:04:16] !log jmm@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1003.eqiad.wmnet with OS trixie [09:04:22] 06SRE, 06Infrastructure-Foundations: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10994543 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host sretest1003.eqiad.wmnet with OS trixie executed with errors: - 
srete... [09:05:27] 10ops-codfw, 06SRE, 06DC-Ops, 10decommission-hardware, and 2 others: decommission puppetserver2003 - https://phabricator.wikimedia.org/T398607#10994546 (10MoritzMuehlenhoff) [09:06:42] (03CR) 10Elukey: [C:03+1] I/F: simplify Phabricator usage [cookbooks] - 10https://gerrit.wikimedia.org/r/1167886 (owner: 10Volans) [09:07:53] (03PS5) 10Jgiannelos: changeprop: Ignore more namespace on pcs transclusion rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072) [09:09:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:12:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2223 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78901 and previous config saved to /var/cache/conftool/dbconfig/20250711-091242-root.json [09:12:47] (03PS6) 10Jgiannelos: changeprop: Ignore more namespace on pcs transclusion rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072) [09:13:25] (03PS7) 10Jgiannelos: changeprop: Ignore more namespace on pcs transclusion rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072) [09:15:19] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 22 hosts with reason: Primary switchover s5 T399280 [09:15:22] T399280: Switchover s5 master (db2192 -> db2213) - https://phabricator.wikimedia.org/T399280 [09:15:59] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10994585 (10Arnoldokoth) @Dzahn We used to run it on VMs but we kept running into resource issues (especially with 
`clamav`) even after... [09:16:39] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Netbox: remove old cr2-codfw Switch Control Board inventory items - https://phabricator.wikimedia.org/T398940#10994586 (10ayounsi) We can remove them from Netbox if they're not in the device anymore. and add them to the spare tracking... [09:18:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2213 from API/vslow/dump T399280', diff saved to https://phabricator.wikimedia.org/P78902 and previous config saved to /var/cache/conftool/dbconfig/20250711-091812-root.json [09:20:35] (03CR) 10Gmodena: "LGTM. Just left two nit/questions." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167438 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:21:57] jmm@cumin1003 reimage (PID 1251864) is awaiting input [09:22:14] (03PS2) 10Elukey: EventStreamConfig: add the maps.tiles_change_bookworm stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167438 (https://phabricator.wikimedia.org/T381565) [09:22:36] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2213 to s5 master [puppet] - 10https://gerrit.wikimedia.org/r/1168124 (https://phabricator.wikimedia.org/T399280) (owner: 10Gerrit maintenance bot) [09:22:52] (03CR) 10Filippo Giunchedi: "LGTM overall, see inline" [puppet] - 10https://gerrit.wikimedia.org/r/1167157 (https://phabricator.wikimedia.org/T397003) (owner: 10Tiziano Fogli) [09:23:29] (03PS8) 10Jgiannelos: changeprop: Ignore more namespaces on pcs transclusion rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072) [09:24:35] (03CR) 10Elukey: "Thanks Gabriele!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167438 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [09:24:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [09:25:15] !log imported perccli for trixie-wikimedia T391083 [09:25:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:19] T391083: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083 [09:27:49] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS trixie [09:27:59] 06SRE, 06Infrastructure-Foundations: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10994632 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin1003 for host sretest1003.eqiad.wmnet with OS trixie [09:28:06] (03PS9) 10Jgiannelos: changeprop: Ignore more namespaces on pcs transclusion rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072) [09:29:27] !log Starting s5 codfw failover from db2192 to db2213 - T399280 [09:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:29:30] T399280: Switchover s5 master (db2192 -> db2213) - https://phabricator.wikimedia.org/T399280 [09:30:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2213 to s5 primary T399280', diff saved to https://phabricator.wikimedia.org/P78903 and previous config saved to /var/cache/conftool/dbconfig/20250711-093006-marostegui.json [09:31:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2192 T399280', diff saved to https://phabricator.wikimedia.org/P78904 and previous config saved to 
/var/cache/conftool/dbconfig/20250711-093115-root.json [09:33:32] (03CR) 10Hnowlan: [C:03+1] changeprop: Ignore more namespaces on pcs transclusion rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072) (owner: 10Jgiannelos) [09:34:26] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1167886 (owner: 10Volans) [09:37:03] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10994674 (10MoritzMuehlenhoff) That said, if you striclty need a physical host for the tests, you could use puppetserver2003. I decommed... [09:38:03] (03PS1) 10Marostegui: db2192: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1168131 (https://phabricator.wikimedia.org/T398928) [09:38:44] (03CR) 10Marostegui: [C:03+2] db2192: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1168131 (https://phabricator.wikimedia.org/T398928) (owner: 10Marostegui) [09:39:14] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2192.codfw.wmnet with reason: Maintenance [09:42:51] 10SRE-swift-storage, 10Thumbor: Gradually drop all thumbnails as a one-off clean up - https://phabricator.wikimedia.org/T379942#10994702 (10Ladsgroup) [09:43:13] (03PS1) 10Cathal Mooney: admin: grant aranyap access and add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1168132 (https://phabricator.wikimedia.org/T398650) [09:44:11] (03CR) 10CI reject: [V:04-1] admin: grant aranyap access and add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1168132 (https://phabricator.wikimedia.org/T398650) (owner: 10Cathal Mooney) [09:45:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78905 and previous config saved to 
/var/cache/conftool/dbconfig/20250711-094527-root.json [09:45:40] (03PS2) 10Cathal Mooney: admin: grant aranyap access and add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1168132 (https://phabricator.wikimedia.org/T398650) [09:46:59] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1168132 (https://phabricator.wikimedia.org/T398650) (owner: 10Cathal Mooney) [09:55:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:55:20] (03CR) 10Cathal Mooney: [C:03+2] admin: grant aranyap access and add to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1168132 (https://phabricator.wikimedia.org/T398650) (owner: 10Cathal Mooney) [09:55:36] (03CR) 10Hnowlan: [C:03+2] changeprop: Ignore more namespaces on pcs transclusion rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072) (owner: 10Jgiannelos) [09:57:11] (03CR) 10Ladsgroup: "None of the tables being dropped exist in production but maybe we should reload the triggers or something but it's noop AFAIK" [puppet] - 10https://gerrit.wikimedia.org/r/1167576 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup) [09:57:29] (03Merged) 10jenkins-bot: changeprop: Ignore more namespaces on pcs transclusion rules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168041 (https://phabricator.wikimedia.org/T397072) (owner: 10Jgiannelos) [09:58:41] (03CR) 10Ladsgroup: [C:03+1] "I try to merge and deploy it next week if noone beats me to it." 
[dumps] - 10https://gerrit.wikimedia.org/r/1149464 (owner: 10Amire80) [09:59:26] (03PS2) 10Ayounsi: WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 [10:00:05] (03CR) 10Btullis: [C:03+1] Make functionally identical descriptions the same [dumps] - 10https://gerrit.wikimedia.org/r/1149464 (owner: 10Amire80) [10:00:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78906 and previous config saved to /var/cache/conftool/dbconfig/20250711-100033-root.json [10:01:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2192', diff saved to https://phabricator.wikimedia.org/P78907 and previous config saved to /var/cache/conftool/dbconfig/20250711-100106-root.json [10:03:04] (03CR) 10Btullis: [C:03+1] "When merged, this will be automatically added to the mediawiki-cli image during its next full build, due to this: https://gitlab.wikimedia" [dumps] - 10https://gerrit.wikimedia.org/r/1149464 (owner: 10Amire80) [10:03:58] (03CR) 10Gmodena: [C:03+1] EventStreamConfig: add the maps.tiles_change_bookworm stream (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167438 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [10:05:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78908 and previous config saved to /var/cache/conftool/dbconfig/20250711-100522-root.json [10:07:00] (03CR) 10CI reject: [V:04-1] WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 (owner: 10Ayounsi) [10:09:43] (03CR) 10Elukey: EventStreamConfig: add the maps.tiles_change_bookworm stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167438 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [10:10:41] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to 
analytics-privatedata-users group (LDAP and kerberos), for aprum - https://phabricator.wikimedia.org/T398650#10994757 (10cmooney) Hi @aranyap. I have added your public ssh key and username 'aranyap' to the //analytics-privatedata-users... [10:10:45] (03PS3) 10Ayounsi: WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 [10:11:39] (03PS1) 10Hnowlan: changeprop: correct amended regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168144 (https://phabricator.wikimedia.org/T397072) [10:11:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [10:12:30] (03CR) 10Jgiannelos: [C:03+1] changeprop: correct amended regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168144 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:15:59] (03PS2) 10Hnowlan: changeprop: correct amended regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168144 (https://phabricator.wikimedia.org/T397072) [10:17:25] (03CR) 10CI reject: [V:04-1] WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 (owner: 10Ayounsi) [10:18:59] (03PS4) 10Ayounsi: WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 [10:19:00] jmm@cumin1003 reimage (PID 1251864) is awaiting input [10:19:58] (03CR) 10Hnowlan: [C:03+2] changeprop: correct amended regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168144 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:20:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78909 and previous config saved to /var/cache/conftool/dbconfig/20250711-102027-root.json [10:25:40] (03PS1) 10Muehlenhoff: 
late-command: Check whether qemu_fw_cfg.ko is present [puppet] - 10https://gerrit.wikimedia.org/r/1168145 (https://phabricator.wikimedia.org/T391083) [10:26:27] !log jmm@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest1003.eqiad.wmnet with OS trixie [10:27:44] (03PS2) 10Elukey: services: configure tegola in codfw to use maps-test [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165550 (https://phabricator.wikimedia.org/T381565) [10:27:44] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10994794 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host sretest1003.eqiad.wmnet with OS trixie execute... [10:27:44] (03CR) 10Elukey: services: configure tegola in codfw to use maps-test (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1165550 (https://phabricator.wikimedia.org/T381565) (owner: 10Elukey) [10:27:55] 06SRE, 06Infrastructure-Foundations, 10netops: Cloudcephosd: migrate to single network uplink - https://phabricator.wikimedia.org/T399180#10994796 (10cmooney) [10:29:23] (03Merged) 10jenkins-bot: changeprop: correct amended regex [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168144 (https://phabricator.wikimedia.org/T397072) (owner: 10Hnowlan) [10:30:40] !log hnowlan@deploy1003 helmfile [staging] START helmfile.d/services/changeprop: apply [10:30:48] !log hnowlan@deploy1003 helmfile [staging] DONE helmfile.d/services/changeprop: apply [10:31:26] !log hnowlan@deploy1003 helmfile [codfw] START helmfile.d/services/changeprop: apply [10:31:52] !log hnowlan@deploy1003 helmfile [codfw] DONE helmfile.d/services/changeprop: apply [10:32:05] !log hnowlan@deploy1003 helmfile [eqiad] START helmfile.d/services/changeprop: apply [10:32:15] !log hnowlan@deploy1003 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [10:34:35] (03PS1) 10Ladsgroup: 
private_tables: Drop private tables that don't exist in production [puppet] - 10https://gerrit.wikimedia.org/r/1168148 (https://phabricator.wikimedia.org/T398945) [10:35:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78910 and previous config saved to /var/cache/conftool/dbconfig/20250711-103533-root.json [10:36:35] (03CR) 10CI reject: [V:04-1] WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 (owner: 10Ayounsi) [10:37:58] (03PS1) 10Hnowlan: profile::hcaptcha: don't serve / or robots.txt [puppet] - 10https://gerrit.wikimedia.org/r/1168149 (https://phabricator.wikimedia.org/T397841) [10:39:51] (03PS1) 10Tiziano Fogli: nrpe wrapper: add wrapper to be invoked a systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1168150 (https://phabricator.wikimedia.org/T395446) [10:41:01] (03PS5) 10Ayounsi: WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 [10:41:08] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (shell membership, ssh key) for STran - https://phabricator.wikimedia.org/T399107#10994846 (10STran) [10:49:32] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (shell membership, ssh key) for STran - https://phabricator.wikimedia.org/T399107#10994883 (10cmooney) [10:50:37] (03PS1) 10Cathal Mooney: admin: add user 'stran' to analytics-privatedata-users and enable kerberos [puppet] - 10https://gerrit.wikimedia.org/r/1168152 (https://phabricator.wikimedia.org/T399107) [10:50:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2192 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78911 and previous config saved to /var/cache/conftool/dbconfig/20250711-105039-root.json [10:50:43] PROBLEM - ganeti-noded running on ganeti1036 is CRITICAL: PROCS CRITICAL: 3 processes with UID = 0 
(root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [10:51:43] RECOVERY - ganeti-noded running on ganeti1036 is OK: PROCS OK: 2 processes with UID = 0 (root), command name ganeti-noded https://wikitech.wikimedia.org/wiki/Ganeti [10:51:57] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users group (shell membership, ssh key) for STran - https://phabricator.wikimedia.org/T399107#10994891 (10cmooney) I verified the above key over slack and can confirm that it is not in use for WMCS access. [10:58:34] (03PS1) 10Btullis: Increase the limitranges for the spark-history service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168153 (https://phabricator.wikimedia.org/T396617) [10:59:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P78912 and previous config saved to /var/cache/conftool/dbconfig/20250711-105922-root.json [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250711T0700) [11:00:04] jelto, arnoldokoth, and mutante: GitLab version upgrades (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250711T1100). Please do the needful. 
[11:03:41] (03CR) 10Marostegui: "thanks, fine to merge from my side" [puppet] - 10https://gerrit.wikimedia.org/r/1167576 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup) [11:14:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P78913 and previous config saved to /var/cache/conftool/dbconfig/20250711-111428-root.json [11:15:13] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [11:20:55] (03CR) 10Ayounsi: [C:03+1] late-command: Check whether qemu_fw_cfg.ko is present [puppet] - 10https://gerrit.wikimedia.org/r/1168145 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff) [11:26:32] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1031.eqiad.wmnet with reason: Maintenance [11:26:45] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye [11:29:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P78914 and previous config saved to /var/cache/conftool/dbconfig/20250711-112933-root.json [11:29:42] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2192.codfw.wmnet with reason: Maintenance [11:30:13] !log fceratto@cumin1002 START - Cookbook sre.hosts.remove-downtime for es1031.eqiad.wmnet [11:30:13] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es1031.eqiad.wmnet [11:31:04] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1032.eqiad.wmnet with reason: Maintenance [11:34:58] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1034.eqiad.wmnet with 
reason: Maintenance [11:35:32] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depool es1034 for upgrade', diff saved to https://phabricator.wikimedia.org/P78915 and previous config saved to /var/cache/conftool/dbconfig/20250711-113532-fceratto.json [11:44:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1039 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P78916 and previous config saved to /var/cache/conftool/dbconfig/20250711-114439-root.json [11:45:00] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1168152 (https://phabricator.wikimedia.org/T399107) (owner: 10Cathal Mooney) [11:45:51] (03PS1) 10Btullis: Enable greater timeouts and rewriting for the spark-history service [puppet] - 10https://gerrit.wikimedia.org/r/1168165 (https://phabricator.wikimedia.org/T396617) [11:46:35] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/6242/co" [puppet] - 10https://gerrit.wikimedia.org/r/1168165 (https://phabricator.wikimedia.org/T396617) (owner: 10Btullis) [11:46:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [11:52:26] 06SRE, 10SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10995078 (10sowmya.guru) Hey folks the NDA is signed by me! 
[11:52:46] !log fceratto@cumin1002 START - Cookbook sre.hosts.reboot-single for host es1034.eqiad.wmnet [11:56:26] PROBLEM - Ensure trafficserver_exporter is running for instance backend on cp7006 is CRITICAL: PROCS CRITICAL: 3 processes with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:57:26] RECOVERY - Ensure trafficserver_exporter is running for instance backend on cp7006 is OK: PROCS OK: 1 process with args /usr/bin/python3 /usr/bin/prometheus-trafficserver-exporter --no-procstats --no-ssl-verification --endpoint http://127.0.0.1:3128/_stats --port 9122 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [11:58:00] (03PS1) 10Jcrespo: mariadb: Upgrade db2200 to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1168166 (https://phabricator.wikimedia.org/T399298) [12:01:29] !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [12:01:41] !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:02:00] (03CR) 10Muehlenhoff: [C:03+2] late-command: Check whether qemu_fw_cfg.ko is present [puppet] - 10https://gerrit.wikimedia.org/r/1168145 (https://phabricator.wikimedia.org/T391083) (owner: 10Muehlenhoff) [12:03:14] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host es1034.eqiad.wmnet [12:04:10] !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [12:04:15] !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:06:51] !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [12:06:53] !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:08:09] 
(03CR) 10Daimona Eaytoy: mariadb: Remove tables that are not cataloged from filtered_tables.txt (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1167576 (https://phabricator.wikimedia.org/T398946) (owner: 10Ladsgroup) [12:09:58] (03CR) 10Muehlenhoff: openstack: nova: Load nf_conntrack module at boot (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1167899 (https://phabricator.wikimedia.org/T399212) (owner: 10FNegri) [12:16:45] !log fceratto@cumin1002 START - Cookbook sre.hosts.remove-downtime for es1034.eqiad.wmnet [12:16:45] !log fceratto@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for es1034.eqiad.wmnet [12:17:11] !log andrew@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1037.eqiad.wmnet with OS bullseye [12:17:32] (03PS1) 10Marostegui: db2187: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1168167 (https://phabricator.wikimedia.org/T399298) [12:17:34] !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [12:17:38] !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:17:52] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es1034 gradually with 4 steps - Pooling in [12:17:54] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es1034 gradually with 4 steps - Pooling in [12:18:15] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es1034 gradually with 4 steps - Pooling in [12:18:18] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.pool (exit_code=99) es1034 gradually with 4 steps - Pooling in [12:18:30] !log fceratto@cumin1002 START - Cookbook sre.mysql.pool es1034 gradually with 4 steps - Pooling in [12:18:38] !log jmm@cumin1003 START - Cookbook sre.hosts.reimage for host sretest1003.eqiad.wmnet with OS trixie [12:20:46] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host 
cloudcephosd1037.eqiad.wmnet with OS bullseye [12:22:26] !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [12:22:31] !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:24:49] !log gmodena@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [12:24:53] !log gmodena@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [12:28:09] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on es1032.eqiad.wmnet with reason: Maintenance [12:28:47] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool es1032 - Depooling RO host [12:28:51] (03PS1) 10Daimona Eaytoy: Clean up some settings for special wikis no longer in wikipedia group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168169 (https://phabricator.wikimedia.org/T183549) [12:28:56] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es1032 - Depooling RO host [12:29:40] (03CR) 10CI reject: [V:04-1] Clean up some settings for special wikis no longer in wikipedia group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168169 (https://phabricator.wikimedia.org/T183549) (owner: 10Daimona Eaytoy) [12:30:12] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1049.eqiad.wmnet with OS bookworm [12:30:14] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1051.eqiad.wmnet with OS bookworm [12:30:14] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephosd1050.eqiad.wmnet with OS bookworm [12:30:20] !log fceratto@cumin1002 START - Cookbook sre.mysql.depool es1032 - Depooling RO host [12:30:24] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.depool (exit_code=99) es1032 - Depooling RO host [12:30:27] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: 
Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10995162 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1049.eq... [12:30:28] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10995163 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1051.eq... [12:30:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10995164 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephosd1050.eq... [12:31:11] (03CR) 10Cathal Mooney: [C:03+2] admin: add user 'stran' to analytics-privatedata-users and enable kerberos [puppet] - 10https://gerrit.wikimedia.org/r/1168152 (https://phabricator.wikimedia.org/T399107) (owner: 10Cathal Mooney) [12:33:12] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10995177 (10Jclark-ctr) [12:33:15] !log jmm@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [12:34:38] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users group (shell membership, ssh key) for STran - https://phabricator.wikimedia.org/T399107#10995186 (10cmooney) Ok @STran I think you should be good to go now if you want to test the access. 
[12:38:22] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1003.eqiad.wmnet with reason: host reimage [12:39:08] 10ops-codfw, 06SRE, 06DC-Ops: Arelion IC-374549 100G Transport outage (cr1-codfw -> cr1-eqiad) July 2025 - https://phabricator.wikimedia.org/T399097#10995188 (10cmooney) Still no fix ` 7/11/2025 10:22:47 AM ETA is 1:00 PM UTC 7/11/2025 9:23:02 AM We have escalated with our vendor to ensure the testing a... [12:39:26] hello, is T399297 the task to keep in eye on regarding the beta sites being down? [12:39:26] T399297: Widespread instances down in project deployment-prep - https://phabricator.wikimedia.org/T399297 [12:40:28] !incidents [12:40:28] No incidents occurred in the past 24 hours for team SRE [12:40:35] 🥳 [12:41:33] (03PS4) 10Dreamy Jazz: WIP: Prep hCaptcha config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148) (owner: 10Reedy) [12:42:08] (03CR) 10Marostegui: [C:03+2] db2187: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1168167 (https://phabricator.wikimedia.org/T399298) (owner: 10Marostegui) [12:42:45] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2187.codfw.wmnet with reason: Maintenance [12:42:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2187 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78919 and previous config saved to /var/cache/conftool/dbconfig/20250711-124249-marostegui.json [12:49:14] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage [12:49:15] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage [12:49:29] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage [12:50:23] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'db2187 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78921 and previous config saved to /var/cache/conftool/dbconfig/20250711-125022-root.json [12:52:35] (03PS1) 10Btullis: Increase the CPU and memory limits for the spark-history service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168171 (https://phabricator.wikimedia.org/T396617) [12:52:55] !log jmm@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1003.eqiad.wmnet with OS trixie [12:52:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1051.eqiad.wmnet with reason: host reimage [12:53:02] 06SRE, 06Infrastructure-Foundations: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10995233 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin1003 for host sretest1003.eqiad.wmnet with OS trixie completed: - sretest1003 (**P... [12:55:29] (03PS1) 10Muehlenhoff: Remove obsolete Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1168172 [12:56:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1049.eqiad.wmnet with reason: host reimage [12:57:15] 06SRE, 06Infrastructure-Foundations: Prepare our custom installer and the base layer for Trixie - https://phabricator.wikimedia.org/T391083#10995239 (10MoritzMuehlenhoff) Installations with Trixie are now possible, which directly install the backport of Puppet 7, all known issues affecting the Puppet base clas... 
[12:57:46] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [12:57:53] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [13:00:22] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 - https://phabricator.wikimedia.org/T387145#10995251 (10VRiley-WMF) I will look into this. I believe it may be due to lvs1017's nic being misconfigured. I will update it and test it out [13:03:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1050.eqiad.wmnet with reason: host reimage [13:03:56] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) es1034 gradually with 4 steps - Pooling in [13:05:01] (03PS6) 10Ayounsi: WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 [13:05:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2187 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78924 and previous config saved to /var/cache/conftool/dbconfig/20250711-130528-root.json [13:10:17] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1051.eqiad.wmnet with OS bookworm [13:10:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10995338 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1051.eqiad.... 
[13:11:35] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye [13:11:50] (03PS3) 10Ayounsi: magru: add Ufinet transit [homer/public] - 10https://gerrit.wikimedia.org/r/1168122 (https://phabricator.wikimedia.org/T389767) [13:12:57] (03PS5) 10Dreamy Jazz: WIP: Prep hCaptcha config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148) (owner: 10Reedy) [13:14:04] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1049.eqiad.wmnet with OS bookworm [13:14:16] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10995347 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1049.eqiad.... [13:14:56] (03PS1) 10Jcrespo: raid: Do not use the pipe symbol '|' as a separator for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1168176 (https://phabricator.wikimedia.org/T395446) [13:15:10] (03PS2) 10Jcrespo: raid: Do not use the pipe symbol '|' as a separator for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1168176 (https://phabricator.wikimedia.org/T395446) [13:16:50] (03CR) 10Jcrespo: "This is a draft so I do not forget over the weekend. This is (I believe) a bug on raid output." [puppet] - 10https://gerrit.wikimedia.org/r/1168176 (https://phabricator.wikimedia.org/T395446) (owner: 10Jcrespo) [13:17:20] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete Cumin alias [puppet] - 10https://gerrit.wikimedia.org/r/1168172 (owner: 10Muehlenhoff) [13:18:14] (03PS2) 10Gmodena: eventbus: register with team-data-engineering. [alerts] - 10https://gerrit.wikimedia.org/r/1168119 (https://phabricator.wikimedia.org/T398437) [13:18:39] (03PS3) 10Gmodena: eventgate: alert on traffic deviation.
[alerts] - 10https://gerrit.wikimedia.org/r/1167620 (https://phabricator.wikimedia.org/T398437) [13:19:43] (03PS3) 10Jcrespo: raid: Do not use the pipe symbol '|' as a separator for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/1168176 (https://phabricator.wikimedia.org/T395446) [13:20:28] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1050.eqiad.wmnet with OS bookworm [13:20:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2187 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78925 and previous config saved to /var/cache/conftool/dbconfig/20250711-132034-root.json [13:20:41] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware), 13Patch-For-Review: Q4:rack/setup/install cloudcephosd10[48-51] - https://phabricator.wikimedia.org/T394333#10995381 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephosd1050.eqiad.... [13:20:58] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet (Puppet 7.0): Phase out cergen - https://phabricator.wikimedia.org/T357750#10995383 (10hashar) [13:21:39] (03PS1) 10Dreamy Jazz: Enable hCaptcha on test2wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1168178 (https://phabricator.wikimedia.org/T382148) [13:22:17] (03PS6) 10Dreamy Jazz: WIP: Prep hCaptcha config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1148390 (https://phabricator.wikimedia.org/T382148) (owner: 10Reedy) [13:22:30] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10995390 (10Arnoldokoth) Thanks @MoritzMuehlenhoff We'll consider that... But I'm doubtful we "strictly" need to test this on hardware.... 
[13:22:54] 06SRE, 10SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10995393 (10cmooney) >>! In T398686#10995078, @sowmya.guru wrote: > The NDA is signed by me! Thanks! Once we get confirmation it's on file I will get going on the access. [13:23:16] (03PS7) 10Ayounsi: WIP: use Homer to configure the network [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 [13:24:36] FIRING: [2x] ProbeDown: Service install1004:8080 has failed probes (http_squid_ip4) - https://wikitech.wikimedia.org/wiki/HTTP_proxy - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:24:45] PROBLEM - SSH on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:25:53] PROBLEM - Squid on install1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/HTTP_proxy [13:28:21] (03CR) 10Ayounsi: "`configure-switch-interfaces` tested in https://phabricator.wikimedia.org/P78926" [cookbooks] - 10https://gerrit.wikimedia.org/r/1166407 (owner: 10Ayounsi) [13:28:34] FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:28:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:29:50] (03PS1) 10Muehlenhoff: icinga: Use systemd::sysuser to create the metamonitor system user [puppet] - 10https://gerrit.wikimedia.org/r/1168179 [13:34:00] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/1168179 (owner: 10Muehlenhoff) [13:35:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2187 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78927 and previous config saved to /var/cache/conftool/dbconfig/20250711-133539-root.json [13:36:45] FIRING: Traffic bill over quota: Alert for device cr4-ulsfo.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [13:37:54] hmmm [13:39:21] (03PS1) 10Btullis: Use sed to identify any md based swaps during cephosd server reimage [puppet] - 10https://gerrit.wikimedia.org/r/1168181 (https://phabricator.wikimedia.org/T399281) [13:40:30] (03CR) 10Andrew Bogott: [C:03+1] Use sed to identify any md based swaps during cephosd server reimage [puppet] - 10https://gerrit.wikimedia.org/r/1168181 (https://phabricator.wikimedia.org/T399281) (owner: 10Btullis) [13:40:58] (03CR) 10Btullis: [C:03+2] Use sed to identify any md based swaps during cephosd server reimage [puppet] - 10https://gerrit.wikimedia.org/r/1168181 (https://phabricator.wikimedia.org/T399281) (owner: 10Btullis) [13:41:11] (03PS2) 10Muehlenhoff: icinga: Use systemd::sysuser to create the metamonitor system user [puppet] - 10https://gerrit.wikimedia.org/r/1168179 [13:41:33] !log btullis@cumin1003 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1037.eqiad.wmnet with OS bullseye [13:45:25] PROBLEM - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:45:29] PROBLEM - grafana-rw.wikimedia.org requires authentication on grafana1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:48:11] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye [13:48:31] !log btullis@cumin1003 
END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephosd1037.eqiad.wmnet with OS bullseye [13:48:33] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1168179 (owner: 10Muehlenhoff) [13:48:39] RECOVERY - SSH on install1004 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u4 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:48:45] RECOVERY - Squid on install1004 is OK: TCP OK - 0.001 second response time on 208.80.154.74 port 8080 https://wikitech.wikimedia.org/wiki/HTTP_proxy [13:49:15] RECOVERY - grafana-next-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 562 bytes in 0.049 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:49:19] RECOVERY - grafana-rw.wikimedia.org requires authentication on grafana1002 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.053 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [13:49:36] FIRING: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:53:34] RESOLVED: [4x] ProbeDown: Ripe Atlas anchor atlas1001:80 is not returning HTTP 200 OK on port 80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:55:06] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye [13:55:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - 
https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [13:56:45] RESOLVED: Traffic bill over quota: Alert for device cr4-ulsfo.wikimedia.org - Traffic bill over quota - https://alerts.wikimedia.org/?q=alertname%3DTraffic+bill+over+quota [13:58:14] (03PS1) 10Marostegui: db2242: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1168184 (https://phabricator.wikimedia.org/T399298) [13:58:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [13:58:51] (03CR) 10Marostegui: [C:03+2] db2242: Migrate to MariaDB 10.11 [puppet] - 10https://gerrit.wikimedia.org/r/1168184 (https://phabricator.wikimedia.org/T399298) (owner: 10Marostegui) [13:59:15] !log root@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2242.codfw.wmnet with reason: Maintenance [13:59:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2242 for migration to mariadb 10.11', diff saved to https://phabricator.wikimedia.org/P78929 and previous config saved to /var/cache/conftool/dbconfig/20250711-135919-marostegui.json [14:00:00] that's the NTT link (eqsin -> ulsfo) [14:00:05] 09:56:45 <+jinxer-wm> RESOLVED: Traffic bill over quota: Alert for device cr4-ulsfo.wikimedia.org - Traffic bill over quota - [14:00:39] (03PS2) 10Jforrester: Add phan and use it to detect duplicated array keys [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167941 (owner: 10Daimona Eaytoy) [14:03:55] (03CR) 10Jforrester: [C:03+1] "Neat!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167941 (owner: 10Daimona Eaytoy) [14:06:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 25%: 10', diff saved to https://phabricator.wikimedia.org/P78930 and previous config saved to /var/cache/conftool/dbconfig/20250711-140648-root.json [14:09:58] !log sudo swapoff /dev/md1 on cloudcephosd1036 T399281 [14:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:10] T399281: 2025-07-11 Toolforge tools not responding - https://phabricator.wikimedia.org/T399281 [14:13:08] (03CR) 10Cathal Mooney: [C:03+1] magru: add Ufinet transit [homer/public] - 10https://gerrit.wikimedia.org/r/1168122 (https://phabricator.wikimedia.org/T389767) (owner: 10Ayounsi) [14:13:13] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "Nice!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167941 (owner: 10Daimona Eaytoy) [14:13:46] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] Add phan and use it to detect duplicated array keys (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167941 (owner: 10Daimona Eaytoy) [14:13:50] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, and 2 others: Replace Exim on VRTS servers with Postfix - https://phabricator.wikimedia.org/T378028#10995529 (10Dzahn) Thanks all. I am not sure though if the request was for "temp testing setup" or just for "a new system to replace the... 
[14:16:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:21:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [14:21:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 50%: 10', diff saved to https://phabricator.wikimedia.org/P78931 and previous config saved to /var/cache/conftool/dbconfig/20250711-142154-root.json [14:24:00] !log btullis@cumin1003 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1037.eqiad.wmnet with OS bullseye [14:25:29] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [14:25:34] !log btullis@cumin1003 START - Cookbook sre.hosts.reimage for host cloudcephosd1037.eqiad.wmnet with OS bullseye [14:27:56] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [14:37:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 75%: 10', diff saved to https://phabricator.wikimedia.org/P78932 and previous config saved to /var/cache/conftool/dbconfig/20250711-143659-root.json [14:44:29] !log btullis@cumin1003 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1037.eqiad.wmnet with reason: host reimage [14:48:52] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1037.eqiad.wmnet with reason: host reimage [14:52:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2242 (re)pooling @ 100%: 10', diff saved to https://phabricator.wikimedia.org/P78933 
and previous config saved to /var/cache/conftool/dbconfig/20250711-145205-root.json [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:06:49] !log btullis@cumin1003 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1037.eqiad.wmnet with OS bullseye [15:15:13] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [15:16:14] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on aqs1012 - https://phabricator.wikimedia.org/T396970#10995789 (10Eevans) >>! In T396970#10989045, @Eevans wrote: >>>! In T396970#10965457, @VRiley-WMF wrote: >> Is there a time when we can plan for me to look and try to swap at least one of those drives? I'll nee... [15:16:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:16:45] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Netbox: remove old cr2-codfw Switch Control Board inventory items - https://phabricator.wikimedia.org/T398940#10995790 (10RobH) >>! In T398940#10994586, @ayounsi wrote: > We can remove them from Netbox if they're not in the device anym... 
[15:22:26] (03Merged) 10jenkins-bot: Increase the limitranges for the spark-history service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168153 (https://phabricator.wikimedia.org/T396617) (owner: 10Btullis) [15:28:09] !log sukhe@cumin1003 START - Cookbook sre.dns.admin DNS admin: depool site eqsin [reason: testing issues with primary arelion link, T399221] [15:28:09] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: depool site eqsin [reason: testing issues with primary arelion link, T399221] [15:28:09] 10ops-eqiad, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q4:rack/setup/install ml-serve101[2345] - https://phabricator.wikimedia.org/T393948#10995827 (10elukey) @Jclark-ctr I have the feeling that we'll have to pause this work for a bit of time, I'll need to set some time off to figure out what's different a... [15:28:09] T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221 [15:29:03] (03PS1) 10Fabfur: cache::haproxy: add x_analytics log variable to http frontend too [puppet] - 10https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167) [15:32:38] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1168198 (https://phabricator.wikimedia.org/T399167) (owner: 10Fabfur) [15:32:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:35:22] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:36:59] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:37:44] (03PS1) 10Cwhite: logstash: 
convert numerics - remove field removal and tracking [puppet] - 10https://gerrit.wikimedia.org/r/1168201 (https://phabricator.wikimedia.org/T234565) [15:37:45] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:38:00] !log elukey@cumin1003 START - Cookbook sre.hosts.provision for host cp2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:38:15] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:38:43] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1036.eqiad.wmnet with OS bullseye [15:39:37] !log elukey@cumin1003 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host cp2045.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:41:48] FIRING: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [15:43:15] RESOLVED: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [15:43:45] (03CR) 10CI reject: [V:04-1] logstash: convert numerics - remove field removal and tracking [puppet] - 10https://gerrit.wikimedia.org/r/1168201 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:44:16] !log gmodena@deploy1003 helmfile 
[dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [15:44:31] !log gmodena@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mw-content-history-reconcile-enrich-next: apply [15:51:05] (03PS2) 10Cwhite: logstash: convert numerics - remove field removal and tracking [puppet] - 10https://gerrit.wikimedia.org/r/1168201 (https://phabricator.wikimedia.org/T234565) [15:54:03] !log un-drain Arelion CCT from codfw to eqsin T399221 [15:54:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:08] T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221 [15:54:49] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10995939 (10elukey) @Jhancock.wm I tried with 2045 since I wasn't able to log in on 2044, I get the same failures in provisioning: no nics reported. As Riccardo pointed... [15:55:08] (03CR) 10Cwhite: [C:03+2] logstash: convert numerics - remove field removal and tracking [puppet] - 10https://gerrit.wikimedia.org/r/1168201 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [15:56:01] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168209 [15:56:41] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1036.eqiad.wmnet with OS bullseye [15:56:48] RESOLVED: PuppetZeroResources: Puppet has failed generate resources on wdqs2025:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [16:06:59] 10ops-codfw, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4:rack/setup/install cp20[43-58] codfw - https://phabricator.wikimedia.org/T392851#10996028 (10wiki_willy) Hi @elukey - can you or @Volans send me an email summarizing everything you need from Dell? 
I'll add the Technical Account Rep to the email thre... [16:16:04] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1036.eqiad.wmnet with reason: host reimage [16:19:53] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1036.eqiad.wmnet with reason: host reimage [16:21:24] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168214 [16:22:05] (03CR) 10Dzahn: [V:03+1 C:03+2] "thanks! no diff in compiler https://puppet-compiler.wmflabs.org/output/1129920/6245/" [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [16:22:26] (03Abandoned) 10Dbrant: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1168209 (owner: 10PipelineBot) [16:29:31] (03PS1) 10Dzahn: gerrit: also rename "passive" to "spare" server in MOTD [puppet] - 10https://gerrit.wikimedia.org/r/1168216 (https://phabricator.wikimedia.org/T387833) [16:30:01] (03CR) 10Dzahn: [V:03+1 C:03+2] "noop in prod confirmed" [puppet] - 10https://gerrit.wikimedia.org/r/1129920 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [16:31:55] !log drain Arelion CCT from codfw to eqsin - still see minor packet loss which is affecting purged T399221 [16:31:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:31:59] T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221 [16:38:08] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1036.eqiad.wmnet with OS bullseye [16:39:47] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1035.eqiad.wmnet with OS bullseye [16:51:07] !log andrew@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cloudcephosd1035.eqiad.wmnet with OS bullseye [16:51:47] !log andrew@cumin2002 START - Cookbook 
sre.hosts.reimage for host cloudcephosd1035.eqiad.wmnet with OS bullseye [17:10:17] !log sukhe@cumin1003 START - Cookbook sre.dns.admin DNS admin: pool site eqsin [reason: done testing issues with primary arelion link, T399221] [17:10:21] T399221: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221 [17:10:22] !log sukhe@cumin1003 END (PASS) - Cookbook sre.dns.admin (exit_code=0) DNS admin: pool site eqsin [reason: done testing issues with primary arelion link, T399221] [17:11:01] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1035.eqiad.wmnet with reason: host reimage [17:14:03] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1035.eqiad.wmnet with reason: host reimage [17:16:43] (03Abandoned) 10Cwhite: add docs for string_to_numeric_conversion_failure [software/ecs] - 10https://gerrit.wikimedia.org/r/1166008 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [17:22:58] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10996311 (10VRiley-WMF) @Marostegui thanks! I will be installing this as a "new" unit of db1259 [17:23:14] (03CR) 10Krinkle: "Yes." [puppet] - 10https://gerrit.wikimedia.org/r/1167266 (https://phabricator.wikimedia.org/T289318) (owner: 10Krinkle) [17:26:45] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10996337 (10Marostegui) >>! In T393296#10996311, @VRiley-WMF wrote: > @Marostegui thanks! 
I will be installing this as a "new" unit of db1259 <3 [17:27:09] (03CR) 10Daimona Eaytoy: Add phan and use it to detect duplicated array keys (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1167941 (owner: 10Daimona Eaytoy) [17:28:07] (03CR) 10Cwhite: [C:03+2] logstash: remove filter_on_templates v1 [puppet] - 10https://gerrit.wikimedia.org/r/1167942 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite) [17:28:26] (03PS2) 10Cwhite: logstash: rename filter-on-templates.rb [puppet] - 10https://gerrit.wikimedia.org/r/1167943 (https://phabricator.wikimedia.org/T234565) [17:32:32] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1035.eqiad.wmnet with OS bullseye [17:39:03] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1006.eqiad.wmnet with OS bullseye [17:44:18] (03CR) 10Dzahn: [C:03+2] gerrit: also rename "passive" to "spare" server in MOTD [puppet] - 10https://gerrit.wikimedia.org/r/1168216 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [17:44:24] (03PS2) 10Dzahn: gerrit: also rename "passive" to "spare" server in MOTD [puppet] - 10https://gerrit.wikimedia.org/r/1168216 (https://phabricator.wikimedia.org/T387833) [17:45:12] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1007.eqiad.wmnet with OS bullseye [17:48:24] !log andrew@cumin2002 START - Cookbook sre.hosts.reimage for host cloudcephosd1008.eqiad.wmnet with OS bullseye [17:48:34] (03CR) 10Dzahn: [C:03+2] gerrit: also rename "passive" to "spare" server in MOTD [puppet] - 10https://gerrit.wikimedia.org/r/1168216 (https://phabricator.wikimedia.org/T387833) (owner: 10Dzahn) [17:54:33] (03PS1) 10Andrew Bogott: Revert "cloudceph osd.yaml: update some nic names for Bookworm reimages" [puppet] - 10https://gerrit.wikimedia.org/r/1168227 (https://phabricator.wikimedia.org/T399281) [17:55:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - 
cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:57:04] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1006.eqiad.wmnet with reason: host reimage [18:03:27] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1006.eqiad.wmnet with reason: host reimage [18:03:34] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1007.eqiad.wmnet with reason: host reimage [18:05:27] FIRING: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:06:55] !log andrew@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephosd1008.eqiad.wmnet with reason: host reimage [18:07:01] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1007.eqiad.wmnet with reason: host reimage [18:07:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:09:40] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephosd1008.eqiad.wmnet with reason: host reimage [18:10:27] RESOLVED: [2x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - 
https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:12:58] RESOLVED: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2009:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [18:13:38] (CR) Andrew Bogott: [C:+2] Revert "cloudceph osd.yaml: update some nic names for Bookworm reimages" [puppet] - https://gerrit.wikimedia.org/r/1168227 (https://phabricator.wikimedia.org/T399281) (owner: Andrew Bogott) [18:20:03] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1006.eqiad.wmnet with OS bullseye [18:23:11] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1007.eqiad.wmnet with OS bullseye [18:24:45] !log andrew@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephosd1008.eqiad.wmnet with OS bullseye [18:36:38] (PS2) Dreamy Jazz: Enable hCaptcha on test2wiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1168178 (https://phabricator.wikimedia.org/T382148) [18:49:53] PROBLEM - Host cloudnet2006-dev is DOWN: PING CRITICAL - Packet loss = 100% [18:52:04] SRE, SRE-Access-Requests: Add Sowmya Guru to list of "WMDE group" approvers on Wikitech - https://phabricator.wikimedia.org/T398686#10996416 (KFrancis) Hello all, I am confirming the NDA is fully signed. Thanks!
[18:52:23] RECOVERY - Host cloudnet2006-dev is UP: PING OK - Packet loss = 0%, RTA = 33.27 ms [19:10:44] (PS1) Cwhite: Revert "logstash: remove event.duration when value is hyphen" [puppet] - https://gerrit.wikimedia.org/r/1168234 [19:11:07] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10996452 (VRiley-WMF) Just to verify with you @Marostegui the server is now in netbox. However, this seed server only has a single 1.92TB drive, while the other server has ten 1.92 drives. Is it safe... [19:15:13] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [19:28:32] (PS1) Dreamy Jazz: Document Trust and Safety Product Team database tables [puppet] - https://gerrit.wikimedia.org/r/1168235 (https://phabricator.wikimedia.org/T399302) [19:33:04] (CR) Dreamy Jazz: Document Trust and Safety Product Team database tables (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1168235 (https://phabricator.wikimedia.org/T399302) (owner: Dreamy Jazz) [19:34:00] (PS2) Dreamy Jazz: mariadb: Document Trust and Safety Product Team database tables [puppet] - https://gerrit.wikimedia.org/r/1168235 (https://phabricator.wikimedia.org/T399302) [19:36:12] (CR) Dreamy Jazz: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1168235 (https://phabricator.wikimedia.org/T399302) (owner: Dreamy Jazz) [20:00:27] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10996515 (Marostegui) Yes, absolutely!
Go for it [21:07:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:08:39] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:10:13] PROBLEM - Postfix SMTP on crm2001 is CRITICAL: CRITICAL - Certificate crm2001.codfw.wmnet expires in 15 day(s) (Sun 27 Jul 2025 09:10:00 PM GMT +0000). https://wikitech.wikimedia.org/wiki/Mail%23Troubleshooting [21:18:56] (CR) BryanDavis: "Cause of T399216. The `hieradata/common/profile/*` files are not loaded by any codepath for a Cloud VPS instance as far as I can tell." [labs/private] - https://gerrit.wikimedia.org/r/1155221 (https://phabricator.wikimedia.org/T397841) (owner: Kamila Součková) [21:43:23] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [21:43:56] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10996704 (VRiley-WMF) Provisioning now...
[21:46:36] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt db1259 - vriley@cumin1002" [21:46:41] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt db1259 - vriley@cumin1002" [21:46:41] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [21:48:17] !log vriley@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host db1259 [21:48:45] FIRING: WidespreadPuppetFailure: Puppet has failed in eqiad - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=6 - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure [21:49:03] ops-codfw, DC-Ops, Infrastructure-Foundations, netops, Traffic: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#10996713 (cmooney) Arelion came back to say they did move a path but that they see CRC errors inbound from us in codfw: ` 2025-07-11 19:48 Hello Team, We ha...
[21:49:14] ops-codfw, DC-Ops, Infrastructure-Foundations, netops, Traffic: eqsin purged consumers lag - https://phabricator.wikimedia.org/T399221#10996717 (cmooney) p:Triage→High [21:49:30] !log vriley@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host db1259 [21:50:23] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host db1259.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [21:55:06] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:et-1/0/2 (Transport: cr1-eqiad:et-1/1/2 (Arelion, IC-374549) {#12267}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:07:25] RESOLVED: SystemdUnitFailed: httpbb_kubernetes_mw-api-ext_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:08:39] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:10:27] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db1259.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [22:26:23] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host db1259.eqiad.wmnet with OS bookworm [22:26:32] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again - https://phabricator.wikimedia.org/T393296#10996761 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host db1259.eqiad.wmnet with OS bookworm [22:47:49] ops-eqiad, SRE, DBA, DC-Ops: db1246 crashed yet again -
https://phabricator.wikimedia.org/T393296#10996821 (VRiley-WMF) Before proceeding with the imaging, I wanted to make sure, it's okay for me to wipe these drives, correct? I think that's why it may fail on the reimage [23:15:13] FIRING: [21x] CertAlmostExpired: Certificate for service asw1-b3-magru.mgmt.magru.wmnet:32767 is about to expire - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [23:16:10] vriley@cumin1002 reimage (PID 2981850) is awaiting input [23:38:17] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1168275 [23:38:17] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1168275 (owner: TrainBranchBot) [23:50:36] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1168275 (owner: TrainBranchBot) [23:57:00] (CR) Ladsgroup: "If you don't set Hosts: footer, the check experimental trigger PCC on all production hosts which is an extremely expensive operation and s" [puppet] - https://gerrit.wikimedia.org/r/1168235 (https://phabricator.wikimedia.org/T399302) (owner: Dreamy Jazz)