[00:08:43] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1132781
[00:08:43] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1132781 (owner: TrainBranchBot)
[00:27:01] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1132781 (owner: TrainBranchBot)
[00:34:37] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10696537 (phaultfinder)
[00:55:36] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10696566 (phaultfinder)
[01:08:26] (PS1) TrainBranchBot: Branch commit for wmf/1.44.0-wmf.23 [core] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1132787 (https://phabricator.wikimedia.org/T386218)
[01:08:28] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/1.44.0-wmf.23 [core] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1132787 (https://phabricator.wikimedia.org/T386218) (owner: TrainBranchBot)
[01:20:48] (Merged) jenkins-bot: Branch commit for wmf/1.44.0-wmf.23 [core] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1132787 (https://phabricator.wikimedia.org/T386218) (owner: TrainBranchBot)
[01:23:54] ops-codfw, SRE, DBA, DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10696611 (Papaul) @Marostegui thank you for letting me use db2243 for testing the disk replacing process. I did some testing on my end and I was able to get the disk back in th...
[02:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T0200)
[02:03:26] FIRING: [4x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:47:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[02:59:38] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10697050 (phaultfinder)
[03:00:04] Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T0300)
[03:01:38] (PS1) TrainBranchBot: testwikis to 1.44.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1132794 (https://phabricator.wikimedia.org/T386218)
[03:01:40] (CR) TrainBranchBot: [C:+2] testwikis to 1.44.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1132794 (https://phabricator.wikimedia.org/T386218) (owner: TrainBranchBot)
[03:02:27] (Merged) jenkins-bot: testwikis to 1.44.0-wmf.23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1132794 (https://phabricator.wikimedia.org/T386218) (owner: TrainBranchBot)
[03:02:50] !log mwpresync@deploy1003 Started scap sync-world: testwikis to 1.44.0-wmf.23 refs T386218
[03:02:52] T386218: 1.44.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T386218
[03:06:25] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:07:02] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[03:28:39] FIRING: [2x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[03:30:21] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[04:00:05] Deploy window Automatic removal of all obsolete MediaWiki versions from the deployment and bare metal servers (except the most-recent obsolete version) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T0400)
[04:00:42] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10697102 (phaultfinder)
[04:04:44] !log mwpresync@deploy1003 Pruned MediaWiki: 1.44.0-wmf.20 (duration: 04m 34s)
[04:44:41] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10697114 (phaultfinder)
[04:55:39] (PS1) Kevin Bazira: ml-services: update article-country image [deployment-charts] - https://gerrit.wikimedia.org/r/1132804 (https://phabricator.wikimedia.org/T375821)
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:14:38] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10697122 (phaultfinder)
[05:14:50] !log [WDQS Deploy] Gearing up for deploy of wdqs `0.3.155`. Pre-deploy tests passing on canary `wdqs1016`
[05:14:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:18:09] (CR) Fabfur: [C:+1] hieradata: move profile::acme_chief::certificates to profile [puppet] - https://gerrit.wikimedia.org/r/1131270 (owner: Filippo Giunchedi)
[05:20:51] !log ryankemper@deploy1003 Started deploy [wdqs/wdqs@557a834]: 0.3.155
[05:22:40] !log [WDQS Deploy] Tests passing following deploy of `0.3.155` on canary `wdqs1015`; proceeding to rest of fleet
[05:22:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:29:12] ops-codfw, SRE, DBA, DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10697124 (Marostegui) Thank you @Papaul for the great investigations! My hopes are that the following command won't be needed when we will be dealing with a real failure: ` Cf...
[05:33:41] !log ryankemper@deploy1003 Finished deploy [wdqs/wdqs@557a834]: 0.3.155 (duration: 12m 49s)
[05:44:39] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10697130 (phaultfinder)
[05:52:07] (CR) Ayounsi: [C:+2] gNMIc: collect BFD states [puppet] - https://gerrit.wikimedia.org/r/1131320 (https://phabricator.wikimedia.org/T388641) (owner: Ayounsi)
[05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:59:37] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10697152 (phaultfinder)
[06:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T0600)
[06:00:05] marostegui, Amir1, and federico3: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Primary database switchover . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T0600).
[06:02:44] FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[06:03:26] FIRING: [4x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[06:03:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[06:06:12] ^ looking
[06:08:50] Depooled wdqs2016 and wdqs2017; they are the only two servers above 10 min lag
[06:08:58] (PS1) Marostegui: mariadb: Productionize db1257 [puppet] - https://gerrit.wikimedia.org/r/1132815 (https://phabricator.wikimedia.org/T381475)
[06:09:39] (CR) Marostegui: [C:+2] mariadb: Productionize db1257 [puppet] - https://gerrit.wikimedia.org/r/1132815 (https://phabricator.wikimedia.org/T381475) (owner: Marostegui)
[06:13:30] FIRING: Emergency syslog message: Alert for device asw1-b13-drmrs.mgmt.drmrs.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[06:13:39] FIRING: [3x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:14:07] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db1211.eqiad.wmnet onto db1257.eqiad.wmnet
[06:15:21] (CR) Marostegui: "I am testing this now" [cookbooks] - https://gerrit.wikimedia.org/r/1132618 (https://phabricator.wikimedia.org/T390217) (owner: Federico Ceratto)
[06:18:31] RESOLVED: Emergency syslog message: Device asw1-b13-drmrs.mgmt.drmrs.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[06:18:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[06:21:00] Still seeing a big spike in queries, not positive if related to deploy or not. Gut says unrelated though. Lag alert resolved for now with the two hosts depooled
[06:24:35] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10697197 (phaultfinder)
[06:27:58] (CR) Joal: "LGTM! This needs to be synchronized with https://gerrit.wikimedia.org/r/c/analytics/refinery/+/1132031" [puppet] - https://gerrit.wikimedia.org/r/1132756 (https://phabricator.wikimedia.org/T390247) (owner: Aleksandar Mastilovic)
[06:28:02] (CR) Joal: [C:+1] Upgrade the Gobblin JAR version to 1.0.6 [puppet] - https://gerrit.wikimedia.org/r/1132756 (https://phabricator.wikimedia.org/T390247) (owner: Aleksandar Mastilovic)
[06:28:39] RESOLVED: [3x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown
[06:30:21] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[06:32:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[06:34:37] (CR) Ayounsi: [C:+2] gNMIc: start collecting metrics from fasw, ignore asw1-eqsin VC [puppet] - https://gerrit.wikimedia.org/r/1132675 (https://phabricator.wikimedia.org/T390052) (owner: Ayounsi)
[06:40:58] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[06:44:27] (CR) Ayounsi: [C:+2] sre.network.tls: allow running it on more types [cookbooks] - https://gerrit.wikimedia.org/r/1132671 (https://phabricator.wikimedia.org/T390052) (owner: Ayounsi)
[06:50:00] (PS1) Filippo Giunchedi: add trixie-wikimedia to apt [puppet] - https://gerrit.wikimedia.org/r/1132991
[06:50:20] (CR) Filippo Giunchedi: [C:+2] alertmanager: open dcops tasks with title as summary [puppet] - https://gerrit.wikimedia.org/r/1132637 (https://phabricator.wikimedia.org/T388641) (owner: Filippo Giunchedi)
[06:50:39] (Merged) jenkins-bot: sre.network.tls: allow running it on more types [cookbooks] - https://gerrit.wikimedia.org/r/1132671 (https://phabricator.wikimedia.org/T390052) (owner: Ayounsi)
[06:52:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[06:52:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[06:53:40] ops-codfw, DC-Ops: Unresponsive management for maps2009.mgmt:22 - https://phabricator.wikimedia.org/T390659 (phaultfinder) NEW
[06:53:41] ops-codfw, DC-Ops: Unresponsive management for cp2035.mgmt:22 - https://phabricator.wikimedia.org/T390658 (phaultfinder) NEW
[06:53:42] ops-codfw, DC-Ops: Outbound errors on interface cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://phabricator.wikimedia.org/T390660 (phaultfinder) NEW
[06:55:02] that's expected ^
[06:56:51] XioNoX: ^ now open in the correct project
[06:58:11] jouncebot: refresh
[06:58:12] I refreshed my knowledge about deployments.
[06:58:15] jouncebot: nowandnext
[06:58:15] For the next 0 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T0600)
[06:58:15] In 0 hour(s) and 1 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T0700)
[06:58:46] godog: nice!
[06:58:47] there are no changes planned :]
[06:59:33] godog: and seems like a real issue https://grafana.wikimedia.org/d/5p97dAASz/network-device-queue-and-error-stats?orgId=1&var-site=ulsfo%20prometheus%2Fops&var-device=cr4-ulsfo:9804&var-interface=xe-0%2F1%2F1&from=now-24h&to=now&viewPanel=43
[07:00:04] Amir1, Urbanecm, and awight: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T0700).
[07:00:04] No Gerrit patches in the queue for this window AFAICS.
[07:00:16] (CR) Muehlenhoff: "Looks good, a few comments inline" [puppet] - https://gerrit.wikimedia.org/r/1132991 (owner: Filippo Giunchedi)
[07:00:49] (PS1) PipelineBot: citoid: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1132993
[07:01:30] ops-codfw, DC-Ops: Unresponsive management for cp2035.mgmt:22 - https://phabricator.wikimedia.org/T390658#10697276 (fgiunchedi) @Dzahn @Jclark-ctr @Jhancock.wm @Papaul @VRiley-WMF this is how the management alerts for example look like Note that there might be duplicates with the old titles during the t...
[07:01:47] XioNoX: indeed! nice
[07:03:48] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[07:04:30] FIRING: Emergency syslog message: Alert for device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[07:05:33] (CR) Filippo Giunchedi: add trixie-wikimedia to apt (4 comments) [puppet] - https://gerrit.wikimedia.org/r/1132991 (owner: Filippo Giunchedi)
[07:05:36] (PS2) Filippo Giunchedi: add trixie-wikimedia to apt [puppet] - https://gerrit.wikimedia.org/r/1132991
[07:06:29] (PS1) Muehlenhoff: Add raid5-4dev.cfg Partman recipe [puppet] - https://gerrit.wikimedia.org/r/1132995 (https://phabricator.wikimedia.org/T156955)
[07:06:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[07:06:55] (PS1) Kosta Harlan: EmailAuth: Enable "enforce" mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1132996 (https://phabricator.wikimedia.org/T390662)
[07:07:00] (CR) CI reject: [V:-1] Add raid5-4dev.cfg Partman recipe [puppet] - https://gerrit.wikimedia.org/r/1132995 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[07:08:35] RESOLVED: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[07:08:39] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:09:30] RESOLVED: Emergency syslog message: Device cloudsw1-c8-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message
[07:10:21] (PS2) Muehlenhoff: Add raid5-4dev.cfg Partman recipe [puppet] - https://gerrit.wikimedia.org/r/1132995 (https://phabricator.wikimedia.org/T156955)
[07:11:31] (CR) Slyngshede: [C:+1] "Makes sense to me." [software/bitu] - https://gerrit.wikimedia.org/r/1131452 (owner: Hashar)
[07:12:00] (CR) Jelto: miscweb: os-report: use puppetdb from external_services (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1131952 (https://phabricator.wikimedia.org/T350794) (owner: Jelto)
[07:13:34] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[07:13:51] (CR) Slyngshede: [C:+1] "Very nice, much more readable as well." [software/bitu] - https://gerrit.wikimedia.org/r/1131471 (owner: Hashar)
[07:16:10] (CR) Slyngshede: [C:+1] "LGTM" [software/bitu] - https://gerrit.wikimedia.org/r/1131453 (owner: Hashar)
[07:16:39] (CR) Slyngshede: [C:+1] "Cool" [software/bitu] - https://gerrit.wikimedia.org/r/1131455 (owner: Hashar)
[07:17:22] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device fasw2-c1b-eqiad
[07:17:37] (CR) Ilias Sarantopoulos: [C:+1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/1132804 (https://phabricator.wikimedia.org/T375821) (owner: Kevin Bazira)
[07:18:34] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[07:19:37] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device fasw2-c1b-eqiad
[07:19:39] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10697312 (phaultfinder)
[07:19:57] (CR) Alexandros Kosiaris: [C:+2] wikifunctions: Add group{0,1,2} releases [deployment-charts] - https://gerrit.wikimedia.org/r/1132684 (https://phabricator.wikimedia.org/T384944) (owner: Alexandros Kosiaris)
[07:19:57] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device mr1-ulsfo
[07:20:48] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4007.ulsfo.wmnet with OS bookworm
[07:20:58] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10697313 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti4007.ulsfo.wmnet with OS bookworm
[07:21:09] (CR) Kevin Bazira: [C:+2] ml-services: update article-country image [deployment-charts] - https://gerrit.wikimedia.org/r/1132804 (https://phabricator.wikimedia.org/T375821) (owner: Kevin Bazira)
[07:21:25] (Merged) jenkins-bot: wikifunctions: Add group{0,1,2} releases [deployment-charts] - https://gerrit.wikimedia.org/r/1132684 (https://phabricator.wikimedia.org/T384944) (owner: Alexandros Kosiaris)
[07:22:38] (Merged) jenkins-bot: ml-services: update article-country image [deployment-charts] - https://gerrit.wikimedia.org/r/1132804 (https://phabricator.wikimedia.org/T375821) (owner: Kevin Bazira)
[07:23:31] (PS1) Muehlenhoff: maps-test: Back to insetup [puppet] - https://gerrit.wikimedia.org/r/1133056
[07:23:35] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[07:24:12] (PS6) Hashar: Simplify invocation of clients integrations [software/bitu] - https://gerrit.wikimedia.org/r/1131460
[07:24:12] (PS3) Hashar: Fix handling of status code in Gerrit integration [software/bitu] - https://gerrit.wikimedia.org/r/1131471
[07:24:12] (PS2) Hashar: Add a basic test for user_block in LDAP [software/bitu] - https://gerrit.wikimedia.org/r/1132019
[07:24:19] !log kevinbazira@deploy1003 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[07:25:26] (PS1) Volans: Adapt calls for Spicerack v10.0.0 [cookbooks] - https://gerrit.wikimedia.org/r/1133057
[07:26:10] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device mr1-ulsfo
[07:28:27] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device mr1-ulsfo
[07:28:28] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.network.tls (exit_code=97) for network device mr1-ulsfo
[07:30:48] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device mr1-ulsfo
[07:30:52] !log kevinbazira@deploy1003 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-models' for release 'main' .
[07:31:00] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.network.tls (exit_code=97) for network device mr1-ulsfo
[07:33:44] (PS1) Ayounsi: Allow gNMI to mgmt routers control-plane [homer/public] - https://gerrit.wikimedia.org/r/1133058 (https://phabricator.wikimedia.org/T390052)
[07:34:48] !log kevinbazira@deploy1003 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-models' for release 'main' .
[07:35:09] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device mr1-codfw
[07:37:28] (PS1) Ayounsi: Enable gNMI on management routers [homer/public] - https://gerrit.wikimedia.org/r/1133060 (https://phabricator.wikimedia.org/T390052)
[07:37:55] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ganeti4007.ulsfo.wmnet with reason: host reimage
[07:41:11] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device mr1-codfw
[07:41:27] (CR) Slyngshede: [C:+2] P:idp Limit groups sent from CAS to Spiderpig (redo) [puppet] - https://gerrit.wikimedia.org/r/1131975 (https://phabricator.wikimedia.org/T389869) (owner: Ahmon Dancy)
[07:41:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4007.ulsfo.wmnet with reason: host reimage
[07:43:40] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[07:44:31] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device mr1-magru
[07:45:37] (CR) Jgiannelos: [C:+2] mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1132683 (owner: PipelineBot)
[07:46:02] (CR) Slyngshede: [C:+1] "Looks good." [puppet] - https://gerrit.wikimedia.org/r/1133056 (owner: Muehlenhoff)
[07:46:44] !log jgiannelos@deploy1003 helmfile [staging] START helmfile.d/services/mobileapps: apply
[07:47:08] !log jgiannelos@deploy1003 helmfile [staging] DONE helmfile.d/services/mobileapps: apply
[07:48:34] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[07:49:29] (PS1) Muehlenhoff: Update approvers for airflow-ml-ops [puppet] - https://gerrit.wikimedia.org/r/1133063
[07:50:24] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10697392 (Jelto) >>! In T386904#10694216, @Ben.buchenau wrote: > Thanks. Fixed my SSH login issue locally, as I had a typo in my `.ssh/config` which made the connection...
[07:50:42] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device mr1-magru
[07:52:20] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device mr1-drmrs
[07:53:32] (CR) Muehlenhoff: [C:+2] maps-test: Back to insetup [puppet] - https://gerrit.wikimedia.org/r/1133056 (owner: Muehlenhoff)
[07:55:45] FIRING: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[07:58:32] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device mr1-drmrs
[07:59:38] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10697403 (phaultfinder)
[07:59:44] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device mr1-esams
[07:59:49] !log jgiannelos@deploy1003 helmfile [codfw] START helmfile.d/services/mobileapps: apply
[08:00:28] !log jgiannelos@deploy1003 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply
[08:00:39] (CR) Elukey: [C:+1] Adapt calls for Spicerack v10.0.0 [cookbooks] - https://gerrit.wikimedia.org/r/1133057 (owner: Volans)
[08:00:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4007.ulsfo.wmnet with OS bookworm
[08:00:45] RESOLVED: [3x] CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[08:00:53] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10697405 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti4007.ulsfo.wmnet with OS bookworm completed: - ganeti4007 (**PASS*...
[08:02:38] (CR) Elukey: "I trust your experience with partman, looks good from a quick pass :D" [puppet] - https://gerrit.wikimedia.org/r/1132995 (https://phabricator.wikimedia.org/T156955) (owner: Muehlenhoff)
[08:03:29] !log jgiannelos@deploy1003 helmfile [eqiad] START helmfile.d/services/mobileapps: apply
[08:03:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[08:03:51] (CR) Brouberol: [C:+2] wdqs: fix monitoring user-agents [puppet] - https://gerrit.wikimedia.org/r/1131940 (owner: DCausse)
[08:04:26] !log jgiannelos@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply
[08:04:40] (CR) Alexandros Kosiaris: [C:+2] "I'll merge this in the interest of not having it linger again. I'll also deploy, it's a noop, with the exception of some telemetry relat" [deployment-charts] - https://gerrit.wikimedia.org/r/1131786 (owner: Alexandros Kosiaris)
[08:05:11] !log restarting blazegraph on wdqs2016
[08:05:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:05:50] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device mr1-esams
[08:06:14] (Merged) jenkins-bot: cxserver: Bump all sextant modules [deployment-charts] - https://gerrit.wikimedia.org/r/1131786 (owner: Alexandros Kosiaris)
[08:07:43] FIRING: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[08:08:36] (CR) Muehlenhoff: [C:+1] "Looks good!" [puppet] - https://gerrit.wikimedia.org/r/1132991 (owner: Filippo Giunchedi)
[08:11:04] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/cxserver: apply
[08:11:24] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[08:11:27] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device mr1-eqsin
[08:12:34] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/cxserver: apply
[08:12:43] RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS
[08:12:59] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[08:13:37] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[08:14:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4007.ulsfo.wmnet
[08:14:38] !log T390665: restart blazegraph on wdqs2017
[08:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:40] T390665: wdqs2016 and 2017 not consuming updates - https://phabricator.wikimedia.org/T390665
[08:16:31] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[08:17:03] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[08:17:56] (CR) Vgutierrez: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5183/console" [puppet] - https://gerrit.wikimedia.org/r/1131270 (owner: Filippo Giunchedi)
[08:18:03] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device mr1-eqsin
[08:18:18] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device mr1-eqiad
[08:19:09] (PS1) Alexandros Kosiaris: typos: Add wnmet as a typo [mediawiki-config] - https://gerrit.wikimedia.org/r/1133069
[08:19:40] (CR) Vgutierrez: [V:+1 C:+1] hieradata: move profile::acme_chief::certificates to profile [puppet] - https://gerrit.wikimedia.org/r/1131270 (owner: Filippo Giunchedi)
[08:20:35] (PS1) Elukey: docker_registry_ha: set debug logging for nginx [puppet] - https://gerrit.wikimedia.org/r/1133070 (https://phabricator.wikimedia.org/T390251)
[08:20:59] Puppet, Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10697436 (Joe) >>! In T389932#10694961, @jhathaway wrote: > One issue with using just the FQDN is that is breaks tools which rely on matching other hostnames, for instanc...
[08:22:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4007.ulsfo.wmnet [08:24:06] (03CR) 10Alexandros Kosiaris: [C:03+1] docker_registry_ha: set debug logging for nginx [puppet] - 10https://gerrit.wikimedia.org/r/1133070 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [08:24:08] (03CR) 10Elukey: [C:03+2] docker_registry_ha: set debug logging for nginx [puppet] - 10https://gerrit.wikimedia.org/r/1133070 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [08:24:10] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device mr1-eqiad [08:27:08] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device msw2-codfw [08:28:43] FIRING: [2x] ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [08:28:46] 07Puppet, 06Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10697450 (10Joe) Alternatively, we can ofc remove the TLD from the matching expression in the pseudocode I posted. I don't think that having long regexes is really feasible... 
[08:29:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4007.ulsfo.wmnet to cluster ulsfo and group 1 [08:29:28] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device msw2-codfw [08:29:32] !log set debug logging for registry*'s nginx - T390251 [08:29:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:29:35] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [08:29:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10697452 (10phaultfinder) [08:31:51] FIRING: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:32:07] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device msw1-codfw [08:33:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti4007.ulsfo.wmnet to cluster ulsfo and group 1 [08:34:21] !log jmm@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ganeti4007 [08:34:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ganeti4007 [08:35:13] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device msw1-codfw [08:35:50] !log temporary disable puppet on cumin1002 for the spicerack upgrade to v10.0.0 [08:35:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:04] !log failover ganeti master in ulsfo to ganeti4005 T382511 [08:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:36:07] T382511: Update 
Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511 [08:36:40] (03CR) 10Volans: [C:03+2] Adapt calls for Spicerack v10.0.0 [cookbooks] - 10https://gerrit.wikimedia.org/r/1133057 (owner: 10Volans) [08:36:46] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device msw2-eqiad [08:36:51] RESOLVED: SwaggerProbeHasFailures: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://cxserver.svc.eqiad.wmnet:4002 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [08:38:40] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [08:38:41] !log marostegui@cumin1002 START - Cookbook sre.mysql.pool db1211 slowly with 10 steps - Pool db1211.eqiad.wmnet in after cloning [08:39:49] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device msw2-eqiad [08:42:43] FIRING: BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2013:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:43:13] (03Merged) 10jenkins-bot: Adapt calls for Spicerack v10.0.0 [cookbooks] - 10https://gerrit.wikimedia.org/r/1133057 (owner: 10Volans) [08:43:37] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device msw1-eqiad [08:44:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device 
ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10697478 (10phaultfinder) [08:45:06] !log upgrading spicerack to v10.0.0 on cumin2002 [08:45:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:05] !log dcausse@deploy1003 Started deploy [wdqs/wdqs@354b5ac]: revert T326311, deletion query way too slow [08:46:07] !log Drain Lumen cct from codfw to ulsfo due to instability T390660 [08:46:08] T326311: Deletion of Lexemes appears to leak triples related to its forms and senses - https://phabricator.wikimedia.org/T326311 [08:46:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:10] T390660: Outbound errors on interface cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://phabricator.wikimedia.org/T390660 [08:46:40] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.tls (exit_code=0) for network device msw1-eqiad [08:47:43] FIRING: [5x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:48:31] (03CR) 10Muehlenhoff: "*phew*" [puppet] - 10https://gerrit.wikimedia.org/r/1132995 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [08:48:33] (03CR) 10Muehlenhoff: [C:03+2] Add raid5-4dev.cfg Partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1132995 (https://phabricator.wikimedia.org/T156955) (owner: 10Muehlenhoff) [08:50:13] !log hashar@deploy1003 Started deploy [integration/docroot@5256e19]: build: Updating eslint-config-wikimedia to 0.29.1 [08:50:20] (03CR) 10Filippo Giunchedi: [C:03+2] add trixie-wikimedia to apt [puppet] - 
10https://gerrit.wikimedia.org/r/1132991 (owner: 10Filippo Giunchedi) [08:50:22] !log hashar@deploy1003 Finished deploy [integration/docroot@5256e19]: build: Updating eslint-config-wikimedia to 0.29.1 (duration: 00m 09s) [08:52:16] (03PS1) 10Ayounsi: Enable gNMI on management switches [homer/public] - 10https://gerrit.wikimedia.org/r/1133072 (https://phabricator.wikimedia.org/T390052) [08:52:43] FIRING: [7x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1018:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:54:27] FIRING: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:54:56] 07Puppet, 06Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10697487 (10FCeratto-WMF) FWIW I would suggest prioritizing readability and safety for **prod** configuration, and git-diff friendliness. When converting the existing confi... 
[08:56:29] (03PS1) 10Muehlenhoff: maps/test: Adapt partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1133073 (https://phabricator.wikimedia.org/T381565) [08:56:54] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet [08:57:34] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10697491 (10ops-monitoring-bot) Draining ganeti4008.ulsfo.wmnet of running VMs [08:57:43] RESOLVED: [7x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1018:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [08:58:21] !log dcausse@deploy1003 Finished deploy [wdqs/wdqs@354b5ac]: revert T326311, deletion query way too slow (duration: 12m 15s) [08:58:23] T326311: Deletion of Lexemes appears to leak triples related to its forms and senses - https://phabricator.wikimedia.org/T326311 [08:59:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet [08:59:27] RESOLVED: [2x] ProbeDown: Service wdqs2012:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2012:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:00:09] 10ops-codfw, 06SRE, 06DC-Ops: Outbound errors on interface cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://phabricator.wikimedia.org/T390660#10697500 (10cmooney) This port, and the circuit to cr4-ulsfo that it's connected to, was part of a larger outage we e... 
[09:00:26] jouncebot: next [09:00:26] In 0 hour(s) and 59 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T1000) [09:00:33] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti4008.ulsfo.wmnet [09:00:47] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10697511 (10ops-monitoring-bot) Draining ganeti4008.ulsfo.wmnet of running VMs [09:03:30] (03CR) 10Cathal Mooney: [C:03+1] Allow gNMI to mgmt routers control-plane [homer/public] - 10https://gerrit.wikimedia.org/r/1133058 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [09:03:41] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T390669 (10Odeline_Marteau1) 03NEW [09:04:29] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1117838 (owner: 10Muehlenhoff) [09:04:48] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T390669#10697535 (10Odeline_Marteau1) [09:04:50] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10697536 (10MoritzMuehlenhoff) [09:05:18] (03CR) 10Ayounsi: [C:03+2] Allow gNMI to mgmt routers control-plane [homer/public] - 10https://gerrit.wikimedia.org/r/1133058 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [09:05:49] (03Merged) 10jenkins-bot: Allow gNMI to mgmt routers control-plane [homer/public] - 10https://gerrit.wikimedia.org/r/1133058 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [09:06:04] (03CR) 10Vgutierrez: "sorry for the late response, I was out on PTO. 
Yes, this looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1130162 (https://phabricator.wikimedia.org/T387569) (owner: 10Bking) [09:06:09] (03PS1) 10Muehlenhoff: Switch ganeti4008 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1133074 [09:07:02] (03CR) 10Muehlenhoff: [C:03+2] Remove use of openstack-db repository component [puppet] - 10https://gerrit.wikimedia.org/r/1117838 (owner: 10Muehlenhoff) [09:08:35] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:10:48] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5184/console" [puppet] - 10https://gerrit.wikimedia.org/r/1128859 (https://phabricator.wikimedia.org/T389181) (owner: 10Hashar) [09:12:03] (03CR) 10Vgutierrez: [C:03+1] "no impact on CDN nodes, so it's ok from my point of view" [puppet] - 10https://gerrit.wikimedia.org/r/1128859 (https://phabricator.wikimedia.org/T389181) (owner: 10Hashar) [09:14:10] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10697564 (10MatthewVernon) [09:14:13] FIRING: [2x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2010:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [09:15:53] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install 
thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10697565 (10MatthewVernon) [09:16:45] (03PS2) 10Ayounsi: Enable gNMI on management routers [homer/public] - 10https://gerrit.wikimedia.org/r/1133060 (https://phabricator.wikimedia.org/T390052) [09:16:45] (03PS2) 10Ayounsi: Enable gNMI on management switches [homer/public] - 10https://gerrit.wikimedia.org/r/1133072 (https://phabricator.wikimedia.org/T390052) [09:16:45] (03PS1) 10Ayounsi: gnmi policy: fix small issue [homer/public] - 10https://gerrit.wikimedia.org/r/1133075 (https://phabricator.wikimedia.org/T390052) [09:17:24] (03CR) 10Brouberol: [C:03+1] "Nevermind, the tests seem to pass!" [alerts] - 10https://gerrit.wikimedia.org/r/1132663 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal) [09:17:36] (03CR) 10Ayounsi: [C:03+2] gnmi policy: fix small issue [homer/public] - 10https://gerrit.wikimedia.org/r/1133075 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [09:18:06] (03Merged) 10jenkins-bot: gnmi policy: fix small issue [homer/public] - 10https://gerrit.wikimedia.org/r/1133075 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [09:18:08] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on - https://phabricator.wikimedia.org/T390669#10697577 (10Aklapper) [09:18:43] FIRING: [2x] ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [09:19:10] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:19:13] FIRING: [4x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - 
https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [09:19:16] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on  - https://phabricator.wikimedia.org/T390669#10697579 (10Aklapper) 05Open→03Declined > Link to site: QGIS That is not a link to a site. > Wikimedia Affiliate supporting project: Odeline-Marteau1 That is not a Wikimedia Affiliate. See... [09:19:28] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:19:46] 06SRE, 10Maps, 06Traffic: Allow Wikimedia Maps usage on  - https://phabricator.wikimedia.org/T390669#10697582 (10Aklapper) a:05Odeline_Marteau1→03None [09:20:11] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [09:20:27] FIRING: [2x] ProbeDown: Service wdqs2010:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2010:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:20:40] (03PS2) 10Btullis: presto: Double the heap size for the coordinator [puppet] - 10https://gerrit.wikimedia.org/r/1132769 (https://phabricator.wikimedia.org/T390623) [09:21:18] (03CR) 10Muehlenhoff: [C:03+2] maps/test: Adapt partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/1133073 (https://phabricator.wikimedia.org/T381565) (owner: 10Muehlenhoff) [09:21:34] (03PS1) 10Btullis: Temporarily exclude an-workers in rack F6 for a hard drive replacement [puppet] - 10https://gerrit.wikimedia.org/r/1133076 (https://phabricator.wikimedia.org/T390169) [09:23:35] FIRING: ErrorBudgetBurn: wdqs - 
wdqs-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:23:36] (03CR) 10Btullis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1133076 (https://phabricator.wikimedia.org/T390169) (owner: 10Btullis) [09:24:13] FIRING: [5x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1018:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [09:24:28] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:24:39] RESOLVED: [4x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [09:25:27] RESOLVED: [4x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:26:48] !log ayounsi@cumin1002 START - Cookbook sre.network.tls for network device mr1-ulsfo [09:27:13] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.network.tls (exit_code=97) for network device mr1-ulsfo 
[09:27:28] 07Puppet, 06Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10697616 (10fgiunchedi) With my pontoon hat on: what I did is basically the same as the suggested `node/data.yaml`, i.e. map roles to (cloud) hostnames and `pontoon-enc` ta... [09:29:13] FIRING: [5x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1018:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [09:30:28] FIRING: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2016:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:32:12] FIRING: [8x] ProbeDown: Service wdqs1019:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:32:16] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2001.codfw.wmnet with OS bookworm [09:32:34] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10697627 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps-test2001.codfw.wmnet with OS bookworm [09:32:57] RESOLVED: [3x] ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_search_sparql_endpoint_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:34:13] RESOLVED: [5x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1018:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [09:34:28] FIRING: [5x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs1018:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [09:34:36] (03CR) 10Hnowlan: [C:03+2] trafficserver: gateway-check ignore list, roll pcs/mobileapps to more wikis [puppet] - 10https://gerrit.wikimedia.org/r/1131748 (https://phabricator.wikimedia.org/T388140) (owner: 10Hnowlan) [09:34:53] (03PS1) 10Isabelle Hurbain-Palatin: Enable Parsoid Read Views for Mobile Front End on dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133078 (https://phabricator.wikimedia.org/T381002) [09:35:13] RESOLVED: [2x] BlazegraphFreeAllocatorsDecreasingRapidly: Blazegraph instance wdqs2007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly [09:35:28] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs2016:9101 has fallen behind applying updates from the RDF Streaming Updater - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [09:36:05] (03PS1) 10Hnowlan: mobileapps: increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133079 (https://phabricator.wikimedia.org/T388140) [09:38:26] !log gmodena@deploy1003 Started deploy [airflow-dags/search@ed0fc78]: Deploy mjolnir-2.7.0.dev.conda.tgz [09:39:22] !log gmodena@deploy1003 Finished deploy [airflow-dags/search@ed0fc78]: Deploy mjolnir-2.7.0.dev.conda.tgz (duration: 01m 29s) [09:42:46] !log volans@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on sretest1001.eqiad.wmnet with reason: test [09:44:11] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10697644 (10Ben.buchenau) Thank you! I am able to login to both tools. So I think you can close the ticket for now. Best, Ben [09:45:07] (03PS1) 10Hashar: Use wikidata familly in $wgCirrusSearchSimilarityProfile [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133081 [09:45:23] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10697665 (10Jelto) 05Open→03Resolved Great thanks for the quick feedback. I'll close the task. 
[09:45:29] (03CR) 10Hashar: Fix wgCirrusSearchSimilarityProfiles (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1125092 (owner: 10Hashar) [09:46:12] (03CR) 10Hashar: "That is a follow up from Timo comment on the fixup of `wgCirrusSearchSimilarityProfiles`: https://gerrit.wikimedia.org/r/c/operations/medi" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133081 (owner: 10Hashar) [09:47:50] (03PS1) 10Isabelle Hurbain-Palatin: Enable Parsoid Read Views to incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133082 (https://phabricator.wikimedia.org/T380768) [09:48:35] FIRING: [2x] ErrorBudgetBurn: wdqs - wdqs-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [09:50:57] !log restart nginx on registry* to pick up the debug changes [09:50:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:51:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr1-codfw:xe-1/1/1:0 (Transport: cr4-ulsfo:xe-0/1/1 (Lumen, 442550294) {#12252_12295-1}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [09:52:50] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2001.codfw.wmnet with reason: host reimage [09:53:06] (03CR) 10Btullis: [C:03+2] Upgrade the Gobblin JAR version to 1.0.6 [puppet] - 10https://gerrit.wikimedia.org/r/1132756 (https://phabricator.wikimedia.org/T390247) (owner: 10Aleksandar Mastilovic) [09:54:00] (03CR) 10TrainBranchBot: [C:03+2] "Approved by akosiaris@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133069 (owner: 10Alexandros Kosiaris) [09:54:48] (03Merged) 10jenkins-bot: typos: Add wnmet as a typo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133069 (owner: 10Alexandros Kosiaris) [09:55:00] !log joal@deploy1003 Started 
deploy [analytics/refinery@efc4808]: Analytics webrequest migration [analytics/refinery@efc48089] [09:55:31] !log akosiaris@deploy1003 Started scap sync-world: Backport for [[gerrit:1133069|typos: Add wnmet as a typo]] [09:56:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2001.codfw.wmnet with reason: host reimage [09:57:01] !log installing freetype security updates [09:57:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:25] !log joal@deploy1003 Finished deploy [analytics/refinery@efc4808]: Analytics webrequest migration [analytics/refinery@efc48089] (duration: 02m 24s) [09:58:06] !log joal@deploy1003 Started deploy [analytics/refinery@efc4808] (thin): Analytics webrequest migration THIN [analytics/refinery@efc48089] [09:58:57] (03PS7) 10Esanders: VE: Enable mobile insert menu everywhere except top 20 mobile VE wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128374 (https://phabricator.wikimedia.org/T388604) [09:59:01] !log joal@deploy1003 Finished deploy [analytics/refinery@efc4808] (thin): Analytics webrequest migration THIN [analytics/refinery@efc48089] (duration: 00m 55s) [09:59:59] !log joal@deploy1003 Started deploy [analytics/refinery@efc4808] (hadoop-test): Analytics webrequest migration TEST [analytics/refinery@efc48089] [10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T1000) [10:00:40] !log joal@deploy1003 Finished deploy [analytics/refinery@efc4808] (hadoop-test): Analytics webrequest migration TEST [analytics/refinery@efc48089] (duration: 00m 40s) [10:03:35] FIRING: [4x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:06:33] (03PS1) 10Ladsgroup: Bump thumbnail steps to 55% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133087 (https://phabricator.wikimedia.org/T360589) [10:08:02] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: ProbeDown (instance ripe-atlas-codfw:0) - https://phabricator.wikimedia.org/T390676 (10LSobanski) 03NEW [10:08:33] !log akosiaris@deploy1003 akosiaris: Backport for [[gerrit:1133069|typos: Add wnmet as a typo]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:09:12] (03PS1) 10Ladsgroup: Remove deprecated old bouncehandler db configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133088 [10:09:12] !log akosiaris@deploy1003 akosiaris: Continuing with sync [10:09:40] (03CR) 10Ladsgroup: "shall we deploy this now?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127082 (owner: 10Reedy) [10:10:08] (03Abandoned) 10Ladsgroup: Remove deprecated old bouncehandler db configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133088 (owner: 10Ladsgroup) [10:10:23] jouncebot: nowandnext [10:10:24] For the next 0 hour(s) and 49 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T1000) [10:10:24] In 1 hour(s) and 49 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T1200) [10:10:38] akosiaris: once you're done, can you let me know? Thank you <3 [10:10:45] Amir1: will do [10:11:09] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, 06serviceops: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666#10697808 (10LSobanski) 05Duplicate→03Open This is separate from the activity in {T387833} so let's keep it open. 
[10:13:02] (03PS1) 10Hnowlan: trafficserver: correct escape non-regex dash, fix wiki typo [puppet] - 10https://gerrit.wikimedia.org/r/1133090 (https://phabricator.wikimedia.org/T388140) [10:13:33] (03CR) 10Joal: [C:03+2] "Let's merge :)" [alerts] - 10https://gerrit.wikimedia.org/r/1132663 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal) [10:14:32] (03PS2) 10Hnowlan: trafficserver: correct escape non-regex dash, fix wiki typo [puppet] - 10https://gerrit.wikimedia.org/r/1133090 (https://phabricator.wikimedia.org/T388140) [10:15:09] (03Merged) 10jenkins-bot: Update data-eng gobblin alert [alerts] - 10https://gerrit.wikimedia.org/r/1132663 (https://phabricator.wikimedia.org/T386177) (owner: 10Joal) [10:15:50] (03CR) 10Jgiannelos: [C:03+1] trafficserver: correct escape non-regex dash, fix wiki typo [puppet] - 10https://gerrit.wikimedia.org/r/1133090 (https://phabricator.wikimedia.org/T388140) (owner: 10Hnowlan) [10:16:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2001.codfw.wmnet with OS bookworm [10:16:34] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10697818 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps-test2001.codfw.wmnet with OS bookworm completed: - maps-test2001 (**PASS**)... 
[10:16:54] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp1004.eqiad.wmnet [10:17:07] !log jiji@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host mc-gp1004.eqiad.wmnet [10:17:30] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@d96f732]: Update artifacts for analytics_test [10:17:42] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@d96f732]: Update artifacts for analytics_test (duration: 00m 12s) [10:17:53] (03CR) 10Hnowlan: [C:03+2] trafficserver: correct escape non-regex dash, fix wiki typo [puppet] - 10https://gerrit.wikimedia.org/r/1133090 (https://phabricator.wikimedia.org/T388140) (owner: 10Hnowlan) [10:17:53] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp1004.eqiad.wmnet [10:18:01] !log aqu@deploy1003 Started deploy [airflow-dags/analytics@d96f732]: Update artifacts for analytics [10:18:43] (03CR) 10Slyngshede: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1133063 (owner: 10Muehlenhoff) [10:18:58] FIRING: RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1015:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:19:00] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics@d96f732]: Update artifacts for analytics (duration: 00m 59s) [10:19:12] FIRING: ProbeDown: Service aux-k8s-ctrl1002:6443 has failed probes (http_aux_k8s_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#aux-k8s-ctrl1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:19:27] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host 
mc-gp2004.codfw.wmnet [10:19:41] !log jiji@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host mc-gp2004.codfw.wmnet [10:20:02] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2002.codfw.wmnet with OS bookworm [10:20:05] <_joe_> is someone doing something with the aux cluster? [10:20:14] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10697827 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps-test2002.codfw.wmnet with OS bookworm [10:20:38] jouncebot: nowandnext [10:20:38] For the next 0 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T1000) [10:20:38] In 1 hour(s) and 39 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T1200) [10:21:01] I’ll probably want to do a deployment in 40 minutes then [10:21:25] <_joe_> !incidents [10:21:25] 5922 (UNACKED) ProbeDown sre (10.64.0.107 ip4 aux-k8s-ctrl1002:6443 probes/custom http_aux_k8s_eqiad_kube_apiserver_ip4 eqiad) [10:21:26] 5921 (RESOLVED) ProbeDown sre (10.2.2.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 eqiad) [10:21:26] 5919 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr1-eqiad.wikimedia.org) [10:21:26] 5920 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (asw2-a-eqiad.mgmt.eqiad.wmnet) [10:21:26] 5918 (RESOLVED) NELHigh sre (thanos-rule tcp.timed_out) [10:21:32] <_joe_> !ack 5922 [10:21:32] 5922 (ACKED) ProbeDown sre (10.64.0.107 ip4 aux-k8s-ctrl1002:6443 probes/custom http_aux_k8s_eqiad_kube_apiserver_ip4 eqiad) [10:23:35] RESOLVED: ErrorBudgetBurn: wdqs - wdqs-update-lag - https://wikitech.wikimedia.org/wiki/Monitoring/ErrorBudgetBurn - https://alerts.wikimedia.org/?q=alertname%3DErrorBudgetBurn [10:23:43] 
RESOLVED: ElevatedMaxLagWDQS: WDQS lag is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DElevatedMaxLagWDQS [10:23:58] RESOLVED: [2x] RdfStreamingUpdaterHighConsumerUpdateLag: wdqs1013:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [10:24:04] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1004.eqiad.wmnet [10:24:12] RESOLVED: ProbeDown: Service aux-k8s-ctrl1002:6443 has failed probes (http_aux_k8s_eqiad_kube_apiserver_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#aux-k8s-ctrl1002:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:25:05] !log akosiaris@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133069|typos: Add wnmet as a typo]] (duration: 29m 34s) [10:25:57] (03CR) 10Brouberol: Temporarily exclude an-workers in rack F6 for a hard drive replacement (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133076 (https://phabricator.wikimedia.org/T390169) (owner: 10Btullis) [10:26:54] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp1005.eqiad.wmnet [10:27:06] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp2004.codfw.wmnet [10:27:28] (03PS2) 10Btullis: Temporarily exclude an-workers in rack F6 for a hard drive replacement [puppet] - 10https://gerrit.wikimedia.org/r/1133076 (https://phabricator.wikimedia.org/T390169) [10:27:56] (03CR) 10Btullis: Temporarily exclude an-workers in rack F6 for a hard drive replacement (031 
comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133076 (https://phabricator.wikimedia.org/T390169) (owner: 10Btullis) [10:28:24] jouncebot: next [10:28:24] In 1 hour(s) and 31 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T1200) [10:30:18] (03CR) 10Slyngshede: [C:03+1] "Still good :-)" [software/bitu] - 10https://gerrit.wikimedia.org/r/1131471 (owner: 10Hashar) [10:31:06] (03CR) 10Slyngshede: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1132019 (owner: 10Hashar) [10:31:38] I've called dibs before everyone else. I'm waiting for akosiaris to finish [10:32:44] (03PS2) 10Clément Goubert: mw::periodic_jobs: Test xargs parallelism [puppet] - 10https://gerrit.wikimedia.org/r/1133089 (https://phabricator.wikimedia.org/T388538) [10:33:17] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1005.eqiad.wmnet [10:33:19] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2004.codfw.wmnet [10:33:28] Please take your ticket at the DMV front office (Deployment and Merge Vaitlist) [10:33:34] Amir1: and done [10:33:36] go ahead [10:33:39] wohooo [10:33:43] Amir1: nonononono [10:34:02] elukey: what's up? [10:34:03] I am formally requesting precedence! :D [10:34:13] because it's you, sure. Go ahead <3 [10:34:33] no I am kidding, I don't have anything to deploy. 
I wanted to see if there were deployments scheduled to watch the docker registry :D [10:34:36] go ahead [10:34:39] <3 [10:35:00] xD [10:35:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133087 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:36:28] (03Merged) 10jenkins-bot: Bump thumbnail steps to 55% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133087 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:36:51] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1133087|Bump thumbnail steps to 55% (T360589)]] [10:36:54] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:37:10] (03CR) 10Btullis: [C:03+2] Temporarily exclude an-workers in rack F6 for a hard drive replacement [puppet] - 10https://gerrit.wikimedia.org/r/1133076 (https://phabricator.wikimedia.org/T390169) (owner: 10Btullis) [10:38:13] 07Puppet, 06Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10697904 (10Joe) >>! In T389932#10697616, @fgiunchedi wrote: > With my pontoon hat on: what I did is basically the same as the suggested `node/data.yaml`, i.e. map roles to... 
[10:40:06] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2002.codfw.wmnet with reason: host reimage [10:43:36] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2002.codfw.wmnet with reason: host reimage [10:43:45] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-test [10:44:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-test [10:45:04] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wdqs-all [10:45:18] Amir1: is the deployment proceeding as expected? [10:45:31] if so I'll take lunch [10:45:44] (meaning, if the registry doesn't make fun things) [10:45:58] it was stuck at pushing to testservers [10:46:07] I don't know why [10:46:37] !log ladsgroup@deploy1003 ladsgroup: Backport for [[gerrit:1133087|Bump thumbnail steps to 55% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:46:39] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:46:53] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2165.codfw.wmnet with reason: Maintenance [10:47:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2165 (T371742)', diff saved to https://phabricator.wikimedia.org/P74522 and previous config saved to /var/cache/conftool/dbconfig/20250401-104659-ladsgroup.json [10:47:02] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [10:47:43] I checked the pods on both eqiad/codfw mw-debug namespaces + events, no sign of the registry issue [10:47:54] so they were not stuck due to that [10:48:02] gooood (sort of), I'll step afk :) [10:48:13] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [10:50:12] !log jiji@cumin1002 
START - Cookbook sre.hosts.reboot-single for host mc-gp1006.eqiad.wmnet [10:50:20] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp2005.codfw.wmnet [10:53:37] maybe my fault, I was hogging a debug pod [10:53:41] my bad [10:54:00] (03PS2) 10Giuseppe Lavagetto: admin: Add simple function to read mw access logs to my .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/1131054 [10:54:18] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2204.codfw.wmnet with reason: Maintenance [10:54:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2204 (T371742)', diff saved to https://phabricator.wikimedia.org/P74523 and previous config saved to /var/cache/conftool/dbconfig/20250401-105425-ladsgroup.json [10:54:27] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [10:54:35] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10697948 (10phaultfinder) [10:55:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.pool (exit_code=0) db1211 slowly with 10 steps - Pool db1211.eqiad.wmnet in after cloning [10:55:06] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db1211.eqiad.wmnet onto db1257.eqiad.wmnet [10:55:09] (03CR) 10Giuseppe Lavagetto: [C:03+2] admin: Add simple function to read mw access logs to my .bashrc [puppet] - 10https://gerrit.wikimedia.org/r/1131054 (owner: 10Giuseppe Lavagetto) [10:56:25] (03PS2) 10Clément Goubert: mw::periodic_jobs: Fix parallel invocation [puppet] - 10https://gerrit.wikimedia.org/r/1133092 (https://phabricator.wikimedia.org/T388538) [10:56:39] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp1006.eqiad.wmnet [10:56:55] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host 
mc-gp2005.codfw.wmnet [10:57:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [10:57:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wdqs-all [10:58:20] (03CR) 10Clément Goubert: [C:03+2] mw::periodic_jobs: Fix parallel invocation [puppet] - 10https://gerrit.wikimedia.org/r/1133092 (https://phabricator.wikimedia.org/T388538) (owner: 10Clément Goubert) [10:58:45] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc-gp2006.codfw.wmnet [10:58:55] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133087|Bump thumbnail steps to 55% (T360589)]] (duration: 22m 03s) [10:58:57] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:59:15] Something is broken in scap. (duration: 22m 03s) [10:59:24] my test took half a minute at most [11:01:07] jouncebot: nowandnext [11:01:07] No deployments scheduled for the next 0 hour(s) and 58 minute(s) [11:01:07] In 0 hour(s) and 58 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T1200) [11:01:18] is it okay for me to deploy something? (cc Amir1) [11:01:51] sure go ahead! 
[11:01:56] ack, thanks [11:02:04] !increase vrrp prio on cr3-ulsfo to switch gw ahead of cr4-ulsfo junos upgrade T364092 [11:02:05] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [11:02:06] !log jmm@cumin2002 START - Cookbook sre.wdqs.restart-nginx-envoy rolling restart_daemons on A:wcqs-public [11:02:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10697968 (10BTullis) a:03BTullis [11:02:56] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10697969 (10BTullis) [11:03:17] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10697970 (10BTullis) [11:04:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2002.codfw.wmnet with OS bookworm [11:04:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.wdqs.restart-nginx-envoy (exit_code=0) rolling restart_daemons on A:wcqs-public [11:04:21] !log cgoubert@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [11:04:24] !log cgoubert@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [11:04:27] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10697971 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps-test2002.codfw.wmnet with OS bookworm completed: - maps-test2002 (**PASS**)... 
[11:05:20] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc-gp2006.codfw.wmnet [11:05:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti4008.ulsfo.wmnet [11:06:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2003.codfw.wmnet with OS bookworm [11:06:31] !log jmm@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on ganeti4008.ulsfo.wmnet with reason: remove from cluster for reimage [11:06:38] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10697975 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps-test2003.codfw.wmnet with OS bookworm [11:06:39] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10697976 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=0a440a7c-23d5-411a-82dc-b35d0662b15f) set by jmm@cumin2002 for 1 day, 0:00:00 on 1 host(s) and the... 
[11:08:35] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:10:11] !log cmooney@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 8 hosts with reason: Upgrade cr4-ulsfo JunOS [11:10:12] (03CR) 10Muehlenhoff: [C:03+2] Switch ganeti4008 to nftables [puppet] - 10https://gerrit.wikimedia.org/r/1133074 (owner: 10Muehlenhoff) [11:10:48] !log upgrading spicerack to v10.0.0 on cumin1002 [11:10:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:08] !log restarting FPM on phab1004 to pick up security update [11:12:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:13:54] !log volans@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:05:00 on sretest1002.eqiad.wmnet with reason: Test [11:14:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T371742)', diff saved to https://phabricator.wikimedia.org/P74525 and previous config saved to /var/cache/conftool/dbconfig/20250401-111415-ladsgroup.json [11:14:18] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [11:15:44] !log Restarted Gerrit replica on gerrit2002 to raise heap from 32G to 64G | T387223 [11:15:44] (03PS1) 10Btullis: Add the k alias to kubectl for btullis [puppet] - 10https://gerrit.wikimedia.org/r/1133095 [11:15:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:47] T387223: Remove explicit enablement of G1 garbage collector for Gerrit - https://phabricator.wikimedia.org/T387223 [11:16:12] !log reboot cr4-ulsfo to upgrade JunOS T364092 [11:16:13] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:14] T364092: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092 [11:16:45] !log jmm@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ganeti4008.ulsfo.wmnet [11:17:02] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [11:17:21] * Lucas_WMDE deploying mediawiki [11:17:33] (03PS1) 10Clément Goubert: mw-cron: Fix kubectl invocation in alert description [alerts] - 10https://gerrit.wikimedia.org/r/1133096 (https://phabricator.wikimedia.org/T385709) [11:18:32] (03CR) 10Btullis: [C:03+2] Add the k alias to kubectl for btullis [puppet] - 10https://gerrit.wikimedia.org/r/1133095 (owner: 10Btullis) [11:18:49] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1037.eqiad.wmnet [11:19:45] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2038.codfw.wmnet [11:19:47] jouncebot: now [11:19:48] No deployments scheduled for the next 0 hour(s) and 40 minute(s) [11:21:56] (03CR) 10Filippo Giunchedi: [C:03+1] "I'm kubectl-noob but LGTM" [alerts] - 10https://gerrit.wikimedia.org/r/1133096 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [11:22:31] !log Restarting Gerrit [11:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:22] !log installing squid security updates [11:24:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:24:37] 10ops-eqiad, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390682 (10phaultfinder) 03NEW [11:24:58] (03CR) 10Clément Goubert: [C:03+2] mw-cron: Fix kubectl invocation in alert description [alerts] - 10https://gerrit.wikimedia.org/r/1133096 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [11:25:31] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host mc1037.eqiad.wmnet [11:25:36] (03CR) 10Muehlenhoff: [C:03+2] Update approvers for airflow-ml-ops [puppet] - 10https://gerrit.wikimedia.org/r/1133063 (owner: 10Muehlenhoff) [11:26:23] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2038.codfw.wmnet [11:26:37] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2003.codfw.wmnet with reason: host reimage [11:27:05] (03Merged) 10jenkins-bot: mw-cron: Fix kubectl invocation in alert description [alerts] - 10https://gerrit.wikimedia.org/r/1133096 (https://phabricator.wikimedia.org/T385709) (owner: 10Clément Goubert) [11:29:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P74526 and previous config saved to /var/cache/conftool/dbconfig/20250401-112921-ladsgroup.json [11:31:20] (03PS6) 10Filippo Giunchedi: mediawiki-global: add alerts for too many login attempts [alerts] - 10https://gerrit.wikimedia.org/r/1132580 (owner: 10Giuseppe Lavagetto) [11:31:29] (03CR) 10Filippo Giunchedi: mediawiki-global: add alerts for too many login attempts (035 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1132580 (owner: 10Giuseppe Lavagetto) [11:32:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2003.codfw.wmnet with reason: host reimage [11:34:12] FIRING: [5x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr4-ulsfo (198.35.26.193) - group Confed_ulsfo - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:34:33] !log Deployed patch for T389369 [11:34:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:37] * Lucas_WMDE done deploying [11:38:39] RESOLVED: [5x] CoreBGPDown: Core BGP session down between cr1-eqiad and cr4-ulsfo (198.35.26.193) - group Confed_ulsfo - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [11:40:25] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:44:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204', diff saved to https://phabricator.wikimedia.org/P74527 and previous config saved to /var/cache/conftool/dbconfig/20250401-114428-ladsgroup.json [11:46:56] (03PS1) 10Muehlenhoff: failover eqiad urldownloader for security update [dns] - 10https://gerrit.wikimedia.org/r/1133108 [11:51:39] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade core routers to Junos 23.4R2 - https://phabricator.wikimedia.org/T364092#10698172 (10cmooney) [11:52:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2003.codfw.wmnet with OS bookworm [11:52:38] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10698175 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps-test2003.codfw.wmnet with OS bookworm completed: - maps-test2003 (**PASS**)... 
[11:54:24] (03CR) 10Slyngshede: [C:03+2] Adjust build.sh for other environments [software/bitu] - 10https://gerrit.wikimedia.org/r/1131452 (owner: 10Hashar) [11:57:08] (03Merged) 10jenkins-bot: Adjust build.sh for other environments [software/bitu] - 10https://gerrit.wikimedia.org/r/1131452 (owner: 10Hashar) [11:57:39] (03CR) 10Slyngshede: [C:03+2] tox: allow passing arguments to django/flake8 [software/bitu] - 10https://gerrit.wikimedia.org/r/1131453 (owner: 10Hashar) [11:57:42] (03PS1) 10Clare Ming: xLab: : Deploying to staging and production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133111 (https://phabricator.wikimedia.org/T390681) [11:59:00] (03CR) 10Santiago Faci: [C:03+2] "Looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133111 (https://phabricator.wikimedia.org/T390681) (owner: 10Clare Ming) [11:59:06] (03PS1) 10Filippo Giunchedi: docker-registry: log nginx debug to separate file [puppet] - 10https://gerrit.wikimedia.org/r/1133112 [11:59:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2204 (T371742)', diff saved to https://phabricator.wikimedia.org/P74528 and previous config saved to /var/cache/conftool/dbconfig/20250401-115935-ladsgroup.json [11:59:39] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [12:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T1200) [12:00:16] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "LGTM, no significant references left in deployed branches:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130299 (owner: 10Bartosz Dziewoński) [12:00:17] (03Merged) 10jenkins-bot: tox: allow passing arguments to django/flake8 [software/bitu] - 10https://gerrit.wikimedia.org/r/1131453 (owner: 10Hashar) [12:00:17] (03Merged) 10jenkins-bot: xLab: : Deploying to staging and production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1133111 (https://phabricator.wikimedia.org/T390681) (owner: 10Clare Ming) [12:00:22] (03PS2) 10Filippo Giunchedi: docker-registry: log nginx debug to separate file [puppet] - 10https://gerrit.wikimedia.org/r/1133112 (https://phabricator.wikimedia.org/T390251) [12:02:04] (03CR) 10Slyngshede: [C:03+2] tox: consolidate flake8 config to a single location [software/bitu] - 10https://gerrit.wikimedia.org/r/1131455 (owner: 10Hashar) [12:02:07] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2039.codfw.wmnet [12:02:19] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [12:02:23] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1038.eqiad.wmnet [12:02:44] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [12:04:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2004.codfw.wmnet with OS bookworm [12:04:21] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10698261 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps-test2004.codfw.wmnet with OS bookworm [12:04:52] (03Merged) 10jenkins-bot: tox: consolidate flake8 config to a single location [software/bitu] - 10https://gerrit.wikimedia.org/r/1131455 (owner: 10Hashar) [12:05:47] (03PS1) 10Isabelle Hurbain-Palatin: Enable Parsoid Read Views on 12 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133113 (https://phabricator.wikimedia.org/T390680) [12:06:56] (03CR) 10Ayounsi: [C:03+2] Enable gNMI on management routers [homer/public] - 10https://gerrit.wikimedia.org/r/1133060 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [12:07:05] (03CR) 10Ayounsi: [C:03+2] Enable gNMI on management switches [homer/public] - 10https://gerrit.wikimedia.org/r/1133072 
(https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [12:07:30] (03Merged) 10jenkins-bot: Enable gNMI on management routers [homer/public] - 10https://gerrit.wikimedia.org/r/1133060 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [12:07:41] (03Merged) 10jenkins-bot: Enable gNMI on management switches [homer/public] - 10https://gerrit.wikimedia.org/r/1133072 (https://phabricator.wikimedia.org/T390052) (owner: 10Ayounsi) [12:08:16] !log dbrant@deploy1003 helmfile [staging] START helmfile.d/services/wikifeeds: apply [12:08:44] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2039.codfw.wmnet [12:08:51] !log dbrant@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [12:08:58] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1038.eqiad.wmnet [12:11:19] !log dbrant@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifeeds: apply [12:11:46] !log dbrant@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifeeds: apply [12:11:59] !log dbrant@deploy1003 helmfile [codfw] START helmfile.d/services/wikifeeds: apply [12:12:30] !log dbrant@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifeeds: apply [12:12:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:et-0/0/0 (Core: cr4-ulsfo:et-0/0/0 {#1073}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [12:14:34] (03CR) 10Stevemunene: [C:03+1] presto: Double the heap size for the coordinator [puppet] - 10https://gerrit.wikimedia.org/r/1132769 (https://phabricator.wikimedia.org/T390623) (owner: 10Btullis) [12:14:57] (03PS2) 10Thiemo Kreuz (WMDE): [beta] Start using Cite's Community Configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133119 (https://phabricator.wikimedia.org/T385597) [12:15:53] (03CR) 10ScheduleDeploymentBot: "Scheduled for 
deployment in the [Thursday, April 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133082 (https://phabricator.wikimedia.org/T380768) (owner: 10Isabelle Hurbain-Palatin) [12:16:18] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133078 (https://phabricator.wikimedia.org/T381002) (owner: 10Isabelle Hurbain-Palatin) [12:16:22] (03PS1) 10Jelto: trafficserver: switch /querybuilder to wikikube miscweb [puppet] - 10https://gerrit.wikimedia.org/r/1133120 (https://phabricator.wikimedia.org/T350793) [12:16:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133078 (https://phabricator.wikimedia.org/T381002) (owner: 10Isabelle Hurbain-Palatin) [12:17:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133082 (https://phabricator.wikimedia.org/T380768) (owner: 10Isabelle Hurbain-Palatin) [12:19:01] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133113 (https://phabricator.wikimedia.org/T390680) (owner: 10Isabelle Hurbain-Palatin) [12:20:02] (03PS1) 10Jelto: wikidata-query-builder: bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133122 (https://phabricator.wikimedia.org/T350793) [12:22:22] (03CR) 10SBassett: mediawiki-global: add alerts for too 
many login attempts (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1132580 (owner: 10Giuseppe Lavagetto) [12:23:02] !log installing PHP 7.4 security updates (as shipped in Debian, not our internal build running on a few remaining edge cases) [12:23:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:36] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2004.codfw.wmnet with reason: host reimage [12:27:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2004.codfw.wmnet with reason: host reimage [12:27:30] (03PS1) 10Clare Ming: Disable experiment-related config during active development [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133124 [12:29:08] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet [12:30:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T371742)', diff saved to https://phabricator.wikimedia.org/P74529 and previous config saved to /var/cache/conftool/dbconfig/20250401-123009-ladsgroup.json [12:30:12] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [12:34:42] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1039.eqiad.wmnet [12:34:56] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2040.codfw.wmnet [12:35:42] (03PS1) 10Federico Ceratto: zarcillo.py: basic demo of populating zarcillo test tables [cookbooks] - 10https://gerrit.wikimedia.org/r/1133125 (https://phabricator.wikimedia.org/T257814) [12:36:42] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 01 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployca" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133124 (owner: 10Clare Ming) [12:39:28] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet [12:39:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ganeti4008.ulsfo.wmnet [12:39:36] !log fceratto@cumin1002 START - Cookbook sre.mysql.zarcillo [12:40:49] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1039.eqiad.wmnet [12:41:07] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.zarcillo (exit_code=99) [12:41:33] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2040.codfw.wmnet [12:41:42] !log fceratto@cumin1002 START - Cookbook sre.mysql.zarcillo [12:42:14] !log fceratto@cumin1002 END (FAIL) - Cookbook sre.mysql.zarcillo (exit_code=99) [12:42:28] !log fceratto@cumin1002 START - Cookbook sre.mysql.zarcillo [12:42:38] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.zarcillo (exit_code=0) [12:43:31] !log fceratto@cumin1002 START - Cookbook sre.mysql.zarcillo [12:43:37] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.zarcillo (exit_code=0) [12:44:40] !log fceratto@cumin1002 START - Cookbook sre.mysql.zarcillo [12:44:42] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.zarcillo (exit_code=0) [12:45:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P74530 and previous config saved to /var/cache/conftool/dbconfig/20250401-124516-ladsgroup.json [12:47:14] !log fceratto@cumin1002 START - Cookbook sre.mysql.zarcillo [12:47:16] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.zarcillo (exit_code=0) [12:47:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2004.codfw.wmnet with OS bookworm [12:47:36] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10698647 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps-test2004.codfw.wmnet with OS bookworm completed: - maps-test2004 (**PASS**)... [12:48:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2005.codfw.wmnet with OS bookworm [12:48:40] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10698651 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps-test2005.codfw.wmnet with OS bookworm [12:50:05] (03PS1) 10Slyngshede: Pull in BituLDAP [software/bitu] - 10https://gerrit.wikimedia.org/r/1133131 [12:50:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti4008.ulsfo.wmnet with OS bookworm [12:50:47] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10698685 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti4008.ulsfo.wmnet with OS bookworm [12:55:29] (03PS1) 10Cathal Mooney: Include base_paths when initialising the plugin class [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1133133 (https://phabricator.wikimedia.org/T310577) [12:59:34] 10ops-eqiad, 06DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390709 (10phaultfinder) 03NEW [13:00:04] Lucas_WMDE, Urbanecm, and TheresNoTime: Your horoscope predicts another UTC afternoon backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T1300). [13:00:05] MatmaRex and phuedx: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:14] hi [13:00:23] (03CR) 10Elukey: [C:03+2] "Tested manually on registry1004!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1133112 (https://phabricator.wikimedia.org/T390251) (owner: 10Filippo Giunchedi) [13:00:23] o/ [13:00:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P74533 and previous config saved to /var/cache/conftool/dbconfig/20250401-130023-ladsgroup.json [13:00:24] my change today is a no-op [13:00:28] o/ [13:00:40] I can deploy! [13:00:58] Lucas_WMDE: o/ do you mind to wait a minute before proceeding? [13:01:01] sure [13:01:08] I need to look at phuedx’ change first anyway [13:01:15] I need to restart the nginx daemons on the docker registry hosts [13:01:19] thanks a lot [13:02:42] (03CR) 10Volans: [C:03+1] "LGTM, to be deployed to prod together with the new homer release" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1133133 (https://phabricator.wikimedia.org/T310577) (owner: 10Cathal Mooney) [13:02:56] (03PS2) 10Lucas Werkmeister (WMDE): Disable experiment-related config during active development [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133124 (owner: 10Clare Ming) [13:04:39] (03CR) 10Lucas Werkmeister (WMDE): [C:03+1] "(added trailing commas to make the diff more readable)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133124 (owner: 10Clare Ming) [13:04:58] !log msw2-eqiad> restart jsd gracefully - T390052 [13:05:00] both changes look good to me, I think we can just deploy them together (once the nginx stuff is done) [13:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:00] T390052: Enable gNMI on SRX devices and fasw - https://phabricator.wikimedia.org/T390052 [13:05:06] (03PS1) 10Muehlenhoff: Add pbuilder hook for ECH builds [puppet] - 10https://gerrit.wikimedia.org/r/1133135 (https://phabricator.wikimedia.org/T205378) [13:05:40] !log restart nginx on registry* to pick up https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133112 - debug logs to 
/var/log/nginx/debug.log - T390251 [13:05:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:42] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [13:05:57] Lucas_WMDE: 2 mins and I should be finished [13:08:06] (03PS1) 10Gerrit maintenance bot: Add nup to langlist helper [dns] - 10https://gerrit.wikimedia.org/r/1133136 (https://phabricator.wikimedia.org/T390384) [13:08:25] Lucas_WMDE: done! [13:08:41] ok! [13:08:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130299 (owner: 10Bartosz Dziewoński) [13:08:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133124 (owner: 10Clare Ming) [13:09:19] (03CR) 10Elukey: [C:03+1] failover eqiad urldownloader for security update [dns] - 10https://gerrit.wikimedia.org/r/1133108 (owner: 10Muehlenhoff) [13:09:25] FIRING: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on registry2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:09:54] (03Merged) 10jenkins-bot: Remove 'exception-json' logging channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130299 (owner: 10Bartosz Dziewoński) [13:09:55] (03Merged) 10jenkins-bot: Disable experiment-related config during active development [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133124 (owner: 10Clare Ming) [13:10:20] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1130299|Remove 'exception-json' logging channel]], [[gerrit:1133124|Disable experiment-related config during active development]] [13:10:44] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 
2:00:00 on ganeti4008.ulsfo.wmnet with reason: host reimage [13:11:15] (03CR) 10Cathal Mooney: [C:03+1] "Idea of moving it is good I think, and the logic makes sense to me. If it's tested as working let's do it. I think we may be able to sim" [software/homer] - 10https://gerrit.wikimedia.org/r/1124437 (owner: 10Ayounsi) [13:13:37] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2005.codfw.wmnet with reason: host reimage [13:14:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ganeti4008.ulsfo.wmnet with reason: host reimage [13:15:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T371742)', diff saved to https://phabricator.wikimedia.org/P74534 and previous config saved to /var/cache/conftool/dbconfig/20250401-131530-ladsgroup.json [13:15:33] T371742: Change page.page_links_updated to fixed-length timestamp in wmf wikis - https://phabricator.wikimedia.org/T371742 [13:15:34] (03CR) 10Giuseppe Lavagetto: "Overall LGTM, minus a couple changes I will make myself. 
I'm slightly unsure with the thresholds we set because these metrics don't have a" [alerts] - 10https://gerrit.wikimedia.org/r/1132580 (owner: 10Giuseppe Lavagetto) [13:16:41] (03CR) 10Reedy: "Yeah, should be GTG whenever" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127082 (owner: 10Reedy) [13:16:46] (03PS2) 10Reedy: CommmonSettings: Remove old BounceHandler DB config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127082 [13:17:33] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, cjming, matmarex: Backport for [[gerrit:1130299|Remove 'exception-json' logging channel]], [[gerrit:1133124|Disable experiment-related config during active development]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:17:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2005.codfw.wmnet with reason: host reimage [13:17:48] MatmaRex, phuedx: if there’s anything to test, please do so now :) [13:18:29] !log installing python-cryptohgraphy security updates [13:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:39] !log installing python-cryptography security updates [13:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:13] Lucas_WMDE: nothing to test [13:20:16] ack [13:20:20] (sorry, i looked away for a minute) [13:20:27] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2041.codfw.wmnet [13:20:33] np, I suspected as much from your earlier message ^^ [13:20:42] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1040.eqiad.wmnet [13:20:52] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1160.eqiad.wmnet with reason: Maintenance [13:20:55] (still waiting for phuedx because it’s not clear to me if that change is testable or not, it looks like it might be) [13:20:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 
'Depooling db1160 (T370903)', diff saved to https://phabricator.wikimedia.org/P74536 and previous config saved to /var/cache/conftool/dbconfig/20250401-132059-ladsgroup.json [13:21:02] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:21:21] Lucas_WMDE: LGTM. Enabled verbose logging. Checked the logs and they seem clean (nothing from MetricsPlatform ext.) [13:21:27] yay [13:21:28] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde, cjming, matmarex: Continuing with sync [13:21:30] thanks :) [13:23:23] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T387829#10698970 (10Jhancock.wm) 05Open→03Declined [13:24:01] !log ladsgroup@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2165.codfw.wmnet with reason: Maintenance [13:24:07] (03PS1) 10Isabelle Hurbain-Palatin: Enable Parsoid Read Views to incubator and dagwiki mobile frontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133141 (https://phabricator.wikimedia.org/T380768) [13:24:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2165 (T370903)', diff saved to https://phabricator.wikimedia.org/P74537 and previous config saved to /var/cache/conftool/dbconfig/20250401-132407-ladsgroup.json [13:24:35] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10698979 (10phaultfinder) [13:25:20] 10ops-codfw, 06SRE, 06DC-Ops: InboundInterfaceErrors - https://phabricator.wikimedia.org/T390008#10698980 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [13:25:21] (03Abandoned) 10Isabelle Hurbain-Palatin: Enable Parsoid Read Views to incubator [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133082 (https://phabricator.wikimedia.org/T380768) (owner: 10Isabelle Hurbain-Palatin) [13:25:33] (03Abandoned) 10Isabelle Hurbain-Palatin: Enable 
Parsoid Read Views for Mobile Front End on dagwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133078 (https://phabricator.wikimedia.org/T381002) (owner: 10Isabelle Hurbain-Palatin) [13:26:25] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1040.eqiad.wmnet [13:26:45] (03PS1) 10Elukey: services: use the kafka svc endpoint for Tegola [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133142 (https://phabricator.wikimedia.org/T373115) [13:27:04] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2041.codfw.wmnet [13:28:25] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1130299|Remove 'exception-json' logging channel]], [[gerrit:1133124|Disable experiment-related config during active development]] (duration: 18m 04s) [13:28:47] (03CR) 10Marostegui: [C:03+1] "I tested this and it worked well" [cookbooks] - 10https://gerrit.wikimedia.org/r/1132618 (https://phabricator.wikimedia.org/T390217) (owner: 10Federico Ceratto) [13:28:57] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for cp2035.mgmt:22 - https://phabricator.wikimedia.org/T390658#10699012 (10Jhancock.wm) i prefer this. the individual issues makes it easier to communicate and easier to track how many are happening at once. 
[13:29:12] !log UTC afternoon backport+config window done [13:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:16] * Lucas_WMDE done deploying [13:29:54] thanks [13:30:45] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, April 03 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deployc" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133141 (https://phabricator.wikimedia.org/T380768) (owner: 10Isabelle Hurbain-Palatin) [13:33:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ganeti4008.ulsfo.wmnet with OS bookworm [13:33:52] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10699051 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host ganeti4008.ulsfo.wmnet with OS bookworm completed: - ganeti4008 (**PASS*... [13:34:06] Lucas_WMDE: Thanks [13:34:15] np :) [13:35:42] !log fceratto@cumin1002 START - Cookbook sre.mysql.zarcillo [13:35:43] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.zarcillo (exit_code=0) [13:37:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T370903)', diff saved to https://phabricator.wikimedia.org/P74539 and previous config saved to /var/cache/conftool/dbconfig/20250401-133707-ladsgroup.json [13:37:12] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [13:37:12] !log fceratto@cumin1002 START - Cookbook sre.mysql.zarcillo [13:37:14] !log fceratto@cumin1002 END (PASS) - Cookbook sre.mysql.zarcillo (exit_code=0) [13:38:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2005.codfw.wmnet with OS bookworm [13:38:31] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Test hot disk swap on Supermicro database hosts - 
https://phabricator.wikimedia.org/T388684#10699063 (10Papaul) @Marostegui I do agree with you that pulling the disk and inserting the disk is not a real disk test failure. We can absolute force ourselves to mark the disk... [13:38:41] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10699064 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jmm@cumin2002 for host maps-test2005.codfw.wmnet with OS bookworm completed: - maps-test2005 (**PASS**)... [13:38:59] (03PS7) 10Giuseppe Lavagetto: mediawiki-global: add alerts for too many login attempts [alerts] - 10https://gerrit.wikimedia.org/r/1132580 [13:39:40] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390682#10699077 (10phaultfinder) [13:39:46] !log restart nginx on registry2005 - stuck writing error logs [13:39:46] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for cp2035.mgmt:22 - https://phabricator.wikimedia.org/T390658#10699083 (10Jhancock.wm) @ssingh this game up again. it had this issue in august of last year. T372160. the same remedy should apply. I'm hesitant to do anything more intensive since the serve... [13:39:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T370903)', diff saved to https://phabricator.wikimedia.org/P74540 and previous config saved to /var/cache/conftool/dbconfig/20250401-133954-ladsgroup.json [13:40:09] (03CR) 10Ladsgroup: [C:03+1] Configure virtual terms db for wikidata prod & test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131353 (https://phabricator.wikimedia.org/T389190) (owner: 10Jakob) [13:40:37] (03CR) 10Ladsgroup: [C:03+1] "Do you need help deploying?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131353 (https://phabricator.wikimedia.org/T389190) (owner: 10Jakob) [13:40:59] (03CR) 10Giuseppe Lavagetto: [C:03+1] sessionstore-resources: add SessionStoreDiskSpaceUtilizationTooHigh [alerts] - 10https://gerrit.wikimedia.org/r/1132775 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French) [13:41:47] (03CR) 10Giuseppe Lavagetto: [C:03+2] mediawiki-global: add alerts for too many login attempts (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1132580 (owner: 10Giuseppe Lavagetto) [13:41:54] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10699103 (10Papaul) @Jhancock.wm in the the main time can you please check if we do have a 1.92TB INTEL disk on site if @Marostegui wants to perform the test above? Thanks [13:42:58] (03Merged) 10jenkins-bot: mediawiki-global: add alerts for too many login attempts [alerts] - 10https://gerrit.wikimedia.org/r/1132580 (owner: 10Giuseppe Lavagetto) [13:44:05] !log sukhe@puppetserver1001 conftool action : set/pooled=no; selector: name=cp2035.codfw.wmnet [reason: T390658] [13:44:07] T390658: Unresponsive management for cp2035.mgmt:22 - https://phabricator.wikimedia.org/T390658 [13:44:33] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for cp2035.mgmt:22 - https://phabricator.wikimedia.org/T390658#10699135 (10ssingh) >>! In T390658#10699081, @Jhancock.wm wrote: > @ssingh this came up again. it had this issue in august of last year. T372160. the same remedy should apply. I'm hesitant to... [13:46:12] (03CR) 10Hashar: "I don't have +2 rights on Puppet, thus feel free to merge it unless you are looking for a review by @jhathaway@wikimedia.org ." 
[puppet] - 10https://gerrit.wikimedia.org/r/1128859 (https://phabricator.wikimedia.org/T389181) (owner: 10Hashar) [13:46:40] (03CR) 10Eevans: sessionstore-resources: add SessionStoreDiskSpaceUtilizationTooHigh (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1132775 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French) [13:48:38] !log depool registry2005 to investigate some nginx logging issue [13:48:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P74542 and previous config saved to /var/cache/conftool/dbconfig/20250401-135215-ladsgroup.json [13:53:22] !log elukey@cumin1002 START - Cookbook sre.hosts.reboot-single for host registry2005.codfw.wmnet [13:54:16] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for cp2035.mgmt:22 - https://phabricator.wikimedia.org/T390658#10699200 (10Papaul) @fgiunchedi thank you [13:55:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P74543 and previous config saved to /var/cache/conftool/dbconfig/20250401-135501-ladsgroup.json [13:55:02] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731 (10cmooney) 03NEW p:05Triage→03High [13:57:13] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Wednesday, April 02 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploy" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131353 (https://phabricator.wikimedia.org/T389190) (owner: 10Jakob) [13:58:27] (03CR) 10Xcollazo: [C:03+1] presto: Double the heap size for the coordinator [puppet] - 10https://gerrit.wikimedia.org/r/1132769 (https://phabricator.wikimedia.org/T390623) (owner: 10Btullis) [13:59:07] (03CR) 
10Jakob: "Not needed, I think. I scheduled it to be deployed tomorrow afternoon. Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1131353 (https://phabricator.wikimedia.org/T389190) (owner: 10Jakob) [14:00:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host maps-test2006.codfw.wmnet with OS bookworm [14:00:54] 06SRE, 07SRE-Unowned, 10Maps, 13Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10699243 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host maps-test2006.codfw.wmnet with OS bookworm [14:01:44] (03PS1) 10Elukey: docker_registry_ha: remove debug logging [puppet] - 10https://gerrit.wikimedia.org/r/1133149 (https://phabricator.wikimedia.org/T390251) [14:02:50] !log elukey@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host registry2005.codfw.wmnet [14:03:22] (03CR) 10Elukey: [C:03+2] "sigh" [puppet] - 10https://gerrit.wikimedia.org/r/1133149 (https://phabricator.wikimedia.org/T390251) (owner: 10Elukey) [14:03:37] FIRING: [4x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:03:53] 10ops-eqiad, 06SRE, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T388236#10699263 (10Jclark-ctr) 05Open→03Resolved a:03Jclark-ctr duplicate ticket [14:05:38] !log roll restart nginx on registry* to remove debug logging - too much data, filling up the root partition [14:05:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:55] RESOLVED: SystemdUnitFailed: prometheus-dpkg-success-textfile.service on registry2005:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:06:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti4008.ulsfo.wmnet [14:07:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P74544 and previous config saved to /var/cache/conftool/dbconfig/20250401-140721-ladsgroup.json [14:07:59] (03CR) 10Btullis: [C:03+2] presto: Double the heap size for the coordinator [puppet] - 10https://gerrit.wikimedia.org/r/1132769 (https://phabricator.wikimedia.org/T390623) (owner: 10Btullis) [14:10:09] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160', diff saved to https://phabricator.wikimedia.org/P74545 and previous config saved to /var/cache/conftool/dbconfig/20250401-141008-ladsgroup.json [14:12:39] (03CR) 10Muehlenhoff: [C:03+1] "LGTM." [software/bitu] - 10https://gerrit.wikimedia.org/r/1133131 (owner: 10Slyngshede) [14:14:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti4008.ulsfo.wmnet [14:14:43] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for cp2035.mgmt:22 - https://phabricator.wikimedia.org/T390658#10699305 (10Jhancock.wm) @ssingh server has been power cycled. can ssh into it and it mgmt/network both ping. [14:14:45] (03CR) 10Federico Ceratto: [C:03+1] "yay, merging!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1132618 (https://phabricator.wikimedia.org/T390217) (owner: 10Federico Ceratto) [14:14:46] (03CR) 10Federico Ceratto: [C:03+2] clone.py: skip dbctl addition on --nopool [cookbooks] - 10https://gerrit.wikimedia.org/r/1132618 (https://phabricator.wikimedia.org/T390217) (owner: 10Federico Ceratto) [14:15:05] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti4008.ulsfo.wmnet to cluster ulsfo and group 1 [14:15:16] jouncebot: nowandnext [14:15:16] No deployments scheduled for the next 0 hour(s) and 44 minute(s) [14:15:16] In 0 hour(s) and 44 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T1500) [14:15:20] awesome [14:15:30] (03CR) 10Federico Ceratto: sanitarium_restart.py: restart Sanitarium hosts (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1131954 (https://phabricator.wikimedia.org/T363665) (owner: 10Federico Ceratto) [14:16:01] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127082 (owner: 10Reedy) [14:16:53] Reedy: we need a tiny bit more clean up for this: https://codesearch.wmcloud.org/deployed/?q=BounceHandlerCluster&files=&excludeFiles=&repos= [14:16:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.addnode (exit_code=0) for new host ganeti4008.ulsfo.wmnet to cluster ulsfo and group 1 [14:17:09] (03Merged) 10jenkins-bot: CommmonSettings: Remove old BounceHandler DB config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1127082 (owner: 10Reedy) [14:17:23] 10ops-codfw, 06SRE, 06DC-Ops: Unresponsive management for cp2035.mgmt:22 - https://phabricator.wikimedia.org/T390658#10699307 (10ssingh) 05Open→03Resolved a:03ssingh Thanks for the help @Jhancock.wm! 
[14:17:32] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1127082|CommmonSettings: Remove old BounceHandler DB config]] [14:18:02] 06SRE, 10Ganeti, 06Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10699315 (10MoritzMuehlenhoff) [14:19:49] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10699337 (10Marostegui) >>! In T388684#10699063, @Papaul wrote: > @Marostegui I do agree with you that pulling the disk and inserting the disk is not a real disk test failure. We... [14:20:03] (03CR) 10Subramanya Sastry: Enable Parsoid Read Views on 12 wiktionaries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133113 (https://phabricator.wikimedia.org/T390680) (owner: 10Isabelle Hurbain-Palatin) [14:20:53] Amir1: Can we do `$wgVirtualDomainsMapping['virtual-bouncehandler'] = false;` [14:21:32] I think we need unset but not sure honestly, better safe than sorry? 
[14:22:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T370903)', diff saved to https://phabricator.wikimedia.org/P74546 and previous config saved to /var/cache/conftool/dbconfig/20250401-142228-ladsgroup.json [14:22:31] T370903: Remove cuc_actiontext, cuc_only_for_read_old, and cuc_private from cu_changes on WMF wikis - https://phabricator.wikimedia.org/T370903 [14:22:33] (03PS1) 10Reedy: CommonSettings-labs: Update BounceHandler config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133156 [14:22:53] (03PS1) 10Michael Große: homepage: Add `homepage_transfersize_bytes_total` metric [extensions/GrowthExperiments] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133157 (https://phabricator.wikimedia.org/T382003) [14:23:08] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on maps-test2006.codfw.wmnet with reason: host reimage [14:23:11] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.22) - 10https://gerrit.wikimedia.org/r/1133157 (https://phabricator.wikimedia.org/T382003) (owner: 10Michael Große) [14:23:28] (03PS1) 10Michael Große: homepage: Add `homepage_transfersize_bytes_total` metric [extensions/GrowthExperiments] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133158 (https://phabricator.wikimedia.org/T382003) [14:23:38] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/GrowthExperiments] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1133158 (https://phabricator.wikimedia.org/T382003) (owner: 10Michael Große) [14:24:08] !log ladsgroup@deploy1003 reedy, ladsgroup: Backport for [[gerrit:1127082|CommmonSettings: Remove old BounceHandler DB config]] synced to the testservers 
(https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:24:21] (PS2) Isabelle Hurbain-Palatin: Enable Parsoid Read Views on 13 wiktionaries [mediawiki-config] - https://gerrit.wikimedia.org/r/1133113 (https://phabricator.wikimedia.org/T390680)
[14:24:58] (PS4) Bking: cirrus: test rename of single host elastic2055 [puppet] - https://gerrit.wikimedia.org/r/1132772 (https://phabricator.wikimedia.org/T388610) (owner: Ryan Kemper)
[14:25:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1160 (T370903)', diff saved to https://phabricator.wikimedia.org/P74547 and previous config saved to /var/cache/conftool/dbconfig/20250401-142516-ladsgroup.json
[14:26:00] !log ladsgroup@deploy1003 reedy, ladsgroup: Continuing with sync
[14:26:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on maps-test2006.codfw.wmnet with reason: host reimage
[14:26:29] (CR) Kamila Součková: [C:+2] profile::kubernetes::client: install kubectl 1.31 [puppet] - https://gerrit.wikimedia.org/r/1128467 (https://phabricator.wikimedia.org/T388388) (owner: JMeybohm)
[14:27:09] (PS5) Bking: cirrus: test rename of single host elastic2055 [puppet] - https://gerrit.wikimedia.org/r/1132772 (https://phabricator.wikimedia.org/T388610) (owner: Ryan Kemper)
[14:27:53] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[14:28:12] !log cjming@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[14:28:31] (CR) Bking: [C:+2] cirrus: test rename of single host elastic2055 [puppet] - https://gerrit.wikimedia.org/r/1132772 (https://phabricator.wikimedia.org/T388610) (owner: Ryan Kemper)
[14:28:54] (CR) Filippo Giunchedi: [C:+1] "LGTM, though please consider routing everything to -critical receiver, unless I'm missing something ?"
[puppet] - https://gerrit.wikimedia.org/r/1132673 (https://phabricator.wikimedia.org/T385782) (owner: Clément Goubert)
[14:31:59] !log bking@cumin2002 START - Cookbook sre.hosts.rename from elastic2055 to cirrussearch2055
[14:32:21] (PS1) Alexandros Kosiaris: wikifunctions: Remove ports from httproutes [deployment-charts] - https://gerrit.wikimedia.org/r/1133161 (https://phabricator.wikimedia.org/T384944)
[14:32:21] !log bking@cumin2002 START - Cookbook sre.dns.netbox
[14:33:00] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1127082|CommmonSettings: Remove old BounceHandler DB config]] (duration: 15m 28s)
[14:34:37] ops-eqiad, SRE, DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390709#10699417 (phaultfinder)
[14:35:13] (CR) JHathaway: [C:+2] apt::package_from_component: unique sources list [puppet] - https://gerrit.wikimedia.org/r/1130195 (https://phabricator.wikimedia.org/T388388) (owner: JHathaway)
[14:36:01] (CR) Alexandros Kosiaris: [C:+2] wikifunctions: Remove ports from httproutes [deployment-charts] - https://gerrit.wikimedia.org/r/1133161 (https://phabricator.wikimedia.org/T384944) (owner: Alexandros Kosiaris)
[14:36:20] (PS1) Kamila Součková: Revert "profile::kubernetes::client: install kubectl 1.31" [puppet] - https://gerrit.wikimedia.org/r/1133162
[14:36:32] (PS2) Kamila Součková: Revert "profile::kubernetes::client: install kubectl 1.31" [puppet] - https://gerrit.wikimedia.org/r/1133162
[14:36:58] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2055 to cirrussearch2055 - bking@cumin2002"
[14:37:18] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Renaming elastic2055 to cirrussearch2055 - bking@cumin2002"
[14:37:18] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:37:19] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2055
[14:37:28] (Merged) jenkins-bot: wikifunctions: Remove ports from httproutes [deployment-charts] - https://gerrit.wikimedia.org/r/1133161 (https://phabricator.wikimedia.org/T384944) (owner: Alexandros Kosiaris)
[14:37:32] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2055
[14:37:39] (PS5) Federico Ceratto: upgrade.py: Depool, repool, update Phabricator [cookbooks] - https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805)
[14:38:13] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.rename (exit_code=0) from elastic2055 to cirrussearch2055
[14:40:25] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2055.codfw.wmnet with OS bullseye
[14:40:37] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2055
[14:40:47] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply
[14:40:53] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply
[14:41:09] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply
[14:41:25] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply
[14:41:49] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:42:32] !log bking@cumin2002 START - Cookbook sre.dns.netbox
[14:44:25] (CR) CI reject: [V:-1] upgrade.py: Depool, repool, update Phabricator [cookbooks] - https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: Federico Ceratto)
[14:45:30] (PS2) Scott French: sessionstore-resources: add
SessionStoreDiskSpaceUtilizationTooHigh [alerts] - https://gerrit.wikimedia.org/r/1132775 (https://phabricator.wikimedia.org/T390630)
[14:46:18] (CR) Scott French: "Thank you both for the review!" [alerts] - https://gerrit.wikimedia.org/r/1132775 (https://phabricator.wikimedia.org/T390630) (owner: Scott French)
[14:46:43] !log bking@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2055 - bking@cumin2002"
[14:46:48] !log bking@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host cirrussearch2055 - bking@cumin2002"
[14:46:48] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[14:46:49] !log bking@cumin2002 START - Cookbook sre.dns.wipe-cache cirrussearch2055.codfw.wmnet 180.0.192.10.in-addr.arpa 0.8.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:46:52] !log bking@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cirrussearch2055.codfw.wmnet 180.0.192.10.in-addr.arpa 0.8.1.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors
[14:46:53] !log bking@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host cirrussearch2055
[14:47:11] !log bking@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cirrussearch2055
[14:47:11] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2055
[14:48:00] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host maps-test2006.codfw.wmnet with OS bookworm
[14:48:16] SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10699475 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by
jmm@cumin2002 for host maps-test2006.codfw.wmnet with OS bookworm completed: - maps-test2006 (**PASS**)...
[14:49:25] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet
[14:49:28] (PS1) Superpes15: Close pihwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1133164 (https://phabricator.wikimedia.org/T390732)
[14:50:02] !log depooled cp7001 to test secure removal of unused certificates (T384227)
[14:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:50:04] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227
[14:50:49] (CR) Lucas Werkmeister (WMDE): [C:+1] "Looks good to me, though I wouldn’t mind switching `query-main` and `query-scholarly` first (to check that everything works fine) before c" [puppet] - https://gerrit.wikimedia.org/r/1133120 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[14:51:39] !log sukhe@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp2035.codfw.wmnet [reason: finished T390658]
[14:51:42] T390658: Unresponsive management for cp2035.mgmt:22 - https://phabricator.wikimedia.org/T390658
[14:52:25] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[14:53:53] SRE, Infrastructure-Foundations, netops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10699524 (xcollazo) Ok attempting the below query again now: >>! In T390623#10699223, @xcollazo wrote: >...
[14:54:30] ops-codfw, SRE, DBA, DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10699526 (Jhancock.wm) @Marostegui @Papaul found a 1.92TB intel disk
[14:58:44] ops-codfw, SRE, DBA, DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10699560 (Marostegui) >>! In T388684#10699526, @Jhancock.wm wrote: > @Marostegui @Papaul found a 1.92TB intel disk I've marked the disk as bad: ` root@db2243:~# sudo megacli -...
[15:00:05] jelto, arnoldokoth, and mutante: Your horoscope predicts another SRE Collaboration Services office hours deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T1500).
[15:00:39] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10699565 (phaultfinder)
[15:00:56] (PS2) Superpes15: Close pihwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1133164 (https://phabricator.wikimedia.org/T390732)
[15:01:57] ops-codfw, SRE, DBA, DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10699571 (Jhancock.wm) @Marostegui replaced disk 4
[15:01:58] (CR) Volans: "I was asked to do a pass. In general looks good to me. I've left mostly some questions and minor suggestions inline."
[cookbooks] - https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: Federico Ceratto)
[15:01:58] (Abandoned) Kamila Součková: Revert "profile::kubernetes::client: install kubectl 1.31" [puppet] - https://gerrit.wikimedia.org/r/1133162 (owner: Kamila Součková)
[15:02:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[15:03:47] (CR) Marostegui: upgrade.py: Depool, repool, update Phabricator (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1130644 (https://phabricator.wikimedia.org/T389805) (owner: Federico Ceratto)
[15:04:45] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab2002.codfw.wmnet with reason: phabricator deploy
[15:05:06] !log dzahn@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on phab1004.eqiad.wmnet with reason: phabricator deploy
[15:05:59] ops-codfw, SRE, DBA, DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10699584 (Marostegui) Looks like it is rebuilding! ` root@db2243:/home/marostegui# sudo megacli -PDRbld -ShowProg -PhysDrv [252:4] -a0 Rebuild Progress on Device at Enclosure...
[15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:07:02] SRE, SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10699598 (Ladsgroup) I made a graph to quickly see which backend will alert next: https://grafana.wikimedia.org/d/000000378/ladsgroup...
[15:08:37] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:08:58] !log brennen@deploy1003 Started deploy [phabricator/deployment@53fcaf8]: test deploy phab2002 for T390737
[15:09:00] T390737: Deploy Phabricator/Phorge 2025-04-01 - https://phabricator.wikimedia.org/T390737
[15:09:37] !log brennen@deploy1003 Finished deploy [phabricator/deployment@53fcaf8]: test deploy phab2002 for T390737 (duration: 00m 39s)
[15:10:30] (CR) BCornwall: [C:+1] Add nup to langlist helper [dns] - https://gerrit.wikimedia.org/r/1133136 (https://phabricator.wikimedia.org/T390384) (owner: Gerrit maintenance bot)
[15:10:38] !log brennen@deploy1003 Started deploy [phabricator/deployment@53fcaf8]: deploy phab1004 for T390737
[15:11:07] (CR) BCornwall: [C:+1] Fixed tabs to spaces. [dns] - https://gerrit.wikimedia.org/r/1131978 (owner: SCherukuwada)
[15:11:14] !log brennen@deploy1003 Finished deploy [phabricator/deployment@53fcaf8]: deploy phab1004 for T390737 (duration: 00m 36s)
[15:12:51] (CR) Ssingh: "We should check why the CI never detected this (Will do that.)" [dns] - https://gerrit.wikimedia.org/r/1131978 (owner: SCherukuwada)
[15:15:09] SRE, Infrastructure-Foundations, netops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10699685 (xcollazo) I've succesfully run the following query: >>! In T390623#10699616, @xcollazo wrote: >...
[15:16:07] SRE, Infrastructure-Foundations, netops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10699693 (xcollazo) >We only have these stats for some of the presto hosts, which are those in rows E and...
[15:18:53] !log elukey@cumin1002 START - Cookbook sre.hosts.provision for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:19:18] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host wikikube-worker2331.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART
[15:19:45] ops-codfw, SRE, DC-Ops: Unresponsive management for cp2035.mgmt:22 - https://phabricator.wikimedia.org/T390658#10699727 (Dzahn) @fgiunchedi thanks! looks good to me:) Later, let's do the same for tickets generated for failed systemd timers.
[15:20:01] (CR) Kamila Součková: [C:+1] "It appears that the Growth team has only one receiver. In principle this could go elsewhere, but I am not sure whether it makes sense to c" [puppet] - https://gerrit.wikimedia.org/r/1132673 (https://phabricator.wikimedia.org/T385782) (owner: Clément Goubert)
[15:20:35] <_joe_> jouncebot: nowandnext
[15:20:35] For the next 0 hour(s) and 39 minute(s): SRE Collaboration Services office hours (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T1500)
[15:20:35] In 0 hour(s) and 39 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T1600)
[15:20:36] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10699729 (elukey) @Jhancock.wm the server is provisioned, please go ahead!
[15:22:02] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[15:22:21] (PS3) DCausse: cirrus: explicitly route search traffic to codfw [mediawiki-config] - https://gerrit.wikimedia.org/r/1129182 (https://phabricator.wikimedia.org/T388610)
[15:22:21] (PS3) DCausse: cirrus: switch search traffic back to multi-DC [mediawiki-config] - https://gerrit.wikimedia.org/r/1129183 (https://phabricator.wikimedia.org/T388610)
[15:23:35] (CR) DCausse: [C:-1] "should be ready once we start upgrading eqiad" [mediawiki-config] - https://gerrit.wikimedia.org/r/1129182 (https://phabricator.wikimedia.org/T388610) (owner: DCausse)
[15:23:43] ops-codfw, SRE, DBA, DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10699736 (elukey) @Papaul @Jhancock.wm should we do the same test on ms-be2088 to see if the controller picks up the old JBOD config without a controller restart?
[15:24:13] (CR) Dzahn: [V:+1 C:+2] "https://meta.wikimedia.org/wiki/Requests_for_new_languages/Wikipedia_Nupe" [dns] - https://gerrit.wikimedia.org/r/1133136 (https://phabricator.wikimedia.org/T390384) (owner: Gerrit maintenance bot)
[15:24:50] !log dzahn@dns1004 START - running authdns-update
[15:25:40] !log DNS - new project language 'nup' - Nupe (also known as Anufe, Nupenci, Nyinfe, and Tapa[3]) is a Volta–Niger language of the Nupoid branch primarily spoken by the Nupe people of the North Central region of Nigeria.
[15:25:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:25:56] s2 may page
[15:25:57] we are on it
[15:26:09] ok, thanks
[15:26:09] ack, thx
[15:27:06] !log dzahn@dns1004 END - running authdns-update
[15:27:24] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:10:00 on 27 hosts with reason: Maintenance in s2
[15:27:25] (CR) DCausse: [C:+1] "thanks!"
[mediawiki-config] - https://gerrit.wikimedia.org/r/1133081 (owner: Hashar)
[15:34:35] ops-eqiad, SRE, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390682#10699814 (phaultfinder)
[15:35:42] ops-eqiad, DC-Ops: Alert for device ps1-a2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390749 (phaultfinder) NEW
[15:35:43] ops-eqiad, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390750 (phaultfinder) NEW
[15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[15:36:50] (CR) Clément Goubert: "I added another receiver in case they had alerts between the two severities they didn't want in slack, happy to change it if that's not th" [puppet] - https://gerrit.wikimedia.org/r/1132673 (https://phabricator.wikimedia.org/T385782) (owner: Clément Goubert)
[15:36:59] (PS19) Giuseppe Lavagetto: Add a mediawiki-common release to mw-script [deployment-charts] - https://gerrit.wikimedia.org/r/1117548
[15:37:41] (PS3) Scott French: sessionstore-resources: add SessionStoreDiskSpaceUtilizationTooHigh [alerts] - https://gerrit.wikimedia.org/r/1132775 (https://phabricator.wikimedia.org/T390630)
[15:39:16] SRE, Ganeti, Infrastructure-Foundations: Update Ganeti servers in ulsfo to Bookworm - https://phabricator.wikimedia.org/T382511#10699851 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff All done!
[15:40:01] (PS1) Elukey: sre.hosts.provision: add a warning for ipv6 disabled [cookbooks] - https://gerrit.wikimedia.org/r/1133171 (https://phabricator.wikimedia.org/T389950)
[15:40:25] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:41:24] (CR) Herron: [C:+1] hieradata: move k8s prometheus1006 -> 1008 [puppet] - https://gerrit.wikimedia.org/r/1131302 (https://phabricator.wikimedia.org/T383232) (owner: Filippo Giunchedi)
[15:43:02] (CR) Kamila Součková: [C:+1] mw::periodic_jobs: Migrate deleteOldSurveys [puppet] - https://gerrit.wikimedia.org/r/1132674 (https://phabricator.wikimedia.org/T385782) (owner: Clément Goubert)
[15:43:35] SRE, SRE-Unowned, Maps, Patch-For-Review: Move maps servers to Bookworm - https://phabricator.wikimedia.org/T381565#10699865 (MoritzMuehlenhoff) Status update: The postgres setup is now properly working with Postgres 15. However, it turned out that the old Ganeti servers we re-use as the maps/bo...
[15:43:54] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10699869 (RobH) Case 01043199 > Support, > > We recently rolled some OS upgrades to our routers and during that, one of the optics on our cross...
[15:44:02] ops-ulsfo, SRE, DC-Ops, Infrastructure-Foundations, netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10699870 (RobH) a:cmooney→RobH
[15:44:33] (CR) Giuseppe Lavagetto: [C:+2] Add a mediawiki-common release to mw-script [deployment-charts] - https://gerrit.wikimedia.org/r/1117548 (owner: Giuseppe Lavagetto)
[15:45:06] !log removing et-0/0/0 from ae0 bundle on cr3-ulsfo and cr4-ulsfo T390731
[15:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:45:09] T390731: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731
[15:45:24] (CR) Elukey: "Need to test it via test-cookbook :)" [cookbooks] - https://gerrit.wikimedia.org/r/1133171 (https://phabricator.wikimedia.org/T389950) (owner: Elukey)
[15:45:59] (Merged) jenkins-bot: Add a mediawiki-common release to mw-script [deployment-charts] - https://gerrit.wikimedia.org/r/1117548 (owner: Giuseppe Lavagetto)
[15:48:05] (PS1) Pppery: Update translation [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1133172
[15:48:17] (PS2) Pppery: Update translations [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1133172
[15:52:43] ops-codfw, SRE, DC-Ops, serviceops: Q3:rack/setup/install wikikube-worker2248-2331, wikikube-ctrl2004-2005 - https://phabricator.wikimedia.org/T384970#10699927 (elukey) >>! In T384970#10699729, @elukey wrote: > @Jhancock.wm the server is provisioned, please go ahead! Taking it back, I think that...
[15:54:15] (CR) Clément Goubert: "s/receiver/route/" [puppet] - https://gerrit.wikimedia.org/r/1132673 (https://phabricator.wikimedia.org/T385782) (owner: Clément Goubert)
[15:56:35] ops-eqiad, Data-Platform-SRE, DC-Ops: Q3: an-worker data volumes HDD upgrade tracking task - https://phabricator.wikimedia.org/T385485#10699949 (BTullis) I'll move this back to our parent board and put it into quartely goals.
[16:00:04] jhathaway and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T1600)
[16:00:05] No Gerrit patches in the queue for this window AFAICS.
[16:01:51] SRE, MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10699976 (Scott_French) The bot we suspected may have been the source of the high rate of sessionstore writes has largely stopped since a bit before 21:00 U...
[16:02:04] (PS1) Bking: site.pp: Fix elastic-related regexes [puppet] - https://gerrit.wikimedia.org/r/1133174 (https://phabricator.wikimedia.org/T380529)
[16:02:31] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1133174 (https://phabricator.wikimedia.org/T380529) (owner: Bking)
[16:02:49] (CR) Volans: [C:+1] "LGTM, sigh that it seems there is no way to check this. I'll try to have a look too."
[cookbooks] - https://gerrit.wikimedia.org/r/1133171 (https://phabricator.wikimedia.org/T389950) (owner: Elukey)
[16:04:42] (CR) Dzahn: [C:+1] site.pp: Fix elastic-related regexes [puppet] - https://gerrit.wikimedia.org/r/1133174 (https://phabricator.wikimedia.org/T380529) (owner: Bking)
[16:04:58] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2055.codfw.wmnet with OS bullseye
[16:05:14] SRE, Infrastructure-Foundations, netops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10699995 (cmooney) > This effectively moved 308GB from HDFS Datanodes, thru the routers, to Presto server...
[16:07:33] (CR) Bking: [C:+2] site.pp: Fix elastic-related regexes [puppet] - https://gerrit.wikimedia.org/r/1133174 (https://phabricator.wikimedia.org/T380529) (owner: Bking)
[16:10:16] (PS1) Brouberol: airflow-main: allow tasks to egress to the public druid [deployment-charts] - https://gerrit.wikimedia.org/r/1133175
[16:10:27] (CR) Elukey: "Probably there is something but from a quick look, I didn't find it.. So better to add some explicit warning while we search :D" [cookbooks] - https://gerrit.wikimedia.org/r/1133171 (https://phabricator.wikimedia.org/T389950) (owner: Elukey)
[16:10:31] (CR) Dzahn: [C:+1] miscweb: os-report: use puppetdb from external_services (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1131952 (https://phabricator.wikimedia.org/T350794) (owner: Jelto)
[16:10:38] (PS1) Herron: wip [puppet] - https://gerrit.wikimedia.org/r/1133176
[16:11:07] (CR) Mforns: [C:+1] "Not sure how this magic works, but the change makes sense to me."
[deployment-charts] - https://gerrit.wikimedia.org/r/1133175 (owner: Brouberol)
[16:12:31] (PS1) Bking: elastic/cirrussearch: use correct role name [puppet] - https://gerrit.wikimedia.org/r/1133177 (https://phabricator.wikimedia.org/T388610)
[16:12:44] (CR) CI reject: [V:-1] elastic/cirrussearch: use correct role name [puppet] - https://gerrit.wikimedia.org/r/1133177 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[16:12:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:et-0/0/0 (Core: cr4-ulsfo:et-0/0/0 {#1073}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown
[16:13:16] (CR) Dzahn: [C:+1] "matches https://docker-registry.wikimedia.org/repos/wmde/wikidata-query-builder/tags/" [deployment-charts] - https://gerrit.wikimedia.org/r/1133122 (https://phabricator.wikimedia.org/T350793) (owner: Jelto)
[16:14:13] (CR) Brouberol: [C:+2] airflow-main: allow tasks to egress to the public druid [deployment-charts] - https://gerrit.wikimedia.org/r/1133175 (owner: Brouberol)
[16:15:41] (PS2) Bking: elastic/cirrussearch: use correct role name [puppet] - https://gerrit.wikimedia.org/r/1133177 (https://phabricator.wikimedia.org/T388610)
[16:16:20] (PS3) Bking: elastic/cirrussearch: use correct role name [puppet] - https://gerrit.wikimedia.org/r/1133177 (https://phabricator.wikimedia.org/T388610)
[16:17:01] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply
[16:17:19] ops-codfw, SRE, DBA, DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10700046 (Marostegui) Almost there: ` root@db2243:/home/marostegui# sudo megacli -PDRbld -ShowProg -PhysDrv [252:4] -a0 Rebuild Progress on Device at Enclosure 252, Slot 4 Com...
[16:17:34] !log brouberol@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply
[16:17:52] (CR) Dzahn: "soo... we need IPs in netbox, like these: https://netbox.wikimedia.org/search/?q=k8s-ingress" [dns] - https://gerrit.wikimedia.org/r/1132699 (owner: Dzahn)
[16:18:57] (PS3) Herron: service: add k8s-ingress-aux-(ro|rw) discovery entries [puppet] - https://gerrit.wikimedia.org/r/1133176 (https://phabricator.wikimedia.org/T381417)
[16:18:58] (CR) Dzahn: "there already is k8s-ingress-aux.svc.codfw.wmnet and k8s-ingress-aux.svc.eqiad.wmnet but here I was trying to add -ro and -rw, copying f" [dns] - https://gerrit.wikimedia.org/r/1132699 (owner: Dzahn)
[16:19:34] (CR) Dzahn: [C:+1] elastic/cirrussearch: use correct role name [puppet] - https://gerrit.wikimedia.org/r/1133177 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[16:20:57] (CR) Bking: [C:+2] elastic/cirrussearch: use correct role name [puppet] - https://gerrit.wikimedia.org/r/1133177 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[16:22:19] (CR) Herron: "just uploaded Ia7debabf8ac7ce4c20fb572f1f231b19d7c562bb to help sort out the ro/rw discovery. I'm assuming these were omitted so far becau" [dns] - https://gerrit.wikimedia.org/r/1132699 (owner: Dzahn)
[16:22:22] SRE, Infrastructure-Foundations, netops, Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10700068 (cmooney) FWIW the largest potential bottleneck in Ashburn are on the 10G interfaces (names star...
[16:22:30] FIRING: Primary outbound port utilisation over 80% #page: Alert for device asw2-c-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[16:22:31] FIRING: Primary inbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[16:22:55] o/
[16:22:57] uhm here
[16:23:12] perhaps same as yesterday, looking
[16:23:30] !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host cirrussearch2055.codfw.wmnet with OS bullseye
[16:23:35] !log bking@cumin2002 START - Cookbook sre.hosts.move-vlan for host cirrussearch2055
[16:23:35] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.move-vlan (exit_code=0) for host cirrussearch2055
[16:23:42] yea, was about to say, looking at that bookmarked dashboard
[16:25:03] ops-codfw, SRE, DBA, DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10700073 (Marostegui) Open→Resolved RAID rebuilt finished and the RAID is optimal: ` root@db2243:/home/marostegui# ./storcli64 /c0 show Generating detailed summary...
[16:25:25] !incidents
[16:25:25] 5923 (ACKED) Primary outbound port utilisation over 80% (paged) network noc (asw2-c-eqiad.mgmt.eqiad.wmnet)
[16:25:25] 5924 (ACKED) Primary inbound port utilisation over 80% (paged) network noc (cr1-eqiad.wikimedia.org)
[16:25:26] 5922 (RESOLVED) ProbeDown sre (10.64.0.107 ip4 aux-k8s-ctrl1002:6443 probes/custom http_aux_k8s_eqiad_kube_apiserver_ip4 eqiad)
[16:25:26] 5921 (RESOLVED) ProbeDown sre (10.2.2.68 ip4 shellbox-video:4080 probes/service http_shellbox-video_ip4 eqiad)
[16:25:26] 5919 (RESOLVED) Primary inbound port utilisation over 80% (paged) network noc (cr1-eqiad.wikimedia.org)
[16:25:26] 5920 (RESOLVED) Primary outbound port utilisation over 80% (paged) network noc (asw2-a-eqiad.mgmt.eqiad.wmnet)
[16:26:03] librenms shows spike is already down again. and it's not out of the ordinary in the last 7 days
[16:26:13] so back to that threshold question?
[16:26:46] yeah I think so
[16:27:17] 10GB port, every once in a while there is a short 8GB spike, but over 5 and this happens
[16:27:31] FIRING: [3x] Primary outbound port utilisation over 80% #page: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[16:27:31] FIRING: [2x] Primary inbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page
[16:28:09] (PS1) Andrew Bogott: profile::wmcs::nfs::standalone: remove profile::openstack::eqiad1::observerenv [puppet] - https://gerrit.wikimedia.org/r/1133180 (https://phabricator.wikimedia.org/T390726)
[16:31:00] (PS1) Giuseppe Lavagetto: mw-script: fix flags definition [deployment-charts] - https://gerrit.wikimedia.org/r/1133181
[16:31:35] (CR) RLazarus: [C:+1] mw-script: fix flags definition
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1133181 (owner: 10Giuseppe Lavagetto) [16:31:52] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10700096 (10Jhancock.wm) @MatthewVernon I need more clarification on which vlan this server should go on. I don't have any other server examples and the only other apus ips i can... [16:32:31] FIRING: [3x] Primary outbound port utilisation over 80% #page: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:32:50] (03CR) 10Ssingh: [C:03+1] "Thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1133135 (https://phabricator.wikimedia.org/T205378) (owner: 10Muehlenhoff) [16:34:08] (03CR) 10Giuseppe Lavagetto: [C:03+2] mw-script: fix flags definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133181 (owner: 10Giuseppe Lavagetto) [16:35:30] (03CR) 10Dzahn: "ah!:) but then we'd have to do netbox fixes first, if we want ro and rw" [dns] - 10https://gerrit.wikimedia.org/r/1132699 (owner: 10Dzahn) [16:35:35] (03Merged) 10jenkins-bot: mw-script: fix flags definition [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133181 (owner: 10Giuseppe Lavagetto) [16:35:44] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390749#10700103 (10phaultfinder) [16:47:07] I'm working on a security patch to fix a potential low-risk CSRF vulnerability. I can't figure out what's wrong with the patch. Any hero around who would like to test it and see if it works for them? 
https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1133178 [16:47:31] FIRING: [3x] Primary outbound port utilisation over 80% #page: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:52:31] FIRING: [3x] Primary outbound port utilisation over 80% #page: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:54:55] (03CR) 10Andrew Bogott: [C:03+2] profile::wmcs::nfs::standalone: remove profile::openstack::eqiad1::observerenv [puppet] - 10https://gerrit.wikimedia.org/r/1133180 (https://phabricator.wikimedia.org/T390726) (owner: 10Andrew Bogott) [16:57:31] FIRING: [2x] Primary outbound port utilisation over 80% #page: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [16:57:31] RESOLVED: [2x] Primary inbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T1700) [17:01:43] 06SRE, 06MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10700270 (10Eevans) I sampled 100k keys at random. Here are the top 25 + central auth: ` enwiki: 33229 (33.23%) commonswiki: 7260 (7.26%) metawiki: 6721 (6.... 
[17:02:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [17:08:42] (03CR) 10Dzahn: "scratch that last comment about netbox! nevermind" [dns] - 10https://gerrit.wikimedia.org/r/1132699 (owner: 10Dzahn) [17:11:39] (03CR) 10Dzahn: service: add k8s-ingress-aux-(ro|rw) discovery entries (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133176 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [17:12:39] (03CR) 10Dzahn: [C:03+1] "after more IRC chat, I think we want to go with -ro and -rw right away because sooner or later we will have services on aux that need acti" [puppet] - 10https://gerrit.wikimedia.org/r/1133176 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [17:12:51] RESOLVED: [2x] CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:et-0/0/0 (Core: cr4-ulsfo:et-0/0/0 {#1073}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:13:26] (03CR) 10Herron: [C:03+2] service: add k8s-ingress-aux-(ro|rw) discovery entries [puppet] - 10https://gerrit.wikimedia.org/r/1133176 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [17:13:31] FIRING: Primary outbound port utilisation over 80% #page: Alert for device asw2-a-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [17:14:35] o/, same issue again [17:15:02] now: 4.7 [17:15:51] (03CR) 10Subramanya Sastry: Enable Parsoid Read Views on 13 wiktionaries (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133113 (https://phabricator.wikimedia.org/T390680) (owner: 10Isabelle Hurbain-Palatin) [17:17:30] FIRING: Primary 
inbound port utilisation over 80% #page: Alert for device cr1-eqiad.wikimedia.org - Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [17:18:21] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr3-eqsin:ae2 (External: SingTel) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:18:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device asw2-a-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page [17:21:41] FIRING: [2x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-ro.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:22:16] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10700322 (10MatthewVernon) @Jhancock.wm it should be networked like moss-fe2001 and moss-fe2002 (apus-* are the new names, moss-* will gradually get cycled out). 
[17:22:30] RESOLVED: Primary inbound port utilisation over 80% #page: Device cr1-eqiad.wikimedia.org recovered from Primary inbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+inbound+port+utilisation+over+80%25++%23page [17:22:51] !log importing varnish 7.1.1-1.1~bpo11+wmf1 into bullseye-wikimedia main (T378737) [17:22:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:54] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [17:23:13] !log repool cp7001, no certs removed (T384227) [17:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:15] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227 [17:23:18] (03PS1) 10Ebernhardson: tlsproxy: Reload nginx when on-disk and served cert don't match [puppet] - 10https://gerrit.wikimedia.org/r/1133187 (https://phabricator.wikimedia.org/T390599) [17:23:19] !log importing varnish-modules 0.20.0-2~bpo11 into bullseye-wikimedia main (T378737) [17:23:21] RESOLVED: [4x] CoreRouterInterfaceDown: Core router interface down - cr3-eqsin:ae2 (External: SingTel) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [17:23:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:25] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [17:23:35] 06SRE, 06MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10700334 (10Scott_French) I've looked at the rate of POST requests to sessionstore over the last 30d, aggregated across DCs so we can ignore the effect of the... 
[17:23:41] (03CR) 10CI reject: [V:04-1] tlsproxy: Reload nginx when on-disk and served cert don't match [puppet] - 10https://gerrit.wikimedia.org/r/1133187 (https://phabricator.wikimedia.org/T390599) (owner: 10Ebernhardson) [17:24:18] !log importing libvmod-netmapper 1.9.1-1 into bullseye-wikimedia main (T378737) [17:24:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:46] !log importing libvmod-querysort 0.4-3 into bullseye-wikimedia main (T378737) [17:24:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:56] (03Abandoned) 10Arlolra: Enable Parsoid read views for a few wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1123487 (https://phabricator.wikimedia.org/T356718) (owner: 10Arlolra) [17:25:18] !log importing libvmod-re2/varnish-re2 2.0.0-2~bpo11+wmf2 into bullseye-wikimedia main (T378737) [17:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:36] (03PS2) 10Umherirrender: Improve function and property documentation for php code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) [17:25:49] !log importing varnishkafka 1.2.0-1 into bullseye-wikimedia main (T378737) [17:25:50] (03PS2) 10Ebernhardson: tlsproxy: Reload nginx when on-disk and served cert don't match [puppet] - 10https://gerrit.wikimedia.org/r/1133187 (https://phabricator.wikimedia.org/T390599) [17:25:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:26:41] FIRING: [12x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-ro.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:26:41] 10SRE-Access-Requests, 06Data-Engineering, 06Data-Platform-SRE: Requesting Kerberos access for ben.buchenau - 
https://phabricator.wikimedia.org/T390734#10700350 (10Ottomata) [17:26:55] 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 06Data-Platform-SRE: Requesting Kerberos access for ben.buchenau - https://phabricator.wikimedia.org/T390734#10700353 (10Ottomata) [17:27:57] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10700358 (10cmooney) p:05High→03Low Looks like remote hands replaced the module. ` cmooney@cr4-ulsfo> show log messages | match qsfp Apr 1 17:1... [17:28:04] (03CR) 10CI reject: [V:04-1] tlsproxy: Reload nginx when on-disk and served cert don't match [puppet] - 10https://gerrit.wikimedia.org/r/1133187 (https://phabricator.wikimedia.org/T390599) (owner: 10Ebernhardson) [17:29:22] (03PS3) 10Ebernhardson: tlsproxy: Reload nginx when on-disk and served cert don't match [puppet] - 10https://gerrit.wikimedia.org/r/1133187 (https://phabricator.wikimedia.org/T390599) [17:29:37] (03PS1) 10Ssingh: hiera: acme_chief: add wikimedia-ech.org [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) [17:29:40] (03PS1) 10Bking: cirrussearch: don't include ipip LVS profile [puppet] - 10https://gerrit.wikimedia.org/r/1133191 (https://phabricator.wikimedia.org/T388610) [17:30:07] 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 06Data-Platform-SRE: Rebuild Spark images with Bookworm / bullseye-backports deprecation - https://phabricator.wikimedia.org/T390139#10700376 (10Ottomata) [17:30:24] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5187/co" [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [17:30:46] (03CR) 10Subramanya Sastry: [C:03+1] Enable Parsoid Read Views on 13 wiktionaries [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/1133113 (https://phabricator.wikimedia.org/T390680) (owner: 10Isabelle Hurbain-Palatin) [17:30:50] (03PS2) 10Ssingh: hiera: acme_chief: add wikimedia-ech.org [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) [17:30:59] (03CR) 10Subramanya Sastry: [C:03+1] Enable Parsoid Read Views to incubator and dagwiki mobile frontend [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133141 (https://phabricator.wikimedia.org/T380768) (owner: 10Isabelle Hurbain-Palatin) [17:31:00] (03CR) 10Dzahn: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [17:31:33] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5188/co" [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [17:31:52] (03PS1) 10Mforns: Bump up the Commons Impact Metrics service version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133192 (https://phabricator.wikimedia.org/T370470) [17:32:07] (03CR) 10Ssingh: hiera: acme_chief: add wikimedia-ech.org [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [17:33:33] (03CR) 10Ebernhardson: "I'm not 100% sure this is the right approach, but it seems plausible. Thoughts?" [puppet] - 10https://gerrit.wikimedia.org/r/1133187 (https://phabricator.wikimedia.org/T390599) (owner: 10Ebernhardson) [17:34:37] (03CR) 10Reedy: "Good to see gerrit can't see a simple rename..." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender) [17:34:57] (03PS4) 10Tiziano Fogli: auth_metrics: add recording rules for grafana widgets [puppet] - 10https://gerrit.wikimedia.org/r/1133170 (https://phabricator.wikimedia.org/T390672) [17:35:08] 06SRE, 06Infrastructure-Foundations, 10netops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10700403 (10xcollazo) Thanks for the pointers @cmooney. --------- Here are my heavy query results: First... [17:35:24] (03CR) 10Subramanya Sastry: "Should T356718 be tagged as well?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133113 (https://phabricator.wikimedia.org/T390680) (owner: 10Isabelle Hurbain-Palatin) [17:35:37] (03CR) 10Subramanya Sastry: [C:03+1] Enable Parsoid Read Views on 13 wiktionaries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133113 (https://phabricator.wikimedia.org/T390680) (owner: 10Isabelle Hurbain-Palatin) [17:35:40] (03CR) 10Dzahn: [C:03+1] "can't actually compile it yet because the compiler does not know this host yet.. but since only elastic2055 uses the role.. should be good" [puppet] - 10https://gerrit.wikimedia.org/r/1133191 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:35:51] (03CR) 10Ebernhardson: [C:03+1] "can confirm we haven't done the ipip migration yet" [puppet] - 10https://gerrit.wikimedia.org/r/1133191 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [17:36:41] FIRING: [14x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-ro.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:37:00] ^ arr.. well.. 
we know what merge is related [17:38:08] (03CR) 10Dzahn: [C:03+1] "+jinxer-wm> FIRING: [14x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-ro.state.toml has errors - https://" [puppet] - 10https://gerrit.wikimedia.org/r/1133176 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [17:38:30] (03PS1) 10BCornwall: varnish: Remove support for below version 7 [puppet] - 10https://gerrit.wikimedia.org/r/1132765 (https://phabricator.wikimedia.org/T378737) [17:38:37] (03CR) 10Ladsgroup: CommonSettings.php: Set virtual-bouncehandler domain mapping (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1126678 (owner: 10Reedy) [17:41:13] !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cirrussearch2055.codfw.wmnet with OS bullseye [17:41:41] FIRING: [20x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-ro.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:41:41] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10700433 (10Papaul) @Marostegui fyi we need to put back the original disk. [17:42:17] (03PS1) 10Herron: conftool-data: update k8s-ingress-aux [puppet] - 10https://gerrit.wikimedia.org/r/1133195 [17:42:22] confd issue is being worked on [17:42:31] 06SRE, 06Infrastructure-Foundations, 10netops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10700438 (10cmooney) > No one is yelling on IRC so I think I am happy with this. I am done from my side. O... 
[17:42:36] (03CR) 10Dzahn: [C:03+1] conftool-data: update k8s-ingress-aux [puppet] - 10https://gerrit.wikimedia.org/r/1133195 (owner: 10Herron) [17:43:27] (03PS2) 10Herron: conftool-data: update k8s-ingress-aux [puppet] - 10https://gerrit.wikimedia.org/r/1133195 (https://phabricator.wikimedia.org/T381417) [17:43:50] (03CR) 10Dzahn: conftool-data: update k8s-ingress-aux [puppet] - 10https://gerrit.wikimedia.org/r/1133195 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [17:44:03] (03CR) 10Herron: [C:03+2] conftool-data: update k8s-ingress-aux [puppet] - 10https://gerrit.wikimedia.org/r/1133195 (https://phabricator.wikimedia.org/T381417) (owner: 10Herron) [17:44:26] 06SRE, 06MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10700457 (10Eevans) >>! In T390514#10700334, @Scott_French wrote: > [ ... ] > 3. Has anything changed recently that might increase the TTL of sessions persist... [17:46:41] FIRING: [28x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-ro.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:47:07] mutante: ok thanks! [17:47:18] ping us if you need an extra pair of eyes. 
[17:47:28] maybe we have to delete those .err files [17:47:33] yes [17:47:37] but that last merge above should be the actual fix [17:47:45] https://wikitech.wikimedia.org/wiki/Confd#Stale_template_error_files_present [17:47:46] hopefully [17:48:14] !log herron@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-aux-ro,name=eqiad [17:48:28] !log herron@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-aux-ro,name=codfw [17:48:31] sees puppetmaster2001 in docs and immediately thinks "or puppetserver" [17:48:32] !log herron@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-aux-rw,name=codfw [17:48:36] !log herron@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-aux-rw,name=eqiad [17:49:11] [puppetmaster2001:/var/run/confd-template is empty [17:49:38] also empty on puppetserver1001 [17:49:45] I think the alert has cleared up [17:50:32] (03CR) 10Isabelle Hurbain-Palatin: "I don't think so. 
We're not deploying a fix for that, this config change is a consequence of having fixed it, but not more so than other b" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133113 (https://phabricator.wikimedia.org/T390680) (owner: 10Isabelle Hurbain-Palatin) [17:51:24] sukhe: I hope it did, just want to see the resolve [17:51:41] RESOLVED: [28x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-ro.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:52:41] yay:) [17:52:50] herron: looks good, thx [17:53:08] 06SRE, 06Infrastructure-Foundations, 10netops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10700483 (10cmooney) Also to get a sense of total throughput this graph is good: https://grafana.wikimedia... [17:53:50] 06SRE, 06MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10700485 (10Scott_French) >>! In T390514#10700457, @Eevans wrote: > [ ... ] > > I can answer this one; The TTL is determined by the service (sessionstore) co... [17:54:11] (03CR) 10Santiago Faci: [C:03+2] "Looks good!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1133192 (https://phabricator.wikimedia.org/T370470) (owner: 10Mforns) [17:55:43] (03Merged) 10jenkins-bot: Bump up the Commons Impact Metrics service version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133192 (https://phabricator.wikimedia.org/T370470) (owner: 10Mforns) [17:56:56] FIRING: [30x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-ro.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:58:03] 06SRE, 06Infrastructure-Foundations, 10netops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Add QoS markings to profile Hadoop/HDFS analytics traffic - https://phabricator.wikimedia.org/T381389#10700491 (10xcollazo) >>! In T381389#10700438, @cmooney wrote: >> No one is yelling on IRC so I think I am... [17:58:35] !log herron@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=k8s-ingress-aux-rw,name=codfw [17:59:57] (03PS1) 10Ayounsi: Add magru and eqsin RIPE Atlas Anchors to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1133196 (https://phabricator.wikimedia.org/T385560) [18:00:04] dancy and andre: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7+Utc-0 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T1800). 
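The conftool actions logged in this channel between 17:48 and 17:58 can be replayed to see which discovery records ended up pooled. This is a minimal sketch, assuming only the log-line format visible in those entries; the regex and the function name final_pooled_state are illustrative and not part of conftool.

```python
import re

# Pattern derived from the "!log ... conftool action" entries in this log.
# It is an assumption about the format, not an official conftool grammar.
ACTION_RE = re.compile(
    r"conftool action : set/pooled=(?P<pooled>\w+); "
    r"selector: dnsdisc=(?P<dnsdisc>[\w-]+),name=(?P<site>\w+)"
)

# Entries copied from the channel log, in the order they were logged.
LOG_LINES = [
    "[17:48:14] !log herron@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-aux-ro,name=eqiad",
    "[17:48:28] !log herron@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-aux-ro,name=codfw",
    "[17:48:32] !log herron@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-aux-rw,name=codfw",
    "[17:48:36] !log herron@puppetserver1001 conftool action : set/pooled=true; selector: dnsdisc=k8s-ingress-aux-rw,name=eqiad",
    "[17:58:35] !log herron@puppetserver1001 conftool action : set/pooled=false; selector: dnsdisc=k8s-ingress-aux-rw,name=codfw",
]


def final_pooled_state(lines):
    """Replay the logged actions; later entries overwrite earlier ones."""
    state = {}
    for line in lines:
        match = ACTION_RE.search(line)
        if match is None:
            continue  # not a dnsdisc pooled action, skip it
        state[(match["dnsdisc"], match["site"])] = match["pooled"] == "true"
    return state


if __name__ == "__main__":
    for (dnsdisc, site), pooled in sorted(final_pooled_state(LOG_LINES).items()):
        print(f"{dnsdisc:22} {site:6} pooled={pooled}")
```

Replayed this way, the entries leave -ro pooled in both eqiad and codfw and -rw pooled in eqiad only. This reflects only what was logged in the channel, not necessarily the live conftool state.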
[18:01:08] o/ [18:01:56] RESOLVED: [30x] ConfdResourceFailed: confd resource _var_lib_gdnsd_discovery-k8s-ingress-aux-ro.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [18:01:56] (03PS1) 10TrainBranchBot: group0 to 1.44.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133198 (https://phabricator.wikimedia.org/T386218) [18:01:57] (03CR) 10TrainBranchBot: [C:03+2] group0 to 1.44.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133198 (https://phabricator.wikimedia.org/T386218) (owner: 10TrainBranchBot) [18:02:49] (03Merged) 10jenkins-bot: group0 to 1.44.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133198 (https://phabricator.wikimedia.org/T386218) (owner: 10TrainBranchBot) [18:03:09] (03CR) 10Ayounsi: "Should show up on those dashboards once deployed:" [puppet] - 10https://gerrit.wikimedia.org/r/1133196 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [18:03:37] FIRING: [4x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:05:51] 06SRE, 06Data-Persistence, 13Patch-For-Review: Alert when disk space utilization on sessionstore nodes is too high - https://phabricator.wikimedia.org/T390630#10700537 (10Scott_French) p:05Triage→03High [18:07:02] 06SRE, 06Data-Persistence, 13Patch-For-Review: Alert when disk space utilization on sessionstore nodes is too high - https://phabricator.wikimedia.org/T390630#10700538 (10Scott_French) The alert patch is ready to go, and thanks to @urandom we now have a runbook. Since we do not yet understand the high rate o... [18:11:33] !log dancy@deploy1003 Testing. 
Disregard [18:13:38] (03CR) 10Ssingh: Add magru and eqsin RIPE Atlas Anchors to monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133196 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [18:14:48] 10ops-codfw, 06SRE, 06DBA, 06DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10700552 (10Marostegui) 05Resolved→03Open [18:15:05] (03PS2) 10Dzahn: Revert^2 "create k8s-ingress-aux -ro and -rw discovery records, metafo/geodns" [dns] - 10https://gerrit.wikimedia.org/r/1132699 [18:15:55] !log dancy@deploy1003 rebuilt and synchronized wikiversions files: group0 to 1.44.0-wmf.23 refs T386218 [18:15:57] T386218: 1.44.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T386218 [18:17:42] 10ops-codfw, 06DC-Ops: Power Supply - PS2 Status - issue on wikikube-worker2316:9290 - https://phabricator.wikimedia.org/T390769 (10phaultfinder) 03NEW [18:17:46] (03CR) 10Dzahn: [C:03+2] Revert^2 "create k8s-ingress-aux -ro and -rw discovery records, metafo/geodns" [dns] - 10https://gerrit.wikimedia.org/r/1132699 (owner: 10Dzahn) [18:17:58] !log dzahn@dns1004 START - running authdns-update [18:18:34] (03CR) 10Eevans: [C:03+1] sessionstore-resources: add SessionStoreDiskSpaceUtilizationTooHigh [alerts] - 10https://gerrit.wikimedia.org/r/1132775 (https://phabricator.wikimedia.org/T390630) (owner: 10Scott French) [18:19:36] !log mforns@deploy1003 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [18:19:45] !log mforns@deploy1003 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [18:20:20] !log dzahn@dns1004 END - running authdns-update [18:20:31] (03CR) 10Bking: [C:03+2] cirrussearch: don't include ipip LVS profile [puppet] - 10https://gerrit.wikimedia.org/r/1133191 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [18:20:51] sukhe: merged the revert-revert of that problematic DNS change from the other day. no errors this time. 
it was not true that it needed a netbox change. what it needed was servicecatalog/conftool data [18:21:14] (03PS2) 10Ayounsi: Add magru and eqsin RIPE Atlas Anchors to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1133196 (https://phabricator.wikimedia.org/T385560) [18:21:20] (03CR) 10Ayounsi: Add magru and eqsin RIPE Atlas Anchors to monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133196 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [18:21:24] mutante: that's interesting, thanks for fixing it. [18:21:49] mutante: but the DNS record exists now, on checking it? [18:22:20] sukhe: herron did https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133176 [18:23:12] (03CR) 10Ssingh: [C:03+1] Add magru and eqsin RIPE Atlas Anchors to monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1133196 (https://phabricator.wikimedia.org/T385560) (owner: 10Ayounsi) [18:23:29] sukhe: k8s-ingress-aux-ro.discovery.wmnet has address 10.2.2.78 [18:23:36] it existed in netbox [18:23:38] ah! [18:23:41] but with -ro and -rw [18:23:45] :] [18:23:50] nicely done! [18:23:53] then we had a discussion if we need -ro/-rw or not [18:23:58] we have it now [18:24:05] sorry about the initial confusion from my end. I was looking up the literal record [18:24:09] that should mean the aux cluster can also have active/passive services [18:24:12] same here [18:24:18] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): cloudelastic1008 stuck at boot screen after multiple reboots, SEL reports Comm Error: Backplane 0 - https://phabricator.wikimedia.org/T388150#10700610 (10Jclark-ctr) 05In progress→03Resolved [18:24:26] sukhe: thanks for your help! 
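The record check discussed above (looking up the -ro name rather than the bare one) can be sketched as a small helper. This is an illustration only: check_record and its parameters are hypothetical names, the resolver argument exists so the helper can be exercised without network access, and *.discovery.wmnet names are assumed to resolve only from inside the production network.

```python
import socket


def check_record(name, expected=None, resolver=socket.gethostbyname):
    """Resolve name and return (ok, address).

    ok is False when the lookup fails, or when an expected address was
    given and the resolved address differs from it.
    """
    try:
        address = resolver(name)
    except OSError:  # socket.gaierror is a subclass of OSError
        return False, None
    return (expected is None or address == expected), address


# Hypothetical usage from a host that can resolve internal names:
#   check_record("k8s-ingress-aux-ro.discovery.wmnet", expected="10.2.2.78")
# The name and address are the ones quoted in the channel at 18:23:29.
```

Keeping the resolver injectable is a design choice for testability; a real check would more likely query the authoritative DNS servers directly.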
[18:24:34] (CR) Ayounsi: [C:+2] Add magru and eqsin RIPE Atlas Anchors to monitoring [puppet] - https://gerrit.wikimedia.org/r/1133196 (https://phabricator.wikimedia.org/T385560) (owner: Ayounsi)
[18:25:04] !log bking@cumin2002 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for cirrussearch2055.eqiad.wmnet: Renew puppet certificate - bking@cumin2002
[18:25:16] mutante: <3
[18:25:29] !log bking@cumin2002 DONE (FAIL) - Cookbook sre.puppet.renew-cert (exit_code=99) for cirrussearch2055.eqiad.wmnet: Renew puppet certificate - bking@cumin2002
[18:25:39] ops-eqiad, SRE, DC-Ops: OutboundInterfaceErrors - https://phabricator.wikimedia.org/T389992#10700615 (Jclark-ctr) Open→Resolved a:Jclark-ctr no errors at this time
[18:28:48] (PS3) Ssingh: P:durum: add conditional to enable ECH [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378)
[18:29:01] (CR) Umherirrender: Improve function and property documentation for php code (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: Umherirrender)
[18:29:36] ops-eqiad, SRE, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390682#10700657 (phaultfinder)
[18:30:04] (CR) JHathaway: [C:+1] sessionstore-resources: add SessionStoreDiskSpaceUtilizationTooHigh [alerts] - https://gerrit.wikimedia.org/r/1132775 (https://phabricator.wikimedia.org/T390630) (owner: Scott French)
[18:30:07] (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[18:30:53] (PS4) Ssingh: P:durum: add conditional to enable ECH [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378)
[18:31:47] (CR) Scott French: "Thank you all for the review!" [alerts] - https://gerrit.wikimedia.org/r/1132775 (https://phabricator.wikimedia.org/T390630) (owner: Scott French)
[18:31:54] (CR) Ssingh: [V:+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1132669 (https://phabricator.wikimedia.org/T205378) (owner: Ssingh)
[18:32:12] (CR) Scott French: [C:+2] sessionstore-resources: add SessionStoreDiskSpaceUtilizationTooHigh [alerts] - https://gerrit.wikimedia.org/r/1132775 (https://phabricator.wikimedia.org/T390630) (owner: Scott French)
[18:33:25] (Merged) jenkins-bot: sessionstore-resources: add SessionStoreDiskSpaceUtilizationTooHigh [alerts] - https://gerrit.wikimedia.org/r/1132775 (https://phabricator.wikimedia.org/T390630) (owner: Scott French)
[18:41:36] (PS1) Bking: cirrussearch: use correct filename for opensearch config [puppet] - https://gerrit.wikimedia.org/r/1133211 (https://phabricator.wikimedia.org/T388610)
[18:44:02] (CR) Ebernhardson: [C:+1] "matches the role name" [puppet] - https://gerrit.wikimedia.org/r/1133211 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[18:45:00] (CR) Bking: [C:+2] cirrussearch: use correct filename for opensearch config [puppet] - https://gerrit.wikimedia.org/r/1133211 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[18:48:52] (PS2) Ebernhardson: envoy: Add service proxys for cirrussearch read traffic [puppet] - https://gerrit.wikimedia.org/r/838182 (https://phabricator.wikimedia.org/T143553)
[18:48:52] (CR) Ebernhardson: envoy: Add service proxys for cirrussearch read traffic (1 comment) [puppet] - https://gerrit.wikimedia.org/r/838182 (https://phabricator.wikimedia.org/T143553) (owner: Ebernhardson)
[18:49:36] (CR) JHathaway: [C:+1] Hiera: enable deep merge lookup option for abuse_networks [puppet] - https://gerrit.wikimedia.org/r/1128859 (https://phabricator.wikimedia.org/T389181) (owner: Hashar)
[18:49:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[18:51:38] (CR) Ssingh: [C:+1] "Looks good, nice work!" [puppet] - https://gerrit.wikimedia.org/r/1132765 (https://phabricator.wikimedia.org/T378737) (owner: BCornwall)
[18:53:35] FIRING: [8x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[18:57:30] FIRING: Primary outbound port utilisation over 80% #page: Alert for device asw2-d-eqiad.mgmt.eqiad.wmnet - Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[18:57:45] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10700755 (Jclark-ctr) Phase, BA:L2-L3, Active Power was port alerting rebalanced pdu ports
[19:02:31] RESOLVED: Primary outbound port utilisation over 80% #page: Device asw2-d-eqiad.mgmt.eqiad.wmnet recovered from Primary outbound port utilisation over 80% #page - https://alerts.wikimedia.org/?q=alertname%3DPrimary+outbound+port+utilisation+over+80%25++%23page
[19:03:07] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390535#10700767 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[19:04:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[19:07:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[19:08:37] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:09:08] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390750#10700775 (Jclark-ctr) Cord, Link1_Cord_A, Active Power over power rebalanced pdu to AB cord
[19:09:19] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390750#10700776 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[19:10:31] ops-eqiad, SRE, DC-Ops: Alert for device ps1-a2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390749#10700790 (Jclark-ctr) #1: Phase, BA:L2-L3, Active Power; Value: 1403 (power) high: 1400
[19:11:21] ops-eqiad, SRE, DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390709#10700794 (Jclark-ctr) #1: Cord, Master_Cord_A, Active Power; Value: 3466 (power) high: 3440 #2: Cord, Link1_Cord_A, Active Power; Value: 3480 (power) high: 3440
[19:11:58] ops-eqiad, SRE, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390682#10700796 (Jclark-ctr) #1: Cord, Master_Cord_A, Active Power; Value: 3477 (power) high: 3440
[19:14:39] ops-eqiad, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390775 (phaultfinder) NEW
[19:16:21] ops-eqiad, SRE, DC-Ops: Alert for device ps1-a2-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390749#10700812 (Jclark-ctr) Open→Resolved a:Jclark-ctr Rebalanced ports
[19:16:57] ops-eqiad, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390775#10700817 (Jclark-ctr) ps1-b4-eqiad.mgmt.eqiad.wmnet #1: Phase, BA:L2-L3, Active Power; Value: 1411 (power) high: 1400
[19:17:23] (CR) Andrew Bogott: [C:+1] P:ldaptui LDAP Terminal UI [puppet] - https://gerrit.wikimedia.org/r/1130071 (owner: Slyngshede)
[19:18:52] ops-eqiad, DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390775#10700820 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[19:19:25] (PS1) Bking: cirrussearch: set correct cluster name for LVS [puppet] - https://gerrit.wikimedia.org/r/1133220 (https://phabricator.wikimedia.org/T388610)
[19:23:13] ops-eqiad, SRE, DC-Ops: Alert for device ps1-a6-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390709#10700827 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[19:23:36] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox
[19:25:05] ops-eqiad, SRE, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390682#10700847 (Jclark-ctr) Open→Resolved a:Jclark-ctr
[19:25:13] (CR) Ebernhardson: [C:+1] "seems reasonable for now. I suppose we will have to plan some way to migrate this name in the future? Can certainly wait until later." [puppet] - https://gerrit.wikimedia.org/r/1133220 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[19:27:02] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[19:27:34] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding thanos-fe2005-7, ms-fe2015-6, and apus-fe2003 to codfw - jhancock@cumin2002"
[19:27:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding thanos-fe2005-7, ms-fe2015-6, and apus-fe2003 to codfw - jhancock@cumin2002"
[19:27:41] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[19:28:10] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host thanos-fe2005
[19:28:19] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host thanos-fe2005
[19:28:22] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host thanos-fe2006
[19:28:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host thanos-fe2006
[19:28:32] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host thanos-fe2007
[19:28:40] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host thanos-fe2007
[19:28:43] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe2015
[19:28:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe2015
[19:28:54] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host ms-fe2016
[19:29:02] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-fe2016
[19:29:05] !log jhancock@cumin2002 START - Cookbook sre.network.configure-switch-interfaces for host apus-fe2003
[19:29:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host apus-fe2003
[19:30:19] (CR) Bking: [C:+2] cirrussearch: set correct cluster name for LVS [puppet] - https://gerrit.wikimedia.org/r/1133220 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[19:31:16] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-fe2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:31:44] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-fe2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:31:58] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2015.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:32:17] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host ms-fe2016.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:32:36] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host apus-fe2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:33:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host thanos-fe2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:34:39] ops-eqiad, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778 (phaultfinder) NEW
[19:35:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-fe2006.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:35:17] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2015.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:35:27] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host ms-fe2016.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:35:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-fe2007.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:35:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host apus-fe2003.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:36:23] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host thanos-fe2005.mgmt.codfw.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED
[19:37:11] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['thanos-fe2005']
[19:37:20] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['thanos-fe2006']
[19:37:21] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['thanos-fe2007']
[19:37:21] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['apus-fe2003']
[19:37:22] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe2016']
[19:37:23] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-fe2015']
[19:37:37] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['thanos-fe2007']
[19:37:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['thanos-fe2005']
[19:37:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['thanos-fe2006']
[19:37:56] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['thanos-fe2007']
[19:38:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['thanos-fe2007']
[19:38:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-fe2015']
[19:38:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-fe2016']
[19:38:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['apus-fe2003']
[19:38:57] ops-codfw, SRE, DBA, DC-Ops: Test hot disk swap on Supermicro database hosts - https://phabricator.wikimedia.org/T388684#10700887 (Marostegui) @papaul feel free to mark the disk as bad (or simply pull it out) and get the old one back. We probably have to clear the config again as you did before
[19:39:08] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe2005.codfw.wmnet with OS bullseye
[19:39:13] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10700889 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host thanos-fe2005.codfw.wmnet with OS bull...
[19:39:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe2006.codfw.wmnet with OS bullseye
[19:39:40] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10700896 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host thanos-fe2006.codfw.wmnet with OS bull...
[19:40:25] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[19:40:37] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2015.codfw.wmnet with OS bullseye
[19:40:44] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install ms-fe201[56] - https://phabricator.wikimedia.org/T388887#10700907 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-fe2015.codfw.wmnet with OS bullseye
[19:40:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-fe2007.codfw.wmnet with OS bullseye
[19:40:59] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10700910 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host thanos-fe2007.codfw.wmnet with OS bull...
[19:41:10] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ms-fe2016.codfw.wmnet with OS bullseye
[19:41:16] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install ms-fe201[56] - https://phabricator.wikimedia.org/T388887#10700915 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ms-fe2016.codfw.wmnet with OS bullseye
[19:41:33] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host apus-fe2003.codfw.wmnet with OS bookworm
[19:41:40] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10700924 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host apus-fe2003.codfw.wmnet with OS bookworm
[19:44:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[19:47:08] (PS1) Bking: cirrussearch: Enable new role with existing alias [puppet] - https://gerrit.wikimedia.org/r/1133230 (https://phabricator.wikimedia.org/T388610)
[19:48:51] (PS1) Scott French: php8.1: Rebuild to update Debian packages [docker-images/production-images] - https://gerrit.wikimedia.org/r/1133229
[19:53:37] SRE, SRE-Unowned, serviceops-radar, wikitech.wikimedia.org, Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10700999 (Andrew) Thanks for the look! Since my goal here is 'adequate in an emergency' I'm going to set aside many of your concerns as aesthetic...
[19:54:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2005.codfw.wmnet with reason: host reimage
[19:54:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2006.codfw.wmnet with reason: host reimage
[19:54:54] (PS1) Michael Große: Don't add WikiLove icon to Minerva [extensions/WikiLove] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1133232 (https://phabricator.wikimedia.org/T390642)
[19:55:23] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 01 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-ite" [extensions/WikiLove] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1133232 (https://phabricator.wikimedia.org/T390642) (owner: Michael Große)
[19:55:41] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2015.codfw.wmnet with reason: host reimage
[19:56:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-fe2007.codfw.wmnet with reason: host reimage
[19:56:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-fe2016.codfw.wmnet with reason: host reimage
[19:58:05] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2005.codfw.wmnet with reason: host reimage
[20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T2000).
[20:00:05] MichaelG_WMF and Superpes: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[20:00:11] Hi :)
[20:00:14] Hey :)
[20:01:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2007.codfw.wmnet with reason: host reimage
[20:02:51] i can deploy
[20:03:40] Superpes: i'll get yours out while MichaelG_WMF's patches will be in the CI
[20:03:55] (PS1) Bking: cirrussearch: Add host-specific hieradata for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1133234 (https://phabricator.wikimedia.org/T388610)
[20:04:00] taavi: Thank you :)
[20:04:00] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2015.codfw.wmnet with reason: host reimage
[20:04:16] Yep thanks :)
[20:04:25] (CR) Majavah: [C:+2] homepage: Add `homepage_transfersize_bytes_total` metric [extensions/GrowthExperiments] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1133157 (https://phabricator.wikimedia.org/T382003) (owner: Michael Große)
[20:04:26] (CR) Majavah: [C:+2] homepage: Add `homepage_transfersize_bytes_total` metric [extensions/GrowthExperiments] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1133158 (https://phabricator.wikimedia.org/T382003) (owner: Michael Große)
[20:04:27] (CR) Majavah: [C:+2] Don't add WikiLove icon to Minerva [extensions/WikiLove] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1133232 (https://phabricator.wikimedia.org/T390642) (owner: Michael Große)
[20:05:02] (PS2) Bking: cirrussearch: Add host-specific hieradata for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1133234 (https://phabricator.wikimedia.org/T388610)
[20:05:50] (CR) TrainBranchBot: [C:+2] "Approved by taavi@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1132196 (https://phabricator.wikimedia.org/T389829) (owner: Superpes15)
[20:05:51] (CR) TrainBranchBot: [C:+2] "Approved by taavi@deploy1003 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1133164 (https://phabricator.wikimedia.org/T390732) (owner: Superpes15)
[20:06:28] (CR) Ryan Kemper: [C:+1] cirrussearch: Add host-specific hieradata for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1133234 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[20:06:42] (Merged) jenkins-bot: [plwiki] Allow bureaucrats to remove users from sysop usergroup [mediawiki-config] - https://gerrit.wikimedia.org/r/1132196 (https://phabricator.wikimedia.org/T389829) (owner: Superpes15)
[20:06:45] (Merged) jenkins-bot: Close pihwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1133164 (https://phabricator.wikimedia.org/T390732) (owner: Superpes15)
[20:07:12] !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1132196|[plwiki] Allow bureaucrats to remove users from sysop usergroup (T389829)]], [[gerrit:1133164|Close pihwiki (T390732)]]
[20:07:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-fe2016.codfw.wmnet with reason: host reimage
[20:07:17] T389829: Allow pl.wiki 'crats to remove sysop permissions - https://phabricator.wikimedia.org/T389829
[20:07:17] T390732: Close pihwiki - https://phabricator.wikimedia.org/T390732
[20:11:07] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-fe2006.codfw.wmnet with reason: host reimage
[20:11:15] (CR) Bking: [C:+2] cirrussearch: Add host-specific hieradata for Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1133234 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[20:13:17] Superpes: please test
[20:13:36] !log taavi@deploy1003 superpes, taavi: Backport for [[gerrit:1132196|[plwiki] Allow bureaucrats to remove users from sysop usergroup (T389829)]], [[gerrit:1133164|Close pihwiki (T390732)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:13:40] T389829: Allow pl.wiki 'crats to remove sysop permissions - https://phabricator.wikimedia.org/T389829
[20:13:40] T390732: Close pihwiki - https://phabricator.wikimedia.org/T390732
[20:13:58] Everything looks fine taavi :)
[20:14:29] !log taavi@deploy1003 superpes, taavi: Continuing with sync
[20:14:58] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:20:17] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:21:30] !log taavi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1132196|[plwiki] Allow bureaucrats to remove users from sysop usergroup (T389829)]], [[gerrit:1133164|Close pihwiki (T390732)]] (duration: 14m 18s)
[20:21:34] T389829: Allow pl.wiki 'crats to remove sysop permissions - https://phabricator.wikimedia.org/T389829
[20:21:34] T390732: Close pihwiki - https://phabricator.wikimedia.org/T390732
[20:21:57] Thanks for your assistance taavi :3
[20:22:37] taavi - the jobs look funny for my changes on zuul: https://integration.wikimedia.org/zuul/ - the mwext-php74-phan ones haven't started. Are they waiting on some other queue?
[20:24:47] yeah the CI configuration is kind of borked atm unfortunately, we're looking at it in -releng
[20:25:23] * MichaelG_WMF reads up over there
[20:26:55] ok, now the jobs are running at least
[20:27:03] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:27:39] 🤞
[20:27:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[20:28:20] (Merged) jenkins-bot: homepage: Add `homepage_transfersize_bytes_total` metric [extensions/GrowthExperiments] (wmf/1.44.0-wmf.22) - https://gerrit.wikimedia.org/r/1133157 (https://phabricator.wikimedia.org/T382003) (owner: Michael Große)
[20:28:21] (Merged) jenkins-bot: homepage: Add `homepage_transfersize_bytes_total` metric [extensions/GrowthExperiments] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1133158 (https://phabricator.wikimedia.org/T382003) (owner: Michael Große)
[20:29:07] (Merged) jenkins-bot: Don't add WikiLove icon to Minerva [extensions/WikiLove] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1133232 (https://phabricator.wikimedia.org/T390642) (owner: Michael Große)
[20:29:46] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:29:47] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe2005.codfw.wmnet with OS bullseye
[20:29:51] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10701242 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host thanos-fe2005.codfw.wmnet with OS bullseye...
[20:29:53] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:29:54] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe2007.codfw.wmnet with OS bullseye
[20:29:56] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:29:57] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2015.codfw.wmnet with OS bullseye
[20:29:57] !log taavi@deploy1003 Started scap sync-world: Backport for [[gerrit:1133157|homepage: Add `homepage_transfersize_bytes_total` metric (T382003)]], [[gerrit:1133158|homepage: Add `homepage_transfersize_bytes_total` metric (T382003)]], [[gerrit:1133232|Don't add WikiLove icon to Minerva (T390642)]]
[20:30:01] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10701246 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host thanos-fe2007.codfw.wmnet with OS bullseye...
[20:30:01] T382003: (mw.track) Migrate timing.growthExperiments.* to statslib - https://phabricator.wikimedia.org/T382003
[20:30:01] T390642: Non-functional WikiLove icon showing on user pages on mobile - https://phabricator.wikimedia.org/T390642
[20:30:04] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install ms-fe201[56] - https://phabricator.wikimedia.org/T388887#10701250 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-fe2015.codfw.wmnet with OS bullseye completed: - ms-fe2015 (**WARN...
[20:30:06] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:30:24] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:30:25] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:30:25] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-fe2016.codfw.wmnet with OS bullseye
[20:30:30] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install ms-fe201[56] - https://phabricator.wikimedia.org/T388887#10701257 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ms-fe2016.codfw.wmnet with OS bullseye completed: - ms-fe2016 (**WARN...
[20:30:30] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002"
[20:30:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-fe2006.codfw.wmnet with OS bullseye
[20:30:37] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10701258 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host thanos-fe2006.codfw.wmnet with OS bullseye...
[20:32:02] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10701265 (Jhancock.wm) Open→Resolved
[20:32:31] ops-codfw, SRE, SRE-swift-storage, Data-Persistence, DC-Ops: Q4:rack/setup/install thanos-fe200[5-7] - https://phabricator.wikimedia.org/T389634#10701269 (Jhancock.wm) @MatthewVernon all yours!
[20:32:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[20:33:01] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install ms-fe201[56] - https://phabricator.wikimedia.org/T388887#10701270 (Jhancock.wm) Open→Resolved
[20:33:09] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install ms-fe201[56] - https://phabricator.wikimedia.org/T388887#10701273 (Jhancock.wm) @MatthewVernon all yours!
[20:34:27] (PS1) Bking: cirrussearch: fix typo in hieradata [puppet] - https://gerrit.wikimedia.org/r/1133242 (https://phabricator.wikimedia.org/T388610)
[20:35:56] MichaelG_WMF: please test
[20:36:05] looking
[20:36:12] (CR) Ebernhardson: [C:+1] cirrussearch: fix typo in hieradata [puppet] - https://gerrit.wikimedia.org/r/1133242 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[20:36:38] (CR) Bking: [C:+2] cirrussearch: fix typo in hieradata [puppet] - https://gerrit.wikimedia.org/r/1133242 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[20:36:50] the wikilove change is looking good, testing the other ones
[20:37:19] !log taavi@deploy1003 migr, taavi: Backport for [[gerrit:1133157|homepage: Add `homepage_transfersize_bytes_total` metric (T382003)]], [[gerrit:1133158|homepage: Add `homepage_transfersize_bytes_total` metric (T382003)]], [[gerrit:1133232|Don't add WikiLove icon to Minerva (T390642)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[20:37:22] T382003: (mw.track) Migrate timing.growthExperiments.* to statslib - https://phabricator.wikimedia.org/T382003
[20:37:23] T390642: Non-functional WikiLove icon showing on user pages on mobile - https://phabricator.wikimedia.org/T390642
[20:39:40] taavi, and the other ones are looking fine, too!
[20:39:58] !log taavi@deploy1003 migr, taavi: Continuing with sync
[20:44:36] (PS3) Ebernhardson: cirrus: Add services for read operations [mediawiki-config] - https://gerrit.wikimedia.org/r/838270 (https://phabricator.wikimedia.org/T143553)
[20:44:36] (PS4) Ebernhardson: Use discovery dns for elasticsearch read traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/838271 (https://phabricator.wikimedia.org/T143553)
[20:45:29] (CR) CI reject: [V:-1] cirrus: Add services for read operations [mediawiki-config] - https://gerrit.wikimedia.org/r/838270 (https://phabricator.wikimedia.org/T143553) (owner: Ebernhardson)
[20:46:57] !log taavi@deploy1003 Finished scap sync-world: Backport for [[gerrit:1133157|homepage: Add `homepage_transfersize_bytes_total` metric (T382003)]], [[gerrit:1133158|homepage: Add `homepage_transfersize_bytes_total` metric (T382003)]], [[gerrit:1133232|Don't add WikiLove icon to Minerva (T390642)]] (duration: 16m 59s)
[20:47:00] T382003: (mw.track) Migrate timing.growthExperiments.* to statslib - https://phabricator.wikimedia.org/T382003
[20:47:01] T390642: Non-functional WikiLove icon showing on user pages on mobile - https://phabricator.wikimedia.org/T390642
[20:47:04] MichaelG_WMF: finally done!
[20:47:16] taavi: Thank you! :)
[20:50:41] SRE, MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10701358 (Tgr) >>! In T390514#10700334, @Scott_French wrote: > 1. Under what conditions will Mediawiki create / persist a new user session? There are many:...
[20:50:49] (PS1) Bking: cirrussearch: match filename to lookup paths (again) [puppet] - https://gerrit.wikimedia.org/r/1133243 (https://phabricator.wikimedia.org/T388610)
[20:51:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[20:54:48] (PS4) Ebernhardson: cirrus: Add services for read operations [mediawiki-config] - https://gerrit.wikimedia.org/r/838270 (https://phabricator.wikimedia.org/T143553)
[20:54:48] (PS5) Ebernhardson: Use discovery dns for elasticsearch read traffic [mediawiki-config] - https://gerrit.wikimedia.org/r/838271 (https://phabricator.wikimedia.org/T143553)
[20:57:24] SRE, MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10701374 (Tgr) > Uncommon authentication methods (OAuth, centralauthtoken, NetworkAuth) Well by request share this are actually pretty common. But the num...
[21:00:05] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250401T2100)
[21:01:27] (CR) Ebernhardson: [C:+1] cirrussearch: match filename to lookup paths (again) [puppet] - https://gerrit.wikimedia.org/r/1133243 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:04:37] (CR) Ryan Kemper: [C:+1] cirrussearch: match filename to lookup paths (again) [puppet] - https://gerrit.wikimedia.org/r/1133243 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:06:54] (PS1) SBassett: OATHAuth: Mark checkuser and suppress as requiring 2FA [mediawiki-config] - https://gerrit.wikimedia.org/r/1133245 (https://phabricator.wikimedia.org/T150898)
[21:08:17] (CR) SBassett: [C:-2] "Hold for deployment." [mediawiki-config] - https://gerrit.wikimedia.org/r/1133245 (https://phabricator.wikimedia.org/T150898) (owner: SBassett)
[21:09:56] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10701400 (Jhancock.wm)
[21:09:57] (CR) Bking: [C:+2] Set envoy keepalive's for search to match nginx [puppet] - https://gerrit.wikimedia.org/r/1132739 (https://phabricator.wikimedia.org/T390612) (owner: Ebernhardson)
[21:10:49] ops-codfw, SRE, SRE-swift-storage, DC-Ops: Q4:rack/setup/install apus-fe2003 - https://phabricator.wikimedia.org/T390578#10701402 (Jhancock.wm) hit an error with the raid during the os install. No specific error was given. Will come back to this later.
[21:12:15] (CR) Bking: [C:+2] cirrussearch: match filename to lookup paths (again) [puppet] - https://gerrit.wikimedia.org/r/1133243 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:14:34] ops-eqiad, DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787 (phaultfinder) NEW
[21:15:38] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10701412 (phaultfinder)
[21:21:08] (PS1) Bking: cirrussearch: set correct hieradata for opensearch::instances [puppet] - https://gerrit.wikimedia.org/r/1133246 (https://phabricator.wikimedia.org/T388610)
[21:23:04] (CR) Ryan Kemper: [C:+1] cirrussearch: set correct hieradata for opensearch::instances [puppet] - https://gerrit.wikimedia.org/r/1133246 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:23:09] (CR) Bking: [C:+2] cirrussearch: set correct hieradata for opensearch::instances [puppet] - https://gerrit.wikimedia.org/r/1133246 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:27:58] (PS1) Bking: cirrussearch: fix hieradata for codfw [puppet] - https://gerrit.wikimedia.org/r/1133247 (https://phabricator.wikimedia.org/T388610)
[21:28:25] FIRING: SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:30:11] (CR) Ryan Kemper: [C:+1] cirrussearch: fix hieradata for codfw [puppet] - https://gerrit.wikimedia.org/r/1133247 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:30:22] (CR) Bking: [C:+2] cirrussearch: fix hieradata for codfw [puppet] - https://gerrit.wikimedia.org/r/1133247 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:35:36] SRE, MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10701511 (Eevans) Here is noteworthy data point: This large spike in the graph corresponds with a compaction in Cassandra (a big one)... | {F58960589} | |...
[21:36:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[21:38:25] FIRING: [2x] SystemdUnitFailed: mwscript-cleanup.service on deploy1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:39:22] ● mwscript-cleanup.service loaded failed failed Remove lingering Helm releases from completed maintenance scripts.
[21:40:20] i'll make a lower prio ticket for that one
[21:41:36] !log deploy1003 sudo -u mwdeploy /usr/local/bin/mwscript-cleanup --debug eqiad
[21:41:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:43:37] SRE, MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10701517 (Tgr) Does that mean the problem is not creation of new sessions but too many writes to existing sessions?
[21:44:31] SRE, MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10701518 (Eevans) >>! In T390514#10701374, @Tgr wrote: > [ ... ] > Can you look inside session data in Cassandra? (Seems like it still uses PHP serializatio...
[21:45:32] (PS1) Bking: Fix variable paths and reorganize [puppet] - https://gerrit.wikimedia.org/r/1133249 (https://phabricator.wikimedia.org/T388610)
[21:46:44] RESOLVED: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[21:47:53] SRE, MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10701541 (Eevans) >>! In T390514#10701517, @Tgr wrote: > Does that mean the problem is not creation of new sessions but too many writes to existing sessions...
[21:48:16] (PS2) Bking: cirrussearch: Fix variable paths and reorganize [puppet] - https://gerrit.wikimedia.org/r/1133249 (https://phabricator.wikimedia.org/T388610)
[21:48:55] (CR) Ryan Kemper: [C:+1] cirrussearch: Fix variable paths and reorganize [puppet] - https://gerrit.wikimedia.org/r/1133249 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:49:14] (CR) Bking: [C:+2] cirrussearch: Fix variable paths and reorganize [puppet] - https://gerrit.wikimedia.org/r/1133249 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:50:42] SRE, serviceops: mwscript-cleanup.service failure - https://phabricator.wikimedia.org/T390790#10701547 (Dzahn)
[21:53:00] SRE, serviceops: mwscript-cleanup.service failure - https://phabricator.wikimedia.org/T390790#10701553 (Dzahn) https://alerts.wikimedia.org/?q=%40state%3Dactive&q=%40cluster%3Dwikimedia.org&q=alertname%3DCheck%20unit%20status%20of%20mwscript-cleanup
[21:53:33] (PS1) Bking: cirrussearch: fix variable scope for row awareness [puppet] - https://gerrit.wikimedia.org/r/1133250 (https://phabricator.wikimedia.org/T388610)
[21:54:03] (CR) Ryan Kemper: [C:+1] cirrussearch: fix variable scope for row awareness [puppet] - https://gerrit.wikimedia.org/r/1133250 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:54:18] (CR) Bking: [C:+2] cirrussearch: fix variable scope for row awareness [puppet] - https://gerrit.wikimedia.org/r/1133250 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:54:34] SRE, MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10701555 (Tgr) BagOStuff metrics for MWSession: [[https://grafana.wikimedia.org/d/4plhqSPGk/bagostuff-stats-by-key-group?orgId=1&var-kClass=MWSession&from=n...
[21:56:32] SRE, serviceops: mwscript-cleanup.service failure - https://phabricator.wikimedia.org/T390790#10701559 (Dzahn) p:Triage→Medium
[21:57:03] Puppet, Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10701560 (jhathaway) >>! In T389932#10697436, @Joe wrote: >>>! In T389932#10694961, @jhathaway wrote: >> One issue with using just the FQDN is that is breaks tools which...
[21:58:57] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, Data-Platform-SRE: Requesting Kerberos access for ben.buchenau - https://phabricator.wikimedia.org/T390734#10701569 (Dzahn)
[21:58:58] (PS1) Bking: cirrus: fix regex to match new host [puppet] - https://gerrit.wikimedia.org/r/1133251 (https://phabricator.wikimedia.org/T388610)
[21:58:59] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10701570 (Dzahn)
[21:59:18] (CR) Ryan Kemper: [C:+1] cirrus: fix regex to match new host [puppet] - https://gerrit.wikimedia.org/r/1133251 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:59:31] (CR) Bking: [C:+2] cirrus: fix regex to match new host [puppet] - https://gerrit.wikimedia.org/r/1133251 (https://phabricator.wikimedia.org/T388610) (owner: Bking)
[21:59:47] SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users for @Ben.buchenau - https://phabricator.wikimedia.org/T386904#10701577 (Dzahn) The user says now they are just missing a Kerberos identity and opened T390734.
[22:00:43] Puppet, Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10701585 (jhathaway) In proposing possible solutions, I would love to understand a bit more why our `site.pp` uses complex regexes. From looking through the git log it ap...
[22:01:59] SRE, MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10701589 (Eevans) >>! In T390514#10701555, @Tgr wrote: > ...the request rate is the same you can see in the sessionstore logs... Do you mean this one? Do...
[22:04:05] !log bking@cumin2002 DONE (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for cirrussearch2055.codfw.wmnet: Renew puppet certificate - bking@cumin2002
[22:06:17] (PS1) Dzahn: admin: add Kerberos principal to user benbuchenau [puppet] - https://gerrit.wikimedia.org/r/1133254 (https://phabricator.wikimedia.org/T386904)
[22:08:02] (CR) Máté Szabó: [C:+1] EmailAuth: Enable "enforce" mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1132996 (https://phabricator.wikimedia.org/T390662) (owner: Kosta Harlan)
[22:10:09] (PS8) Krinkle: MediaWiki: Only proxy existing .php files, otherwise return nice 404 [puppet] - https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T382357) (owner: Bartosz Dziewoński)
[22:10:30] (PS2) Dzahn: admin: add Kerberos principal to user benbuchenau [puppet] - https://gerrit.wikimedia.org/r/1133254 (https://phabricator.wikimedia.org/T386904)
[22:13:07] SRE, MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10701610 (Tgr) We are logging when session cookies get written (which is a subset of when sessionstore is written). Persisting when we have a session ID b...
[22:15:14] (PS9) Krinkle: MediaWiki: Only proxy existing .php files, otherwise return nice 404 [puppet] - https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T382357) (owner: Bartosz Dziewoński)
[22:15:31] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, and 2 others: Requesting Kerberos access for ben.buchenau - https://phabricator.wikimedia.org/T390734#10701615 (Dzahn) ACK! per [[https://wikitech.wikimedia.org/wiki/SRE/Clinic_Duty/Access_requests#Analytics_Groups | .. one of th...
[22:15:32] (CR) Krinkle: [C:+1] MediaWiki: Only proxy existing .php files, otherwise return nice 404 [puppet] - https://gerrit.wikimedia.org/r/1100534 (https://phabricator.wikimedia.org/T382357) (owner: Bartosz Dziewoński)
[22:21:23] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, and 2 others: Requesting Kerberos access for ben.buchenau - https://phabricator.wikimedia.org/T390734#10701619 (Dzahn) @Ben.buchenau I ran the `manage_principals.py create benbuchenau ..` command to create a Kerberos principal fo...
[22:22:47] SRE, SRE-Access-Requests, Data-Engineering, Data-Engineering-Radar, and 2 others: Requesting Kerberos access for ben.buchenau - https://phabricator.wikimedia.org/T390734#10701621 (Dzahn) Open→In progress a:Ben.buchenau Please let us know if everything works for you now.
[22:23:35] FIRING: [6x] SystemdUnitFailed: opensearch-disable-readahead.service on cirrussearch2055:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:24:54] (CR) Dzahn: [C:+2] admin: add Kerberos principal to user benbuchenau [puppet] - https://gerrit.wikimedia.org/r/1133254 (https://phabricator.wikimedia.org/T386904) (owner: Dzahn)
[22:40:21] (PS1) Ladsgroup: wikimedia.org: Add HIBP confirm TXT record [dns] - https://gerrit.wikimedia.org/r/1133256
[22:40:53] (CR) CI reject: [V:-1] wikimedia.org: Add HIBP confirm TXT record [dns] - https://gerrit.wikimedia.org/r/1133256 (owner: Ladsgroup)
[22:48:19] (PS2) Ladsgroup: wikimedia.org: Add HIBP confirm TXT record [dns] - https://gerrit.wikimedia.org/r/1133256
[22:49:18] (PS3) Ladsgroup: wikimedia.org: Add HIBP confirm TXT record for idm [dns] - https://gerrit.wikimedia.org/r/1133256
[22:51:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[22:53:37] FIRING: [8x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:54:25] (CR) Ssingh: [C:+1] wikimedia.org: Add HIBP confirm TXT record for idm (1 comment) [dns] - https://gerrit.wikimedia.org/r/1133256 (owner: Ladsgroup)
[22:55:01] SRE, MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10701739 (Tgr) Login rates: | web | API | {F58961870} | {F58961871} The API is even except for the two bots already accounted for. Web shows a jump in Novem...
[22:55:15] (CR) Ssingh: [C:+1] wikimedia.org: Add HIBP confirm TXT record for idm (1 comment) [dns] - https://gerrit.wikimedia.org/r/1133256 (owner: Ladsgroup)
[22:56:03] (PS4) Ladsgroup: wikimedia.org: Add HIBP confirm TXT record for idm [dns] - https://gerrit.wikimedia.org/r/1133256
[22:56:32] (CR) Ladsgroup: wikimedia.org: Add HIBP confirm TXT record for idm (1 comment) [dns] - https://gerrit.wikimedia.org/r/1133256 (owner: Ladsgroup)
[22:56:39] (CR) Ssingh: [C:+1] wikimedia.org: Add HIBP confirm TXT record for idm [dns] - https://gerrit.wikimedia.org/r/1133256 (owner: Ladsgroup)
[22:59:45] (CR) Ladsgroup: [C:+2] wikimedia.org: Add HIBP confirm TXT record for idm [dns] - https://gerrit.wikimedia.org/r/1133256 (owner: Ladsgroup)
[23:00:47] !log ladsgroup@dns1004 START - running authdns-update
[23:03:05] !log ladsgroup@dns1004 END - running authdns-update
[23:04:43] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10701778 (phaultfinder)
[23:12:02] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[23:16:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[23:16:47] (PS1) Ladsgroup: wikimedia.org: Remove idm HIBP record, add idp [dns] - https://gerrit.wikimedia.org/r/1133259
[23:18:03] (CR) Ladsgroup: [C:+2] wikimedia.org: Remove idm HIBP record, add idp [dns] - https://gerrit.wikimedia.org/r/1133259 (owner: Ladsgroup)
[23:18:19] !log ladsgroup@dns1004 START - running authdns-update
[23:20:37] !log ladsgroup@dns1004 END - running authdns-update
[23:23:52] (PS1) Ladsgroup: wikimedia.org: Add gerrit HIBP confirm record, remove idp [dns] - https://gerrit.wikimedia.org/r/1133260
[23:25:11] (CR) Ladsgroup: [C:+2] wikimedia.org: Add gerrit HIBP confirm record, remove idp [dns] - https://gerrit.wikimedia.org/r/1133260 (owner: Ladsgroup)
[23:25:17] !log ladsgroup@dns1004 START - running authdns-update
[23:26:33] SRE, MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10701851 (Tgr) > Web shows a jump in November Should probably look into that in more detail, we might have broken something unintentionally.
[23:27:36] !log ladsgroup@dns1004 END - running authdns-update
[23:27:57] (PS1) Reedy: wikiversions.json: Move pihwiki to .23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1133263
[23:28:05] jouncebot: nowandnext
[23:28:05] No deployments scheduled for the next 6 hour(s) and 31 minute(s)
[23:28:05] In 6 hour(s) and 31 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250402T0600)
[23:29:32] (CR) Reedy: [C:+2] wikiversions.json: Move pihwiki to .23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1133263 (owner: Reedy)
[23:30:10] SRE, MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10701872 (Tgr) >>! In T390514#10701518, @Eevans wrote: > Is that not reflected in the formatting of the key? I sampled some here: T390514#10700270 All sess...
[23:30:32] (Merged) jenkins-bot: wikiversions.json: Move pihwiki to .23 [mediawiki-config] - https://gerrit.wikimedia.org/r/1133263 (owner: Reedy)
[23:31:12] (PS1) Ladsgroup: wikimedia.org: Add HIBP confirm record for phabricator, remove gerrit [dns] - https://gerrit.wikimedia.org/r/1133264
[23:32:02] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[23:32:23] (CR) Ladsgroup: [C:+2] wikimedia.org: Add HIBP confirm record for phabricator, remove gerrit [dns] - https://gerrit.wikimedia.org/r/1133264 (owner: Ladsgroup)
[23:32:29] !log ladsgroup@dns1004 START - running authdns-update
[23:32:38] SRE, MediaWiki-Platform-Team: Identify and remediate large increase in sessionstore Cassandra disk usage - https://phabricator.wikimedia.org/T390514#10701877 (Eevans) >>! In T390514#10701872, @Tgr wrote: >>>! In T390514#10701518, @Eevans wrote: >> Is that not reflected in the formatting of the key? I sam...
[23:34:50] !log ladsgroup@dns1004 END - running authdns-update
[23:37:11] (PS1) Ladsgroup: wikimedia.org: Add lists. HIBP TXT record, remove phabricator [dns] - https://gerrit.wikimedia.org/r/1133265
[23:37:48] FIRING: PuppetFailure: Puppet has failed on cirrussearch2055:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:38:15] (CR) Ladsgroup: [C:+2] wikimedia.org: Add lists. HIBP TXT record, remove phabricator [dns] - https://gerrit.wikimedia.org/r/1133265 (owner: Ladsgroup)
[23:38:22] !log ladsgroup@dns1004 START - running authdns-update
[23:38:44] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1133266
[23:38:44] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1133266 (owner: TrainBranchBot)
[23:39:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[23:40:25] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:40:41] !log ladsgroup@dns1004 END - running authdns-update
[23:43:38] !log reedy@deploy1003 rebuilt and synchronized wikiversions files: pihwiki to .23
[23:44:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[23:50:33] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1133266 (owner: TrainBranchBot)