[00:07:04] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [00:09:34] (03PS1) 10TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1134389 [00:09:35] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1134389 (owner: 10TrainBranchBot) [00:09:42] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10716174 (10phaultfinder) [00:09:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10716175 (10phaultfinder) [00:09:44] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10716176 (10phaultfinder) [00:24:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [00:27:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [00:27:45] (03Merged) 10jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - 10https://gerrit.wikimedia.org/r/1134389 (owner: 10TrainBranchBot) [00:35:38] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2056-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [00:37:13] FIRING: [3x] 
CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [01:29:33] FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [01:32:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-jobrunner_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:32:02] (03PS1) 10Jdlrobson: Revert "Take 2: Large math formulae should be scrollable" [extensions/Math] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134391 (https://phabricator.wikimedia.org/T201233) [02:35:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2056-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [03:12:13] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:19:44] FIRING: [2x] RipeAtlasAnchorUnreachable: 
ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [03:29:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [03:35:42] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:12:04] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [04:24:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [04:32:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [04:32:08] (03PS5) 10Abijeet Patro: AX: Enable Quick Surveys extension on Tswana and Venetian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130963 (https://phabricator.wikimedia.org/T390023) [04:32:22] (03PS4) 10Abijeet Patro: AX: Enable 
entry-points on Tswana and Venetian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130942 (https://phabricator.wikimedia.org/T390023) [04:37:13] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [04:54:04] (03PS1) 10Superpes15: [tawiki] Enable translator usergroup and only allows translator to use ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134394 (https://phabricator.wikimedia.org/T391171) [05:03:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:08:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [05:10:42] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:29:33] FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [05:32:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-jobrunner_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:37:48] (03PS1) 10KartikMistry: Update cxserver to 2025-04-07-053106-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134397 (https://phabricator.wikimedia.org/T390732) [05:39:28] (03CR) 10Ayounsi: [C:03+1] ripe atlas anchors: change hiera device name [puppet] - 10https://gerrit.wikimedia.org/r/1134235 (https://phabricator.wikimedia.org/T388419) (owner: 10Tiziano Fogli) [05:55:42] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [05:57:13] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [05:57:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 3.493% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [05:58:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [05:58:42] (03CR) 10Nikerabbit: [C:03+1] AX: Enable Quick Surveys extension on Tswana and Venetian wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130963 (https://phabricator.wikimedia.org/T390023) (owner: 10Abijeet Patro) [06:00:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.988s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:01:06] (03CR) 10Nikerabbit: [C:03+1] AX: Enable entry-points on Tswana and Venetian wiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130942 (https://phabricator.wikimedia.org/T390023) (owner: 10Abijeet Patro) [06:02:13] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:02:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 1.705% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [06:03:15] RESOLVED: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [06:05:16] RESOLVED: 
MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 2.988s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [06:09:10] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 07 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130963 (https://phabricator.wikimedia.org/T390023) (owner: 10Abijeet Patro) [06:21:41] (03PS1) 10Daniel Kinzler: EventIngress: use getDeletedPage instead of getPageStateBefore [extensions/Wikistories] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134400 (https://phabricator.wikimedia.org/T388588) [06:22:43] ^-- I'm planning to backport that in a bit... any objections? [06:33:22] @Amir1 @urbanecm: is it ok with you if I deploy this now before the deployment window? [06:35:06] duesen: fine with me, hopefully it would complete before the window can start [06:35:29] urbanecm: it should. uh, what's the current deployment host? [06:36:18] duesen: deploy1003 [06:38:03] thanks [06:38:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by daniel@deploy1003 using scap backport" [extensions/Wikistories] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134400 (https://phabricator.wikimedia.org/T388588) (owner: 10Daniel Kinzler) [06:39:42] ah crud, i keep forgetting to run backports in screen. Let's hope my internet doesn't die. 
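Note on the message above: a minimal sketch of a pre-flight check that would catch a long-running command started outside a terminal multiplexer. It relies only on GNU screen exporting `STY` and tmux exporting `TMUX` inside their sessions. The function name is hypothetical and nothing here is part of scap.

```python
import os
from typing import Mapping, Optional


def multiplexer_name(env: Mapping[str, str]) -> Optional[str]:
    """Return 'screen' or 'tmux' if env looks like such a session, else None."""
    if env.get("STY"):   # GNU screen sets STY to the session name
        return "screen"
    if env.get("TMUX"):  # tmux sets TMUX to its socket path and ids
        return "tmux"
    return None


if __name__ == "__main__":
    name = multiplexer_name(os.environ)
    if name is None:
        print("warning: not inside screen or tmux; a dropped ssh session may kill this job")
    else:
        print(f"running inside {name}")
```

Passing the environment in as an argument, rather than reading `os.environ` inside the function, keeps the check easy to exercise without a real screen session.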
[06:39:46] (03Merged) 10jenkins-bot: EventIngress: use getDeletedPage instead of getPageStateBefore [extensions/Wikistories] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134400 (https://phabricator.wikimedia.org/T388588) (owner: 10Daniel Kinzler) [06:40:16] !log daniel@deploy1003 Started scap sync-world: Backport for [[gerrit:1134400|EventIngress: use getDeletedPage instead of getPageStateBefore (T388588 T391051)]] [06:40:22] T388588: Rename classes and methods on page related events to match the design document - https://phabricator.wikimedia.org/T388588 [06:40:22] T391051: Error: Call to undefined method MediaWiki\Page\Event\PageDeletedEvent::getPageStateBefore() - https://phabricator.wikimedia.org/T391051 [06:42:56] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 07 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-i" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130942 (https://phabricator.wikimedia.org/T390023) (owner: 10Abijeet Patro) [06:45:05] docker-pusher run is taking several minutes, is that expected? [06:45:41] (03Abandoned) 10Kevin Bazira: ml-services: update RRML image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1132572 (https://phabricator.wikimedia.org/T389768) (owner: 10Kevin Bazira) [06:47:17] duesen: `screen -R` in .bash_profile :) Learned the hard way! [06:47:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:47:48] probably best :D [06:48:15] kart_: do you know if it's ok for docker-pusher to take five minutes? [06:50:26] duesen: It is slow, you'll see logs written in your home directory as well. [06:50:55] yea, I did tail -f, but it seems suck. 
[06:50:59] *stuck [06:51:53] hmm. 9 minutes. That's long. [06:52:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [06:55:48] urbanecm: any idea what to do when docker-pusher is stuck? [07:00:04] Amir1, Urbanecm, and awight: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250407T0700). [07:00:05] abijeet and Superpes: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:33] Hi :) [07:02:57] (03PS18) 10Phedenskog: perf/navtiming: migrate alerts from grafana to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) (owner: 10Tiziano Fogli) [07:03:17] Deployment is stuck it seems duesen? [07:03:33] kart_: yes. [07:03:45] ctrl-c and retry? [07:03:57] or would that mess thing up? [07:04:15] urbanecm: ^ [07:04:20] Superpes: hi! sorry, the deployment i started 30 minutes ago apparently got stuck... [07:04:52] Yep yep Absolutely no rush for me :) [07:08:14] duesen: Not sure, what we can do here :/ [07:12:12] which part is it stuck on? 
[07:12:18] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:12:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2056-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [07:12:50] (03CR) 10Tiziano Fogli: [C:03+2] ripe atlas anchors: change hiera device name [puppet] - 10https://gerrit.wikimedia.org/r/1134235 (https://phabricator.wikimedia.org/T388419) (owner: 10Tiziano Fogli) [07:15:00] taavi: it is at, [mediawiki-publish-81] Running sudo /usr/local/bin/docker-pusher -q docker-registry.discovery.wmnet/restricted/mediawiki-multiversion:2025-04-07-064056-publish-81 [07:15:21] based on /home/daniel/scap-image-build-and-push-log [07:17:17] o/ [07:17:20] kart_, taavi : now my ssh session stalled. no idea if there's anything still running [07:17:58] the other week we pushed a change to serialize docker image push to the registry, as an attempt to fix an issue with deployments and the registry [07:18:06] https://phabricator.wikimedia.org/T390251 [07:18:16] not sure if related [07:18:52] elukey: do you think it would be safe to just try again? [07:19:03] were you deploying in a screen/tmux? [07:19:27] taavi: nope, i forgot and realized too late [07:19:34] from the logs it doesn't seem really stuck in pushing from a long time [07:20:12] but /home/daniel/scap-image-build-and-push-log [07:20:25] contains old timestamps though [07:20:35] maybe it wasn't stuck, it just died. 
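Note on the "old timestamps" observation above: a minimal sketch of a staleness check for a log file such as the one mentioned. The 300-second default threshold and both function names are assumptions made for illustration, not anything scap provides.

```python
import os
import time


def is_stalled(last_write_ts: float, now_ts: float, threshold_s: float = 300.0) -> bool:
    """True if nothing has been written for longer than threshold_s seconds."""
    return (now_ts - last_write_ts) > threshold_s


def log_is_stalled(path: str, threshold_s: float = 300.0) -> bool:
    """Apply is_stalled() to the file's last-modification time."""
    return is_stalled(os.path.getmtime(path), time.time(), threshold_s)
```

A stale log only says that nothing is being written. It cannot tell a hung process from a dead one, so a process listing is still needed to answer that question.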
[07:20:47] last one is 06:42, is it the right one? [07:20:53] namely, from that point onward nothing? [07:20:55] yes [07:20:57] ouch ok [07:21:05] lemme check [07:21:18] maybe there is something specific to that image [07:22:18] Apr 07 06:45:43 deploy1003 dockerd[561401]: time="2025-04-07T06:45:43.188911651Z" level=error msg="Upload failed, retrying: received unexpected HTTP status: 500 Internal Server Error" [07:22:36] yep the registry didn't like the image, lemme find out why [07:22:51] also I see [07:22:52] Apr 07 06:45:45 deploy1003 dockerd[561401]: time="2025-04-07T06:45:45.040553553Z" level=error msg="Not continuing with push after error: context canceled" [07:23:11] that may indicate why scap is waiting [07:23:27] Is it still waiting? [07:25:31] so far it seems to me that scap is waiting for docker to complete the push, but docker gave it up a while ago [07:25:56] my ssh client gave up as well [07:28:18] level=error msg="response completed with error" err.code=unknown err.detail="timeout expired while waiting for segments of /docker/registry/v2/repositories/restricted/mediawiki-multiversion [07:28:46] this is on the registry side, so the new multiversion image failed to be pushed correctly to swift, afaics [07:28:57] it seems a different version of https://phabricator.wikimedia.org/T390251 [07:29:05] but this time, we fail earlier [07:31:08] *sigh* [07:31:58] I think, I just started working on a monday morning, so I tend to doubt my analysis :D [07:32:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2056-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [07:32:49] but it seems that we are hitting some limits with the multiversion 
image somehow [07:33:37] duesen: qq - what is the status of scap? [07:35:27] ideally we should rollback [07:35:31] or let scap roll back [07:37:17] elukey: i have no idea what the status is... [07:37:49] well, it's merged into the deployment branch, but not deployed [07:38:00] so we could either roll back, or try to deploy again. [07:39:11] sure we can retry deploying, I feel that it should get stuck again, but we should be able to check [07:39:37] lovely start of the week :D [07:41:20] elukey: yea... actually, I don't really have time now... And Superpes is waiting to deploy something. You think theirs will get stuck as well? [07:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [07:43:10] elukey: I'm reverting the patch [07:44:22] (03PS1) 10Daniel Kinzler: Revert "EventIngress: use getDeletedPage instead of getPageStateBefore" [extensions/Wikistories] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134627 [07:44:42] duesen: we can monitor, I am interested to see if it gets stuck as well [07:45:06] !log T391122: reconciled 14 wikidata items (lost EventBus/eventgate events) [07:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:45:09] T391122: Some wikidata edits not being reflected on WDQS - https://phabricator.wikimedia.org/T391122 [07:46:09] elukey: can you merge the revert? 
looks like I can't: https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikistories/+/1134627 [07:47:09] (03CR) 10Elukey: [C:03+2] Revert "EventIngress: use getDeletedPage instead of getPageStateBefore" [extensions/Wikistories] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134627 (owner: 10Daniel Kinzler) [07:47:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [07:48:16] duesen: done! in the meantime I added some info in e52ef272dcf8633756b2a934fd [07:48:19] uff [07:48:21] https://phabricator.wikimedia.org/T390251#10716525 [07:50:47] elukey: thank you [07:52:44] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: cleanup k8s instances from prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/1133909 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [07:53:22] (03CR) 10CI reject: [V:04-1] Revert "EventIngress: use getDeletedPage instead of getPageStateBefore" [extensions/Wikistories] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134627 (owner: 10Daniel Kinzler) [07:58:03] (03CR) 10Vgutierrez: [C:03+1] external_cloud_vendors: Added Google SpeciaCaseCrawlers list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1134243 (https://phabricator.wikimedia.org/T391108) (owner: 10Fabfur) [07:59:38] (03CR) 10Vgutierrez: [C:03+1] hiera: enable TLS on volatile storage in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1133897 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [08:03:28] elukey: we're still blocked for deployment, right? 
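Note on the dockerd lines quoted at 07:22: a minimal sketch of how such journal lines could be classified, to tell a push that is still retrying from one that docker has abandoned while the caller keeps waiting. The two sample lines are copied from the log above. The function names and the three state labels are hypothetical.

```python
import re

# key=value fields as they appear in the quoted dockerd lines;
# values are either double-quoted strings or bare tokens.
_FIELD = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_fields(line):
    """Extract key=value pairs from one journal line into a dict."""
    fields = {}
    for key, value in _FIELD.findall(line):
        if len(value) >= 2 and value.startswith('"') and value.endswith('"'):
            value = value[1:-1]
        fields[key] = value
    return fields


def push_state(lines):
    """Return 'abandoned', 'retrying' or 'ok', based on the latest error line."""
    state = "ok"
    for line in lines:
        fields = parse_fields(line)
        if fields.get("level") != "error":
            continue
        msg = fields.get("msg", "")
        if msg.startswith("Not continuing with push"):
            state = "abandoned"
        elif msg.startswith("Upload failed, retrying"):
            state = "retrying"
    return state


SAMPLE = [
    'Apr 07 06:45:43 deploy1003 dockerd[561401]: time="2025-04-07T06:45:43.188911651Z" '
    'level=error msg="Upload failed, retrying: received unexpected HTTP status: 500 Internal Server Error"',
    'Apr 07 06:45:45 deploy1003 dockerd[561401]: time="2025-04-07T06:45:45.040553553Z" '
    'level=error msg="Not continuing with push after error: context canceled"',
]

if __name__ == "__main__":
    print(push_state(SAMPLE))
```

An "abandoned" result while the deploy tool is still waiting would match the situation described at 07:25:31.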
[08:03:45] duesen: o/ failed to merge afaics https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikistories/+/1134627 [08:04:17] kart_: yes but when the above gets merged we should be able to retry, in theory [08:05:48] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, 06serviceops: Create a cookbook to automate gerrit's switchover - https://phabricator.wikimedia.org/T260666#10716569 (10ABran-WMF) [08:05:52] elukey: yes, CI fails, because of the error that this patch is fixing. Reverting the patch makes the code fail... [08:05:55] Now what? [08:06:00] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: cleanup k8s instances from prometheus100[56] [puppet] - 10https://gerrit.wikimedia.org/r/1133910 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [08:06:07] (03PS2) 10Filippo Giunchedi: prometheus: cleanup k8s instances from prometheus100[56] [puppet] - 10https://gerrit.wikimedia.org/r/1133910 (https://phabricator.wikimedia.org/T383232) [08:06:21] (03CR) 10Elukey: Revert "EventIngress: use getDeletedPage instead of getPageStateBefore" [extensions/Wikistories] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134627 (owner: 10Daniel Kinzler) [08:06:49] elukey: can you reset the deployment branch to the parent commit? [08:06:54] !log disable puppet on A:cp-codfw to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133897 (T384227) [08:06:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:06:56] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227 [08:07:03] (03CR) 10Filippo Giunchedi: [C:03+2] prometheus: cleanup k8s instances from prometheus100[56] [puppet] - 10https://gerrit.wikimedia.org/r/1133910 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [08:07:19] duesen: I am not comfortable making these kind of changes, can you retry deploying? 
[08:07:22] (03CR) 10Fabfur: [C:03+2] hiera: enable TLS on volatile storage in codfw [puppet] - 10https://gerrit.wikimedia.org/r/1133897 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [08:07:30] so we check if it was a temporary issue or not [08:07:59] (after godog) [08:08:16] fabfur: I'm done [08:08:32] tnx! [08:09:16] elukey: ok... but i have to leave in about 30 minutes. [08:09:28] and i need to make a phone call in the meantime [08:09:35] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2230.codfw.wmnet,db1176.eqiad.wmnet with reason: Maintenance [08:09:44] i'll start the deploy, in screen [08:09:45] duesen: we'll make it work :) [08:10:04] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db[2160,2234].codfw.wmnet with reason: Maintenance [08:10:35] !log deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/1133897 on A:cp-codfw (T384227) [08:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:55] fabfur: mind if i start a scap deploy? 
[08:11:37] (03PS3) 10Fabfur: external_cloud_vendors: Added Google SpecialCaseCrawlers list [puppet] - 10https://gerrit.wikimedia.org/r/1134243 (https://phabricator.wikimedia.org/T391108) [08:11:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:11:45] (03CR) 10Fabfur: external_cloud_vendors: Added Google SpecialCaseCrawlers list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1134243 (https://phabricator.wikimedia.org/T391108) (owner: 10Fabfur) [08:12:06] duesen there should be no problem at all [08:12:13] ok cool [08:12:33] !log daniel@deploy1003 Started scap sync-world: Backport for [[gerrit:1134400|EventIngress: use getDeletedPage instead of getPageStateBefore (T388588 T391051)]] [08:12:37] T388588: Rename classes and methods on page related events to match the design document - https://phabricator.wikimedia.org/T388588 [08:12:37] T391051: Error: Call to undefined method MediaWiki\Page\Event\PageDeletedEvent::getPageStateBefore() - https://phabricator.wikimedia.org/T391051 [08:12:40] elukey: i restarted the deployment [08:12:58] super [08:13:19] (03Abandoned) 10Daniel Kinzler: Revert "EventIngress: use getDeletedPage instead of getPageStateBefore" [extensions/Wikistories] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134627 (owner: 10Daniel Kinzler) [08:16:21] elukey: ARRGG! So I did what kart_ suggested and put screen -R into my .profile. Tested it, worked nicely. And now... scap is running without screen again. Because... I edited .profile locally, and the puppet deployment reverted my change! How do I change my .profile again properly, so it sticks? 
[08:16:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:17:04] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [08:19:47] duesen: it should be .bash_profile, not .profile :) [08:20:19] duesen: if you check under modules/admin/files/home/* there are examples of folks customizing files like .bash_profile etc. [08:20:19] kart_: does it make a difference? afaik bash reads both. And it did seem to work, before the puppet deploy. [08:20:42] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:20:55] elukey: check it where? 
[08:21:06] sorry, puppet repo [08:21:15] ah right [08:24:26] elukey: the docker push went through, deployment is running [08:24:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [08:26:03] (03PS1) 10Fabfur: hiera: enable TLS on volatile storage in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1134630 (https://phabricator.wikimedia.org/T384227) [08:26:18] duesen: ack nice [08:27:05] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134630 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [08:28:17] 06SRE, 06SRE Observability, 07Kubernetes, 13Patch-For-Review: etcd: adapt etcd-backup.py for etcd 3.4 - https://phabricator.wikimedia.org/T385727#10716636 (10MatthewVernon) [I've put the Observability tag onto this task as otherwise it's showing up in the Clinic Duty triage list] [08:30:31] (03PS2) 10Fabfur: hiera: enable TLS on volatile storage in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1134630 (https://phabricator.wikimedia.org/T384227) [08:31:06] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134630 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [08:31:08] elukey: i... crud. so, the ssh session got stuck again. No Idea where scap is at now, because of the epic screen fail. I'm super sorry... And I really have to go... 
[08:31:22] (03PS4) 10Volans: spicerack: add Spicerack interactive shell [puppet] - 10https://gerrit.wikimedia.org/r/1133961 (https://phabricator.wikimedia.org/T389329) [08:31:22] (03PS1) 10Volans: cumin: Update insetup role report [puppet] - 10https://gerrit.wikimedia.org/r/1134632 (https://phabricator.wikimedia.org/T389825) [08:31:30] elukey: scap probably want me to say "yes, everything is fine" now. But I can't. [08:31:34] 06SRE, 06Data-Engineering, 06Data-Engineering-Radar, 06Infrastructure-Foundations, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Rebuild Spark images with Bookworm / bullseye-backports deprecation - https://phabricator.wikimedia.org/T390139#10716642 (10MatthewVernon) [08:31:56] duesen: is there anybody that can pick up your deployment? kart_ ? [08:32:38] AaronSchulz, but he's asleep. bpirkle too. [08:33:22] kart_ was just a randomly helping me out [08:33:26] duesen: done with deployment? [08:33:54] duesen: I completely get that you have to go, but leaving deployments stuck in this way it is not cool [08:33:54] nope [08:34:12] elukey: yes, i know. considering options. [08:34:39] oh actually - it's not stuck, scap just spat a LOT of red error messages at me. 
[08:35:14] it should be at the time of deploying to mw-debug [08:35:33] not anymore, no process hanging around [08:35:46] probably we hit the 600s timeout [08:36:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:37:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [08:37:13] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [08:37:19] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:38:23] elukey: it came back, ssh didn't die. Scap failed with spectacular error output. I'll put it in a pastebin in a second. 
[08:38:49] sigh [08:39:20] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host atlas1001.wikimedia.org [08:39:22] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [08:41:17] (03PS1) 10Slyngshede: Release version 0.1.10 [software/bitu] - 10https://gerrit.wikimedia.org/r/1134633 [08:41:59] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [08:43:41] elukey: https://phabricator.wikimedia.org/P74613 [08:43:45] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas1001.wikimedia.org - ayounsi@cumin1002" [08:43:51] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas1001.wikimedia.org - ayounsi@cumin1002" [08:43:51] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:43:51] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache atlas1001.wikimedia.org on all recursors [08:43:53] elukey: looks like helm failed. 
[08:43:54] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) atlas1001.wikimedia.org on all recursors [08:43:57] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:44:12] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5215/console" [puppet] - 10https://gerrit.wikimedia.org/r/1133848 (owner: 10Elukey) [08:44:23] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas1001.wikimedia.org - ayounsi@cumin1002" [08:44:27] !incidents [08:44:27] 6020 (UNACKED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [08:44:27] 6019 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [08:44:29] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas1001.wikimedia.org - ayounsi@cumin1002" [08:44:29] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host atlas1001.wikimedia.org [08:44:33] (03Restored) 10Daniel Kinzler: Revert "EventIngress: use getDeletedPage instead of getPageStateBefore" [extensions/Wikistories] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134627 (owner: 10Daniel Kinzler) [08:45:15] FIRING: [2x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-web - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:45:15] FIRING: 
PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 2.717% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:45:33] (03CR) 10Slyngshede: [C:03+2] Release version 0.1.10 [software/bitu] - 10https://gerrit.wikimedia.org/r/1134633 (owner: 10Slyngshede) [08:45:34] elukey: so, we can't revert the patch, because the previous version of the code is broken. And we also can't deploy the patch. So the deployment branch and deployment image are now both out of sync with production.... [08:45:35] !ack 6020 [08:45:35] 6020 (ACKED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [08:46:04] !incidents [08:46:04] 6020 (ACKED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [08:46:04] 6021 (UNACKED) db1228/MariaDB read only m5 (paged) [08:46:05] 6019 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [08:46:06] kostajh: if you have ideas, let me know :P [08:46:09] checking [08:46:16] !incidents [08:46:16] 6020 (ACKED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [08:46:16] 6021 (UNACKED) db1228/MariaDB read only m5 (paged) [08:46:16] 6022 (UNACKED) db1228/mysqld processes (paged) [08:46:16] 6019 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [08:46:29] !ack 6021 [08:46:29] 6021 (ACKED) db1228/MariaDB read only m5 (paged) [08:46:31] the host rebooted itself [08:46:33] !ack 6022 [08:46:33] 6022 (ACKED) db1228/mysqld processes (paged) [08:47:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 7.094s - 
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:47:22] duesen: just to understand, scap deployed to mw-debug and then you were not able to deploy futher right? [08:47:27] now we also have an outage [08:48:10] (03Merged) 10jenkins-bot: Release version 0.1.10 [software/bitu] - 10https://gerrit.wikimedia.org/r/1134633 (owner: 10Slyngshede) [08:48:13] I think that we hit the 10 mins of wait time for mw-debug [08:48:18] since you didn't have the screen [08:48:31] NAME NAMESPACE CHART VERSION DURATION [08:48:31] next mw-debug wmf-stable/mediawiki 10m7s [08:48:47] <_joe_> yes it timed out pulling the image if i had to guess [08:48:57] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:49:00] <_joe_> duesen: can you retry? [08:49:20] with screen :D [08:49:24] _joe_: that was the second try... third time is a charm? failure mode was different the second time around. [08:49:48] I don't see timeout errors in the mw-debug eqiad events on k8s [08:49:56] <_joe_> sigh ok [08:50:05] elukey: but the ssh session was actually fine, it was just silent. so we shouldn't have it a timeout? I'm a bit confused by the log. 
https://phabricator.wikimedia.org/P74613$89 [08:50:06] !incidents [08:50:06] If I have to bet I'd say it was something that didn't work between scap and your session duesen [08:50:06] 6020 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [08:50:07] 6021 (RESOLVED) db1228/MariaDB read only m5 (paged) [08:50:07] 6022 (RESOLVED) db1228/mysqld processes (paged) [08:50:07] 6019 (RESOLVED) ProbeDown sre (10.2.2.92 ip4 mw-parsoid:4452 probes/service http_mw-parsoid_ip4 eqiad) [08:50:15] RESOLVED: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [08:50:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 5.625% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [08:50:49] duesen: after 600s/10m helm times out, I have no idea what happened in your session, but let's retry with screen/tmux [08:50:56] I'll watch events on k8s [08:51:01] _joe_: want me to retry now? the alternative would be to manually reset the deployment branch (and also kill the last image that was pushed i guess). [08:51:11] elukey: ok. 
[08:51:35] <_joe_> I'm not sure what happened, please work with elukey :) [08:51:51] so, 28mins ago ther was a failed to pull event [08:52:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 7.094s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:52:27] but then no more [08:52:47] elukey: ok, trying again [08:52:52] with screen [08:53:17] trying to match your scap logs with the error logs now [08:53:20] !log daniel@deploy1003 Started scap sync-world: Backport for [[gerrit:1134400|EventIngress: use getDeletedPage instead of getPageStateBefore (T388588 T391051)]] [08:53:24] T388588: Rename classes and methods on page related events to match the design document - https://phabricator.wikimedia.org/T388588 [08:53:24] T391051: Error: Call to undefined method MediaWiki\Page\Event\PageDeletedEvent::getPageStateBefore() - https://phabricator.wikimedia.org/T391051 [08:56:31] ok so there was an image pull error for mwdebug, and matches the new image that duesen was trying to push [08:56:57] it is the issue mentioned in T390251, so it must have caused helm to be stuck for 10mins and then fail [08:56:57] T390251: docker-registry.wikimedia.org keeps serving bad blobs - https://phabricator.wikimedia.org/T390251 [08:57:02] elukey, duesen I found a couple of things on logstash and grafana that ~8:44 UTC the DBs were overloaded, shall we cont this on -sre ? 
[08:57:31] effie: o/ there was a db host that got rebooted afaics [08:57:43] should be unrelated [08:58:48] !log daniel@deploy1003 daniel: Backport for [[gerrit:1134400|EventIngress: use getDeletedPage instead of getPageStateBefore (T388588 T391051)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:58:49] elukey: got further than it did before. We are at check-testservers now [08:58:51] T388588: Rename classes and methods on page related events to match the design document - https://phabricator.wikimedia.org/T388588 [08:58:52] T391051: Error: Call to undefined method MediaWiki\Page\Event\PageDeletedEvent::getPageStateBefore() - https://phabricator.wikimedia.org/T391051 [08:59:23] elukey: ok [08:59:42] doing a quick sanity check. I can't really test the functionality without messing with community content [09:00:42] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:00:46] elukey: let me know if I can help [09:00:52] !log push pfw policies - T390908 [09:00:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:44] effie: <3 context in https://phabricator.wikimedia.org/T390251, it has been ongoing for quite a bit sadly [09:01:59] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [09:02:21] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Requesting Kerberos access for ben.buchenau - 
https://phabricator.wikimedia.org/T390734#10716825 (10MatthewVernon) @Ben.buchenau can you confirm that you have the access you need now... [09:03:20] !log daniel@deploy1003 daniel: Continuing with sync [09:04:36] ok, sanity check went fine. Deployment in progress [09:04:53] niceee [09:05:35] (03CR) 10Filippo Giunchedi: ssl_ciphersuite: drop stretch support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1134179 (owner: 10Filippo Giunchedi) [09:06:08] (03CR) 10Vgutierrez: [C:03+1] hiera: enable TLS on volatile storage in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1134630 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [09:06:22] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T391241 (10LSobanski) 03NEW [09:06:33] (03PS1) 10Slyngshede: Switch IDM to Bitu version 0.1.10 [dns] - 10https://gerrit.wikimedia.org/r/1134637 [09:07:23] (03CR) 10Vgutierrez: ssl_ciphersuite: drop stretch support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1134179 (owner: 10Filippo Giunchedi) [09:08:53] (03PS1) 10Daniel Kinzler: ~daniel: Always run screen [puppet] - 10https://gerrit.wikimedia.org/r/1134638 [09:09:39] elukey: sync at 40% now. btw, is there a way for me to test this patch? 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134638 [09:09:47] (03CR) 10Slyngshede: [C:03+2] Switch IDM to Bitu version 0.1.10 [dns] - 10https://gerrit.wikimedia.org/r/1134637 (owner: 10Slyngshede) [09:09:54] !log slyngshede@dns1004 START - running authdns-update [09:11:48] duesen: I'd say that you could apply it manually on the deploy node, log out and log-in again to see if it worked [09:12:19] !log slyngshede@dns1004 END - running authdns-update [09:13:03] !log daniel@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134400|EventIngress: use getDeletedPage instead of getPageStateBefore (T388588 T391051)]] (duration: 19m 43s) [09:13:07] T388588: Rename classes and methods on page related events to match the design document - https://phabricator.wikimedia.org/T388588 [09:13:07] T391051: Error: Call to undefined method MediaWiki\Page\Event\PageDeletedEvent::getPageStateBefore() - https://phabricator.wikimedia.org/T391051 [09:13:11] elukey: ok, done [09:13:27] super, thanks for staying, really appreciated [09:14:15] elukey: many thanks for staying with me on this. botched deployments really freak me out. [09:14:56] it may al have been my fault for not running in screen... though the errors don't really look like it. Didn't make things better, in any case :) [09:15:42] FIRING: [3x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:16:32] What about morning backport window? :D [09:16:49] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Requesting Kerberos access for ben.buchenau - https://phabricator.wikimedia.org/T390734#10716858 (10Ben.buchenau) Yes, thank you for the effort, I have access now and it works just f... 
[09:18:23] !log disable puppet on A:cp-drmrs to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134630 (T384227) [09:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:18:25] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227 [09:18:37] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130942 (https://phabricator.wikimedia.org/T390023) (owner: 10Abijeet Patro) [09:18:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130963 (https://phabricator.wikimedia.org/T390023) (owner: 10Abijeet Patro) [09:19:21] (03CR) 10Fabfur: [C:03+2] hiera: enable TLS on volatile storage in drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1134630 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [09:20:28] (03PS1) 10Tiziano Fogli: ripe atlas anchors: recover original severity [puppet] - 10https://gerrit.wikimedia.org/r/1134639 [09:20:52] (03CR) 10CI reject: [V:04-1] ripe atlas anchors: recover original severity [puppet] - 10https://gerrit.wikimedia.org/r/1134639 (owner: 10Tiziano Fogli) [09:21:15] jouncebot: now and next [09:21:15] No deployments scheduled for the next 0 hour(s) and 38 minute(s) [09:21:28] (03CR) 10Slyngshede: [C:03+1] idp: spiderpig: Add spiderpig-access to required_groups [puppet] - 10https://gerrit.wikimedia.org/r/1134292 (https://phabricator.wikimedia.org/T383947) (owner: 10Ahmon Dancy) [09:22:20] (03PS2) 10Tiziano Fogli: ripe atlas anchors: restore original severity Also: * Adjusts alert summary, description, and runbook * Removes the corresponding Icinga alert [puppet] - 
10https://gerrit.wikimedia.org/r/1134639 [09:22:34] !log deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134630 on A:cp-drmrs (T384227) [09:22:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:44] (03CR) 10CI reject: [V:04-1] ripe atlas anchors: restore original severity Also: * Adjusts alert summary, description, and runbook * Removes the corresponding Icinga alert [puppet] - 10https://gerrit.wikimedia.org/r/1134639 (owner: 10Tiziano Fogli) [09:24:12] (03PS3) 10Tiziano Fogli: ripe atlas anchors: restore original severity Also: * Adjusts alert summary, description, and runbook * Removes the corresponding Icinga alert [puppet] - 10https://gerrit.wikimedia.org/r/1134639 [09:24:36] (03CR) 10CI reject: [V:04-1] ripe atlas anchors: restore original severity Also: * Adjusts alert summary, description, and runbook * Removes the corresponding Icinga alert [puppet] - 10https://gerrit.wikimedia.org/r/1134639 (owner: 10Tiziano Fogli) [09:25:37] (03PS4) 10Tiziano Fogli: ripe atlas anchors: restore original severity [puppet] - 10https://gerrit.wikimedia.org/r/1134639 [09:26:54] Superpes: If you want to risk it, I can deploy for you ;) I stole your window afterall... [09:28:14] duesen Lol If you have time and want, I'd be happy, otherwise I'll reschedule it :) [09:29:00] hey folks sorry [09:29:25] If there are not deployments scheduled or other windows we can keep deploying [09:29:26] (03PS5) 10Tiziano Fogli: ripe atlas anchors: restore original severity [puppet] - 10https://gerrit.wikimedia.org/r/1134639 [09:29:33] FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:29:34] I'll stay to watch in case of errors [09:31:07] elukey: thank you! 
[09:31:23] Grazie :) [09:31:44] Superpes: ah, config changes. let me have a quick look [09:31:56] I don't know anything about how the logo stuff works [09:32:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-jobrunner_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:32:54] Oh It's all automated! we just need to edit config.yalm and run tox :) [09:33:21] *config.yaml [09:33:50] yea, but I don't know what that config is supposed to look like. Anyway... [09:34:33] (03PS3) 10Filippo Giunchedi: ssl_ciphersuite: drop stretch support [puppet] - 10https://gerrit.wikimedia.org/r/1134179 [09:34:37] (03CR) 10Filippo Giunchedi: ssl_ciphersuite: drop stretch support (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1134179 (owner: 10Filippo Giunchedi) [09:35:06] Superpes: which one shall we do first? [09:35:11] Yep yep don't worry, I'll test the patch, there shouldn't be any problems since I often deal with logos :) [09:35:13] Do you have a plan for testing these? [09:35:34] You can also do them together if you want! They are both "simple" [09:35:55] (03CR) 10Filippo Giunchedi: [C:03+1] ripe atlas anchors: restore original severity [puppet] - 10https://gerrit.wikimedia.org/r/1134639 (owner: 10Tiziano Fogli) [09:36:03] can scap backport do multiple patches at once? I have never tried that [09:36:11] yes [09:36:18] nice [09:36:20] even config changes and backports together, I believe [09:36:31] * duesen is impressed [09:36:39] we often use it to speed up the backport windows :) [09:37:04] ok, let's do it [09:37:36] Superpes: double-check: you want these: [09:37:38] [pswiki] Change the logo and wordmark/tagline [09:37:45] [tawiki] Enable translator usergroup and only allows translator to use ContentTranslation [09:37:47] yes? 
[09:37:52] Yep :) [09:37:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by daniel@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031963 (https://phabricator.wikimedia.org/T360851) (owner: 10Superpes15) [09:37:59] (03CR) 10TrainBranchBot: [C:03+2] "Approved by daniel@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134394 (https://phabricator.wikimedia.org/T391171) (owner: 10Superpes15) [09:38:01] ok [09:38:52] (03Merged) 10jenkins-bot: [pswiki] Change the logo and wordmark/tagline [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031963 (https://phabricator.wikimedia.org/T360851) (owner: 10Superpes15) [09:38:56] (03Merged) 10jenkins-bot: [tawiki] Enable translator usergroup and only allows translator to use ContentTranslation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134394 (https://phabricator.wikimedia.org/T391171) (owner: 10Superpes15) [09:39:08] !log daniel@deploy1003 Started scap sync-world: Backport for [[gerrit:1031963|[pswiki] Change the logo and wordmark/tagline (T360851)]], [[gerrit:1134394|[tawiki] Enable translator usergroup and only allows translator to use ContentTranslation (T391171)]] [09:39:12] T360851: Change the current Pashto Wikipedia wordmark and tagline on new vector/mobile skin - https://phabricator.wikimedia.org/T360851 [09:39:13] T391171: Creation of a translator user group on tawikipedia - https://phabricator.wikimedia.org/T391171 [09:40:07] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 06Data-Engineering-Radar, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Requesting Kerberos access for ben.buchenau - https://phabricator.wikimedia.org/T390734#10716957 (10MatthewVernon) 05In progress→03Resolved Great, thanks, I'll close this tic... 
[09:40:24] (03CR) 10Vgutierrez: [C:03+1] profile::service_proxy::envoy: add data-gateway-staging [puppet] - 10https://gerrit.wikimedia.org/r/1133848 (owner: 10Elukey) [09:43:48] (03PS2) 10Daniel Kinzler: ~daniel: Always run screen [puppet] - 10https://gerrit.wikimedia.org/r/1134638 [09:44:09] (03CR) 10Elukey: [V:03+1 C:03+2] profile::service_proxy::envoy: add data-gateway-staging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133848 (owner: 10Elukey) [09:44:24] !log daniel@deploy1003 superpes, daniel: Backport for [[gerrit:1031963|[pswiki] Change the logo and wordmark/tagline (T360851)]], [[gerrit:1134394|[tawiki] Enable translator usergroup and only allows translator to use ContentTranslation (T391171)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:44:28] T360851: Change the current Pashto Wikipedia wordmark and tagline on new vector/mobile skin - https://phabricator.wikimedia.org/T360851 [09:44:28] T391171: Creation of a translator user group on tawikipedia - https://phabricator.wikimedia.org/T391171 [09:44:37] Just a minute to test everything properly [09:45:42] FIRING: [3x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [09:46:03] Superpes: can you test your changes now? 
[09:46:11] Ah sorry, you are on it [09:47:17] (03CR) 10Btullis: [C:03+2] Reduce the verbosity of pgbouncer logs in airflow deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134283 (https://phabricator.wikimedia.org/T362788) (owner: 10Btullis) [09:48:47] duesen Everything looks good to me :) [09:48:47] (03PS1) 10Elukey: services: use the data-gw staging endpoint in commons-impact-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134647 [09:48:59] (03CR) 10Ayounsi: ripe atlas anchors: restore original severity (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1134639 (owner: 10Tiziano Fogli) [09:49:07] !log daniel@deploy1003 superpes, daniel: Continuing with sync [09:49:24] let's go, then! [09:49:35] (03Merged) 10jenkins-bot: Reduce the verbosity of pgbouncer logs in airflow deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134283 (https://phabricator.wikimedia.org/T362788) (owner: 10Btullis) [09:52:16] (03Abandoned) 10Daniel Kinzler: Revert "EventIngress: use getDeletedPage instead of getPageStateBefore" [extensions/Wikistories] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134627 (owner: 10Daniel Kinzler) [09:54:47] ...deployment at 50%... [09:57:40] !log daniel@deploy1003 Finished scap sync-world: Backport for [[gerrit:1031963|[pswiki] Change the logo and wordmark/tagline (T360851)]], [[gerrit:1134394|[tawiki] Enable translator usergroup and only allows translator to use ContentTranslation (T391171)]] (duration: 18m 31s) [09:57:43] T360851: Change the current Pashto Wikipedia wordmark and tagline on new vector/mobile skin - https://phabricator.wikimedia.org/T360851 [09:57:44] T391171: Creation of a translator user group on tawikipedia - https://phabricator.wikimedia.org/T391171 [09:58:16] (03PS1) 10Fabfur: hiera: enable TLS on volatile storage in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1134648 (https://phabricator.wikimedia.org/T384227) [09:58:22] Superpes: all done! 
Please double-check that everything is fine. [09:58:22] (03PS2) 10Jelto: Ceph: add types for S3 credential and account [puppet] - 10https://gerrit.wikimedia.org/r/1133916 (https://phabricator.wikimedia.org/T378922) [09:59:05] Yep confirmed!!! Many thanks for your assistance and time duesen :D [09:59:33] No problem, and sorry for the delay ;) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250407T1000) [10:01:56] nice :) [10:02:27] (03CR) 10Jelto: "@mvernon I added Bens suggestion in patchset 2. Does the length constraint of 20 and 40 characters apply for the GitLab credentials as wel" [puppet] - 10https://gerrit.wikimedia.org/r/1133916 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [10:03:08] (03CR) 10Elukey: [C:03+1] cumin: Update insetup role report [puppet] - 10https://gerrit.wikimedia.org/r/1134632 (https://phabricator.wikimedia.org/T389825) (owner: 10Volans) [10:03:31] (03PS2) 10Cathal Mooney: Cloudsw: adjust routing-policies to reflect change to IBGP [homer/public] - 10https://gerrit.wikimedia.org/r/1134234 (https://phabricator.wikimedia.org/T389958) [10:06:25] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5216/co" [puppet] - 10https://gerrit.wikimedia.org/r/1133961 (https://phabricator.wikimedia.org/T389329) (owner: 10Volans) [10:06:30] (03CR) 10Vgutierrez: [C:03+1] ssl_ciphersuite: drop stretch support [puppet] - 10https://gerrit.wikimedia.org/r/1134179 (owner: 10Filippo Giunchedi) [10:07:15] (03CR) 10Elukey: [V:03+1 C:03+1] "\o/" [puppet] - 10https://gerrit.wikimedia.org/r/1133961 (https://phabricator.wikimedia.org/T389329) (owner: 10Volans) [10:08:31] (03CR) 10Elukey: "recheck" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131666 (owner: 10Elukey) [10:09:00] (03PS6) 10Tiziano Fogli: ripe atlas anchors: restore original severity 
[puppet] - 10https://gerrit.wikimedia.org/r/1134639 [10:09:19] (03CR) 10Tiziano Fogli: ripe atlas anchors: restore original severity (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1134639 (owner: 10Tiziano Fogli) [10:12:23] (03PS6) 10Elukey: Release version 4.0.4 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131666 [10:12:38] (03CR) 10Ayounsi: [C:03+1] ripe atlas anchors: restore original severity [puppet] - 10https://gerrit.wikimedia.org/r/1134639 (owner: 10Tiziano Fogli) [10:12:57] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:13:01] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [10:14:37] (03CR) 10Jelto: [C:03+1] "lgtm now, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1133992 (https://phabricator.wikimedia.org/T390948) (owner: 10Dzahn) [10:14:38] (03CR) 10Tiziano Fogli: [C:03+2] ripe atlas anchors: restore original severity [puppet] - 10https://gerrit.wikimedia.org/r/1134639 (owner: 10Tiziano Fogli) [10:15:08] (03CR) 10CI reject: [V:04-1] Release version 4.0.4 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131666 (owner: 10Elukey) [10:22:38] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10717048 (10BTullis) a:05BTullis→03Jclark-ctr I have carried out the same preparation process on the three hosts in... 
[10:22:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10717051 (10BTullis) [10:23:39] jouncebot: nowandnext [10:23:39] For the next 0 hour(s) and 36 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250407T1000) [10:23:39] In 2 hour(s) and 36 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250407T1300) [10:29:36] (03CR) 10Ladsgroup: [C:03+2] Revert "Take 2: Large math formulae should be scrollable" [extensions/Math] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134391 (https://phabricator.wikimedia.org/T201233) (owner: 10Jdlrobson) [10:30:30] (03PS1) 10Ladsgroup: Bump thumbnail steps to 70% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134651 (https://phabricator.wikimedia.org/T360589) [10:31:13] (03CR) 10Ladsgroup: [C:03+2] Bump thumbnail steps to 70% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134651 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:31:52] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134651 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:32:01] (03Merged) 10jenkins-bot: Bump thumbnail steps to 70% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134651 (https://phabricator.wikimedia.org/T360589) (owner: 10Ladsgroup) [10:32:14] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1134651|Bump thumbnail steps to 70% (T360589)]] [10:32:17] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:32:28] (03PS7) 10Elukey: Release version 4.0.4 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131666 [10:37:14] !log ladsgroup@deploy1003 
ladsgroup: Backport for [[gerrit:1134651|Bump thumbnail steps to 70% (T360589)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:39:39] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-product: apply [10:39:48] !log ladsgroup@deploy1003 ladsgroup: Continuing with sync [10:40:10] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-product: apply [10:40:43] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-main: apply [10:41:13] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-main: apply [10:42:24] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-search: apply [10:42:32] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-search: apply [10:42:54] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-wmde: apply [10:43:01] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-wmde: apply [10:43:12] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:43:23] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-analytics-test: apply [10:43:32] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-ml: apply [10:43:39] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-ml: apply [10:43:48] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-research: apply [10:43:59] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-research: apply [10:45:24] (03Merged) 10jenkins-bot: Revert "Take 2: Large math formulae should be scrollable" [extensions/Math] (wmf/1.44.0-wmf.23) - 
10https://gerrit.wikimedia.org/r/1134391 (https://phabricator.wikimedia.org/T201233) (owner: 10Jdlrobson) [10:45:46] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134648 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [10:46:37] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134651|Bump thumbnail steps to 70% (T360589)]] (duration: 14m 22s) [10:46:39] T360589: De-fragment thumbnail sizes in mediawiki - https://phabricator.wikimedia.org/T360589 [10:47:35] !log ladsgroup@deploy1003 Started scap sync-world: Backport for [[gerrit:1134391|Revert "Take 2: Large math formulae should be scrollable" (T201233)]] [10:47:37] T201233: Long math output unreadable on small screens due to scrolling off the side of screen - https://phabricator.wikimedia.org/T201233 [10:50:12] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-platform-eng: apply [10:50:19] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-platform-eng: apply [10:52:27] (03Abandoned) 10Btullis: global_config: add dpe cephosd to external services [puppet] - 10https://gerrit.wikimedia.org/r/1099694 (https://phabricator.wikimedia.org/T381264) (owner: 10Stevemunene) [10:53:02] !log ladsgroup@deploy1003 jdlrobson, ladsgroup: Backport for [[gerrit:1134391|Revert "Take 2: Large math formulae should be scrollable" (T201233)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:53:05] T201233: Long math output unreadable on small screens due to scrolling off the side of screen - https://phabricator.wikimedia.org/T201233 [10:53:38] !log ladsgroup@deploy1003 jdlrobson, ladsgroup: Continuing with sync [10:56:40] (03CR) 10Vgutierrez: [C:03+1] "I'm assuming you're unifying hiera settings in an upcoming CR" [puppet] - 10https://gerrit.wikimedia.org/r/1134648 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [10:57:35] (03CR) 10Fabfur: 
"tnx" [puppet] - 10https://gerrit.wikimedia.org/r/1134648 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [10:58:05] !log disable puppet on A:cp-eqiad to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134648 (T384227) [10:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:08] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227 [10:58:46] (03CR) 10Fabfur: [C:03+2] hiera: enable TLS on volatile storage in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1134648 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [10:59:33] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 6 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#10717142 (10Ladsgroup) It'd be nice to add this to next week's tech news. Worth mentioning this has been requested 12 years ago (at least) [10:59:38] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2056-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [11:00:43] !log deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134648 on A:cp-eqiad (T384227) [11:00:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:47] !log ladsgroup@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134391|Revert "Take 2: Large math formulae should be scrollable" (T201233)]] (duration: 13m 12s) [11:00:49] T201233: Long math output unreadable on small screens due to scrolling off the side of screen - https://phabricator.wikimedia.org/T201233 [11:02:58] (03CR) 10Filippo Giunchedi: "check experimental" 
[puppet] - 10https://gerrit.wikimedia.org/r/1134179 (owner: 10Filippo Giunchedi) [11:04:38] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2056-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [11:05:40] (03CR) 10Filippo Giunchedi: [C:03+2] ssl_ciphersuite: drop stretch support [puppet] - 10https://gerrit.wikimedia.org/r/1134179 (owner: 10Filippo Giunchedi) [11:07:14] jouncebot: now and next [11:07:14] No deployments scheduled for the next 1 hour(s) and 52 minute(s) [11:08:22] (03CR) 10Filippo Giunchedi: [C:03+2] ruby: move to .exist? [puppet] - 10https://gerrit.wikimedia.org/r/1134189 (https://phabricator.wikimedia.org/T391083) (owner: 10Filippo Giunchedi) [11:09:12] (03PS1) 10Jelto: wikidata-query-gui: add gateway route / for gui services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134656 (https://phabricator.wikimedia.org/T350793) [11:11:15] mmhh my ssh to puppetserver1001 dropped in the middle of puppet-merge, I'll merge another patch to make sure things have converged [11:11:36] or if anyone has a patch they are about to merge that works too [11:11:54] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5217/console" [puppet] - 10https://gerrit.wikimedia.org/r/1128780 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [11:12:13] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:12:17] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] alertmanager: replace prometheus_all_nodes [puppet] - 10https://gerrit.wikimedia.org/r/1128780 (https://phabricator.wikimedia.org/T389170) (owner: 10Filippo Giunchedi) [11:15:03] (03CR) 10Btullis: [C:03+1] "Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/1133853 (https://phabricator.wikimedia.org/T390875) (owner: 10Stevemunene) [11:15:58] (03CR) 10Stevemunene: [C:03+2] hdfs: Remove disk space checks for hadoop worker [puppet] - 10https://gerrit.wikimedia.org/r/1133853 (https://phabricator.wikimedia.org/T390875) (owner: 10Stevemunene) [11:20:16] 06SRE, 07SRE-Unowned, 06serviceops-radar, 10wikitech.wikimedia.org, 13Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10717183 (10Volans) Thanks for the fixes @Andrew, I did another pass, replying to my own comments inline: >>! In T376400#10694816, @Volans wrote: >... [11:20:34] (03PS1) 10Fabfur: hiera: enable TLS on volatile storage in esams [puppet] - 10https://gerrit.wikimedia.org/r/1134658 (https://phabricator.wikimedia.org/T384227) [11:24:57] (03CR) 10Ladsgroup: "I think this is for serviceops to handle." 
[puppet] - 10https://gerrit.wikimedia.org/r/1133995 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15) [11:25:58] !log enable EBGP between cr1-eqiad and cloudsw1-c8-eqiad (IPv6 / cloud vrf) T389958 [11:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:00] T389958: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958 [11:26:06] (03PS1) 10Novem Linguae: InitializeSettings: add wgSecurePollEditOtherWikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134660 (https://phabricator.wikimedia.org/T384302) [11:38:35] !log enable EBGP between cr2-eqiad and cloudsw1-d5-eqiad (IPv6 / cloud vrf) T389958 [11:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:38] T389958: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958 [11:38:58] (03PS1) 10Ladsgroup: mainstash: Disable multiPrimaryMode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134663 (https://phabricator.wikimedia.org/T389893) [11:39:56] (03PS1) 10Ayounsi: Add BFD down alerting [alerts] - 10https://gerrit.wikimedia.org/r/1134664 (https://phabricator.wikimedia.org/T388641) [11:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:45:07] (03PS1) 10Marostegui: db1176: Add note [puppet] - 10https://gerrit.wikimedia.org/r/1134666 (https://phabricator.wikimedia.org/T390034) [11:50:05] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134658 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [11:53:17] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): an-worker1202 is in the private vlan instead of the analytics vlan - 
https://phabricator.wikimedia.org/T390048#10717269 (10BTullis) a:05Papaul→03Jclark-ctr [11:53:19] FIRING: [2x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-c8-eqiad and cr1-eqiad (2a02:ec80:a000:fe01::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [11:56:52] (03CR) 10Filippo Giunchedi: [C:03+1] "LGTM!" [alerts] - 10https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) (owner: 10Tiziano Fogli) [11:57:42] !log btullis@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on an-worker1169.eqiad.wmnet with reason: Moving to rack F8 [11:57:48] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10717288 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=fe0295af-afe6-483e-932a-696ad85270c1) set b... 
[11:59:56] (03PS2) 10Fabfur: hiera: enable TLS on volatile storage in esams [puppet] - 10https://gerrit.wikimedia.org/r/1134658 (https://phabricator.wikimedia.org/T384227) [12:00:19] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134658 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [12:01:10] (03PS1) 10Btullis: Move an-worker1169 from rack F6 to F8 [puppet] - 10https://gerrit.wikimedia.org/r/1134670 (https://phabricator.wikimedia.org/T390169) [12:03:19] FIRING: [2x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-c8-eqiad and cr1-eqiad (2a02:ec80:a000:fe01::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [12:14:01] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [12:18:19] RESOLVED: [2x] CloudCoreBGPDown: Cloud (WMCS) BGP session down between cloudsw1-c8-eqiad and cr1-eqiad (2a02:ec80:a000:fe01::1) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCloudCoreBGPDown [12:18:21] (03CR) 10Cathal Mooney: [C:03+1] gNMIc set retry to 1 minute [puppet] - 10https://gerrit.wikimedia.org/r/1133328 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:22:04] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [12:24:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [12:26:14] (03PS2) 10Ayounsi: gNMIc set retry to 5 minute 
[puppet] - 10https://gerrit.wikimedia.org/r/1133328 (https://phabricator.wikimedia.org/T388641) [12:26:45] (03CR) 10Ayounsi: [C:03+2] gNMIc set retry to 5 minute [puppet] - 10https://gerrit.wikimedia.org/r/1133328 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:31:51] (03PS2) 10Jelto: wikidata-query-gui: add gateway route / for gui services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134656 (https://phabricator.wikimedia.org/T350793) [12:31:51] (03CR) 10Jelto: "I'll test this in staging" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134656 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [12:32:23] !log cloudsw1-c8-eqiad: add routes for WMCS OpenStack IPv6 aggregate to cloudgw VIP T389958 [12:32:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:26] T389958: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958 [12:34:30] FIRING: [2x] Emergency syslog message: Alert for device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [12:35:51] !log cloudsw1-d5-eqiad: add routes for WMCS OpenStack IPv6 aggregate to cloudgw VIP T389958 [12:35:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:00] (03CR) 10Jelto: [C:03+2] wikidata-query-gui: add gateway route / for gui services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134656 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [12:37:13] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [12:37:51] (03Merged) 10jenkins-bot: wikidata-query-gui: add gateway route / for gui services [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1134656 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [12:39:30] RESOLVED: [2x] Emergency syslog message: Device cloudsw1-d5-eqiad.mgmt.eqiad.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [12:42:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [12:42:24] 06SRE, 06Infrastructure-Foundations, 10netops, 07IPv6, 13Patch-For-Review: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10717397 (10cmooney) [12:47:54] !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [12:48:29] !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [12:49:22] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1045.eqiad.wmnet [12:49:28] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2046.codfw.wmnet [12:51:34] (03CR) 10Jelto: [C:03+2] "`" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134656 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [12:52:11] 06SRE, 06Infrastructure-Foundations, 10netops, 07IPv6, 13Patch-For-Review: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10717450 (10cmooney) 05Open→03Resolved Thankfully all works are now in place for this, after a few little blips on the way. The... [12:52:30] (03CR) 10Fabfur: [C:03+1] "lgtm!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1134197 (https://phabricator.wikimedia.org/T389212) (owner: 10Vgutierrez) [12:52:45] (03CR) 10Cathal Mooney: [C:03+1] gNMIc set retry to 5 minute [puppet] - 10https://gerrit.wikimedia.org/r/1133328 (https://phabricator.wikimedia.org/T388641) (owner: 10Ayounsi) [12:52:48] jouncebot: now and next [12:52:48] No deployments scheduled for the next 0 hour(s) and 7 minute(s) [12:52:55] jouncebot: next [12:52:55] In 0 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250407T1300) [12:54:47] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10717474 (10elukey) @Jhancock.wm sorry I was afk! Please do it anytime, the host is not serving prod traffic. [12:55:23] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1045.eqiad.wmnet [12:56:26] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2046.codfw.wmnet [13:00:05] Lucas_WMDE, Urbanecm, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250407T1300). [13:00:05] James_F and cscott: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:06] cscott: Did you want to deploy your things yourself? Mine are pretty trivial. [13:01:23] I try to let the experts handle the deploy if at all possible; I'm very rusty [13:01:37] Ha, OK, I can deploy. Do you want to do them apart? [13:01:37] (03CR) 10Ssingh: "Looks good! There might be value in splitting this up into two patches. 
Basically, add esams in one and then unify all of them in one comm" [puppet] - 10https://gerrit.wikimedia.org/r/1134658 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [13:02:31] o/ [13:02:34] cscott: The v3 change should probably go out on its own, I guess? [13:03:41] v3 change should only affect parsoid read views wikis. And the other config patch only affects mobile views of parsoid read views wikis. [13:03:53] OK, let's do them together. [13:04:01] They could go together if that's easier for you, yeah [13:04:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134106 (https://phabricator.wikimedia.org/T390420) (owner: 10C. Scott Ananian) [13:04:06] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134107 (https://phabricator.wikimedia.org/T376048) (owner: 10C. Scott Ananian) [13:04:15] It saves about 20 minutes of waiting around. [13:04:44] yep [13:04:53] I'll get my test cases ready [13:04:55] (03Merged) 10jenkins-bot: Shift to Parsoid Fragment support v3 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134106 (https://phabricator.wikimedia.org/T390420) (owner: 10C. Scott Ananian) [13:04:58] (03Merged) 10jenkins-bot: Where Parsoid Read Views are the default, use it for MFE as well [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134107 (https://phabricator.wikimedia.org/T376048) (owner: 10C. 
Scott Ananian) [13:05:15] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1134106|Shift to Parsoid Fragment support v3 (T390420)]], [[gerrit:1134107|Where Parsoid Read Views are the default, use it for MFE as well (T376048 T374578)]] [13:05:20] T390420: "indicator" tag not parsed properly - https://phabricator.wikimedia.org/T390420 [13:05:21] T376048: MFE still have issues with Parsoid Read Views on talk pages (Discussion Tools) - Type Error: startMarker is null - https://phabricator.wikimedia.org/T376048 [13:05:21] T374578: Bug: MobileFrontend wraps entire article with semantically incorrect mf-section-0 element. - https://phabricator.wikimedia.org/T374578 [13:06:32] (03CR) 10Effie Mouzeli: [C:03+1] Create insetup role for ServiceOps with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1133927 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [13:07:46] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2047.codfw.wmnet [13:07:52] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1046.eqiad.wmnet [13:10:26] (03PS1) 10Jelto: wikidata-query-gui: add gateway route "/" for query-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134686 (https://phabricator.wikimedia.org/T350793) [13:10:35] !log jforrester@deploy1003 jforrester, cscott: Backport for [[gerrit:1134106|Shift to Parsoid Fragment support v3 (T390420)]], [[gerrit:1134107|Where Parsoid Read Views are the default, use it for MFE as well (T376048 T374578)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:10:38] cscott: Now live on mwdebug; please check. 
[13:10:39] T390420: "indicator" tag not parsed properly - https://phabricator.wikimedia.org/T390420 [13:10:40] T376048: MFE still have issues with Parsoid Read Views on talk pages (Discussion Tools) - Type Error: startMarker is null - https://phabricator.wikimedia.org/T376048 [13:10:40] T374578: Bug: MobileFrontend wraps entire article with semantically incorrect mf-section-0 element. - https://phabricator.wikimedia.org/T374578 [13:10:44] (03Abandoned) 10Fabfur: hiera: enable TLS on volatile storage in esams [puppet] - 10https://gerrit.wikimedia.org/r/1134658 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [13:11:30] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main1001 / kafka-main1002 / kafka-main1003 / kafka-main1004 / kafka-main1005 - https://phabricator.wikimedia.org/T381593#10717634 (10jijiki) a:05jijiki→03VRiley-WMF [13:11:42] (03CR) 10Jforrester: [C:03+2] Improve GeoCrumbs fallback when page property is not (yet) set [extensions/GeoCrumbs] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134309 (https://phabricator.wikimedia.org/T391128) (owner: 10C. Scott Ananian) [13:12:20] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main1001 / kafka-main1002 / kafka-main1003 / kafka-main1004 / kafka-main1005 - https://phabricator.wikimedia.org/T381593#10717639 (10jijiki) @VRiley-WMF sorry this fell through the cracks, kafka-main1005 has been dec... [13:13:07] James_F: ok, testing [13:13:39] (this isn't the geocrumbs patch, just the other two right?) [13:13:53] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1046.eqiad.wmnet [13:14:23] cscott: Yeah, I'm just getting CI started on that as it'll take a while. 
[13:14:23] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2047.codfw.wmnet [13:14:31] (03Merged) 10jenkins-bot: Improve GeoCrumbs fallback when page property is not (yet) set [extensions/GeoCrumbs] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134309 (https://phabricator.wikimedia.org/T391128) (owner: 10C. Scott Ananian) [13:15:06] … and of course because I said that, CI decided to be fast today instead. Oh well. [13:15:11] (03PS1) 10Btullis: Update the ceph-csi container images to use 18.2.4 package versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134688 (https://phabricator.wikimedia.org/T389184) [13:16:01] (03CR) 10Jelto: [C:03+2] "Similar to Ia18c060827d879e6c9db8e927927f1a20622d402. I'll test this in staging:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134686 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:16:11] (03PS1) 10Fabfur: hiera: enable TLS on volatile storage in esams [puppet] - 10https://gerrit.wikimedia.org/r/1134689 (https://phabricator.wikimedia.org/T384227) [13:16:13] v3 fragments looks fine, testing MFE config change [13:16:21] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134689 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [13:16:25] Excellent. 
[13:16:49] (03PS5) 10Ssingh: hiera: acme_chief: add wikimedia-ech.org [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) [13:17:47] (03Merged) 10jenkins-bot: wikidata-query-gui: add gateway route "/" for query-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134686 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:17:59] (03CR) 10Ssingh: [C:03+2] hiera: acme_chief: add wikimedia-ech.org [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [13:18:47] James_F: looks good, ok to continue [13:18:51] !log jforrester@deploy1003 jforrester, cscott: Continuing with sync [13:18:54] (03CR) 10Ssingh: [C:03+2] hiera: acme_chief: add wikimedia-ech.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133190 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [13:18:56] Thanks. [13:19:06] (03PS2) 10Effie Mouzeli: Apache config for arbcom_plwiki [puppet] - 10https://gerrit.wikimedia.org/r/1133995 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15) [13:19:15] !log jelto@deploy1003 helmfile [staging] START helmfile.d/services/wikidata-query-gui: apply [13:19:26] !log jelto@deploy1003 helmfile [staging] DONE helmfile.d/services/wikidata-query-gui: apply [13:19:35] !log jelto@deploy1003 helmfile [eqiad] START helmfile.d/services/wikidata-query-gui: apply [13:20:07] !log jelto@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikidata-query-gui: apply [13:20:17] !log jelto@deploy1003 helmfile [codfw] START helmfile.d/services/wikidata-query-gui: apply [13:20:46] !log jelto@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikidata-query-gui: apply [13:21:35] !log sukhe@cumin1002 START - Cookbook sre.cdn.roll-upgrade-ats Rolling upgrade/restart of Apache Traffic Server on P{cp4037.*} and A:cp for 9.2.10-1wm1 [13:21:59] !log P{cp4037.*} and A:cp for 9.2.10-1wm1 T390912 [13:22:02] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:22:02] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [13:24:32] (03CR) 10Clément Goubert: [C:03+1] Create insetup role for ServiceOps with nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1133927 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [13:24:40] (03PS1) 10Lucas Werkmeister (WMDE): Fix EntitySchema propertyType on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134691 (https://phabricator.wikimedia.org/T371196) [13:24:42] (03PS1) 10Lucas Werkmeister (WMDE): Fix EntitySchema propertyType on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134692 (https://phabricator.wikimedia.org/T371196) [13:24:44] (03PS1) 10Lucas Werkmeister (WMDE): Remove unused EntitySchema config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134693 (https://phabricator.wikimedia.org/T371196) [13:24:45] (03PS1) 10Majavah: network: Add v6 cloud-private addresses [puppet] - 10https://gerrit.wikimedia.org/r/1134694 (https://phabricator.wikimedia.org/T379282) [13:24:55] !log sukhe@cumin1002 END (PASS) - Cookbook sre.cdn.roll-upgrade-ats (exit_code=0) Rolling upgrade/restart of Apache Traffic Server on P{cp4037.*} and A:cp for 9.2.10-1wm1 [13:25:14] (03CR) 10CI reject: [V:04-1] Remove unused EntitySchema config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134693 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [13:25:26] (03CR) 10Lucas Werkmeister (WMDE): [C:04-2] "Don’t deploy before this has been announced and all." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134692 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [13:26:03] (03CR) 10Lucas Werkmeister (WMDE): "recheck (I pushed a new version of the EntitySchema change)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134693 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [13:26:10] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134106|Shift to Parsoid Fragment support v3 (T390420)]], [[gerrit:1134107|Where Parsoid Read Views are the default, use it for MFE as well (T376048 T374578)]] (duration: 20m 54s) [13:26:14] (03CR) 10Clément Goubert: [C:03+1] Release version 4.0.4 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131666 (owner: 10Elukey) [13:26:15] T390420: "indicator" tag not parsed properly - https://phabricator.wikimedia.org/T390420 [13:26:15] T376048: MFE still have issues with Parsoid Read Views on talk pages (Discussion Tools) - Type Error: startMarker is null - https://phabricator.wikimedia.org/T376048 [13:26:16] T374578: Bug: MobileFrontend wraps entire article with semantically incorrect mf-section-0 element. 
- https://phabricator.wikimedia.org/T374578 [13:26:47] (03CR) 10Stevemunene: [C:03+1] "lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134688 (https://phabricator.wikimedia.org/T389184) (owner: 10Btullis) [13:27:56] 2 down 1 to go [13:28:00] (03CR) 10Btullis: [C:03+2] Update the ceph-csi container images to use 18.2.4 package versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134688 (https://phabricator.wikimedia.org/T389184) (owner: 10Btullis) [13:28:36] <_joe_> jouncebot: now [13:28:36] For the next 0 hour(s) and 31 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250407T1300) [13:29:28] <_joe_> James_F: ping me once you're done with the deployments [13:29:33] FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:29:34] _joe_: Will do. 
[13:29:40] <_joe_> thanks <3 [13:30:22] (03PS1) 10Kamila Součková: alertmanager: add task receivers for 4 teams [puppet] - 10https://gerrit.wikimedia.org/r/1134696 (https://phabricator.wikimedia.org/T385782) [13:30:46] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1134309|Improve GeoCrumbs fallback when page property is not (yet) set (T391128)]] [13:30:49] T391128: Missing GeoCrumbs bread crumbs on wikivoyage - https://phabricator.wikimedia.org/T391128 [13:32:03] !log depool cp4037: reverting to ATS 9.2.9 [13:32:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:32:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-jobrunner_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:32:48] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5218/co" [puppet] - 10https://gerrit.wikimedia.org/r/1134694 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [13:32:57] (03CR) 10Btullis: [C:03+2] Move an-worker1169 from rack F6 to F8 [puppet] - 10https://gerrit.wikimedia.org/r/1134670 (https://phabricator.wikimedia.org/T390169) (owner: 10Btullis) [13:33:04] (03Merged) 10jenkins-bot: Update the ceph-csi container images to use 18.2.4 package versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134688 (https://phabricator.wikimedia.org/T389184) (owner: 10Btullis) [13:34:05] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10717744 (10ayounsi) p:05Low→03High Bumping the priority back up on this one as the interface keeps flapping. {F59004138} {F59004137} @RobH can... 
[13:34:22] !log sudo -i reprepro remove bullseye-wikimedia trafficserver [13:34:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:28] !log sudo -i reprepro remove bullseye-wikimedia trafficserver: T390912 [13:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:34:30] T390912: Upgrade to ATS 9.2.10 - https://phabricator.wikimedia.org/T390912 [13:35:13] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [13:35:59] (03CR) 10Tiziano Fogli: [C:03+2] perf/navtiming: migrate alerts from grafana to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) (owner: 10Tiziano Fogli) [13:36:02] (03PS1) 10Jelto: trafficserver: switch querybuilder scholarly to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1134697 (https://phabricator.wikimedia.org/T350793) [13:36:04] !log reprepro -C main include bullseye-wikimedia trafficserver_9.2.9-1wm1_amd64.changes: T390912 [13:36:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:12] !log taavi@cumin1002 START - Cookbook sre.dns.wipe-cache cloudsw-b1.private.codfw.wikimedia.cloud on codfw recursors [13:36:13] !log taavi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) cloudsw-b1.private.codfw.wikimedia.cloud on codfw recursors [13:36:19] !log jforrester@deploy1003 jforrester, cscott: Backport for [[gerrit:1134309|Improve GeoCrumbs fallback when page property is not (yet) set (T391128)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:36:22] T391128: Missing GeoCrumbs bread crumbs on wikivoyage - https://phabricator.wikimedia.org/T391128 [13:36:23] cscott: Now please test your GeoCrumbs patch in debug. [13:36:28] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:36:35] on it [13:37:45] (03CR) 10Jelto: [V:03+1] "404 issues should be fixed now, see T350793#10717725. 
Let's do another traffic switch test with query-scholarly." [puppet] - 10https://gerrit.wikimedia.org/r/1134697 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [13:38:20] James_F: looks good to me, ok to continue [13:38:23] !log jforrester@deploy1003 Sync cancelled. [13:38:30] Bah, wrong key stroke. [13:38:34] (03Merged) 10jenkins-bot: perf/navtiming: migrate alerts from grafana to alertmanager [alerts] - 10https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) (owner: 10Tiziano Fogli) [13:38:39] That's a pain. [13:38:44] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1134309|Improve GeoCrumbs fallback when page property is not (yet) set (T391128)]] [13:39:00] A tenth of a second too fast on the before the [13:39:49] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main1001 / kafka-main1002 / kafka-main1003 / kafka-main1004 / kafka-main1005 - https://phabricator.wikimedia.org/T381593#10717766 (10VRiley-WMF) [13:40:37] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware, 06serviceops: decommission kafka-main1001 / kafka-main1002 / kafka-main1003 / kafka-main1004 / kafka-main1005 - https://phabricator.wikimedia.org/T381593#10717771 (10VRiley-WMF) 05Open→03Resolved [13:40:39] that's why i leave this to the professionals. ;-p [13:40:44] Yeah yeah. 
[13:41:03] (03CR) 10Ssingh: [C:03+1] hiera: enable TLS on volatile storage in esams [puppet] - 10https://gerrit.wikimedia.org/r/1134689 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [13:42:47] (03CR) 10Fabfur: "tnx" [puppet] - 10https://gerrit.wikimedia.org/r/1134689 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [13:42:49] (03CR) 10Fabfur: [C:03+2] hiera: enable TLS on volatile storage in esams [puppet] - 10https://gerrit.wikimedia.org/r/1134689 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [13:43:22] !log disable puppet on A:cp-esams to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134689 (T384227) [13:43:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:43:25] T384227: Private TLS material (TLS keys) should be stored in volatile storage only - https://phabricator.wikimedia.org/T384227 [13:44:06] !log jforrester@deploy1003 jforrester, cscott: Backport for [[gerrit:1134309|Improve GeoCrumbs fallback when page property is not (yet) set (T391128)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:44:09] T391128: Missing GeoCrumbs bread crumbs on wikivoyage - https://phabricator.wikimedia.org/T391128 [13:44:20] !log deploying https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134689 on A:cp-esams (T384227) [13:44:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:51] James_F: still works [13:44:54] !log jforrester@deploy1003 jforrester, cscott: Continuing with sync [13:44:59] * James_F assumed as much. :-) [13:45:42] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:46:23] (03CR) 10Cathal Mooney: [C:03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1134694 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [13:47:55] (03CR) 10Kamila Součková: "I created these based on staring at Phab, please let me know if you'd prefer different tags for your team's autocreated tasks." [puppet] - 10https://gerrit.wikimedia.org/r/1134696 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková) [13:49:27] (03CR) 10Majavah: [V:03+1 C:03+2] network: Add v6 cloud-private addresses [puppet] - 10https://gerrit.wikimedia.org/r/1134694 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [13:49:48] (03CR) 10Clément Goubert: [C:03+1] alertmanager: add task receivers for 4 teams [puppet] - 10https://gerrit.wikimedia.org/r/1134696 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková) [13:50:39] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134696 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková) [13:51:42] (03PS1) 10Fabfur: hiera: cleanup TLS on volatile storage custom files [puppet] - 10https://gerrit.wikimedia.org/r/1134698 (https://phabricator.wikimedia.org/T384227) [13:52:09] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134309|Improve GeoCrumbs fallback when page property is not (yet) set (T391128)]] (duration: 13m 25s) [13:52:12] T391128: Missing GeoCrumbs bread crumbs on wikivoyage - https://phabricator.wikimedia.org/T391128 [13:53:42] thank you so much James_F [13:54:16] Of course. [13:54:20] _joe_: Over to you. [13:54:26] !log Backport window complete. [13:54:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:28] <_joe_> ty! 
[13:55:10] 10ops-eqiad, 06DC-Ops: fasw2-c1[a|b]-eqiad:ge-0/0/27 flapping while admin down - https://phabricator.wikimedia.org/T391257 (10ayounsi) 03NEW [13:56:36] (03CR) 10Giuseppe Lavagetto: [C:03+2] Add mediawiki-common to mw-cron [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133902 (owner: 10Giuseppe Lavagetto) [13:56:59] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134698 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [13:57:40] (03PS1) 10Majavah: P:bird: Allow enabling IPv6 without enabling all services on it [puppet] - 10https://gerrit.wikimedia.org/r/1134699 (https://phabricator.wikimedia.org/T379282) [13:57:42] (03PS1) 10Majavah: hieradata: Announce OpenStack API over v6 from cloudlb2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/1134700 (https://phabricator.wikimedia.org/T379282) [13:58:29] (03Merged) 10jenkins-bot: Add mediawiki-common to mw-cron [deployment-charts] - 10https://gerrit.wikimedia.org/r/1133902 (owner: 10Giuseppe Lavagetto) [13:58:59] <_joe_> claime: FYI, deploying now [13:59:37] dogspeed\ [13:59:54] !log oblivian@deploy1003 helmfile [codfw] START helmfile.d/services/mw-cron: apply [13:59:56] (03PS2) 10Majavah: hieradata: Announce OpenStack API over v6 from cloudlb2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/1134700 (https://phabricator.wikimedia.org/T379282) [14:00:22] !log oblivian@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [14:00:43] !log oblivian@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:01:02] !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1047.eqiad.wmnet [14:01:05] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5221/co" [puppet] - 10https://gerrit.wikimedia.org/r/1134700 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [14:01:11] !log oblivian@deploy1003 
helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:01:51] 07Puppet, 06Infrastructure-Foundations: Improve the user experience adding new nodes to puppet - https://phabricator.wikimedia.org/T389932#10717872 (10MatthewVernon) FWIW my (simple) use case is typically "every node that looks like e.g. ms-be* should be `swift::storage`. I'd like to still be able to do that (... [14:02:29] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5219/console" [puppet] - 10https://gerrit.wikimedia.org/r/1134699 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [14:03:31] 06SRE, 06SRE Observability, 13Patch-For-Review: Cluster puppet variable and ganglia decommission - https://phabricator.wikimedia.org/T179395#10717877 (10fgiunchedi) 05Open→03Declined I'm going to decline this for now, we can revisit if and when the discussion re: cluster is revived [14:04:58] (03PS1) 10Giuseppe Lavagetto: mw-cron: fix values file name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134701 [14:05:11] (03CR) 10Giuseppe Lavagetto: [C:03+2] mw-cron: fix values file name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134701 (owner: 10Giuseppe Lavagetto) [14:06:54] (03Merged) 10jenkins-bot: mw-cron: fix values file name [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134701 (owner: 10Giuseppe Lavagetto) [14:07:04] !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1047.eqiad.wmnet [14:07:22] !log oblivian@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:07:57] !log oblivian@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:09:28] !log oblivian@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-cron: apply [14:09:34] !log oblivian@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-cron: apply [14:09:41] !log oblivian@deploy1003 helmfile [codfw] START 
helmfile.d/services/mw-cron: apply [14:09:49] (03CR) 10Jelto: [C:03+2] admin: add ozge shell user and groups [puppet] - 10https://gerrit.wikimedia.org/r/1133900 (https://phabricator.wikimedia.org/T390855) (owner: 10Jelto) [14:09:50] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [14:09:54] !log oblivian@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-cron: apply [14:10:37] <_joe_> claime: everything should be allright [14:10:41] <_joe_> emphasis on should [14:10:47] _joe_: test job ran ok [14:10:58] _joe_: startupregistrystats just kicked off [14:11:01] <_joe_> does the test job have to reach the db or anything? [14:11:08] <_joe_> ok that would indeed show issues [14:11:23] worked ok [14:11:32] _joe_: no it just does version.php [14:11:47] <_joe_> yeah the other one reaches out OTOH [14:12:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:13:39] (03PS1) 10Filippo Giunchedi: prometheus: scrape maintain_dbusers from its site only [puppet] - 10https://gerrit.wikimedia.org/r/1134708 [14:15:12] 06SRE, 06Data-Platform-SRE, 06SRE Observability: dropped packets to kafkamon 9000/tcp - https://phabricator.wikimedia.org/T238794#10717933 (10fgiunchedi) 05Open→03Invalid I'm not seeing the traffic anymore, resolving [14:16:55] yep [14:16:59] (03CR) 10Kamila Součková: [C:03+1] mwcron: Import all periodic_jobs resources [puppet] - 10https://gerrit.wikimedia.org/r/1133872 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [14:17:27] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5222/co" [puppet] - 10https://gerrit.wikimedia.org/r/1134708 (owner: 10Filippo Giunchedi) [14:18:29] (03CR) 10Ottomata: "Aiko suggested a different name, but otherwise LGTM!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133603 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [14:21:00] (03CR) 10Clément Goubert: [C:03+2] mwcron: Import all periodic_jobs resources [puppet] - 10https://gerrit.wikimedia.org/r/1133872 (https://phabricator.wikimedia.org/T341555) (owner: 10Clément Goubert) [14:24:28] (03PS2) 10Majavah: P:bird: Allow enabling IPv6 without enabling all services on it [puppet] - 10https://gerrit.wikimedia.org/r/1134699 (https://phabricator.wikimedia.org/T379282) [14:24:28] (03PS3) 10Majavah: hieradata: Announce OpenStack API over v6 from cloudlb2002-dev [puppet] - 10https://gerrit.wikimedia.org/r/1134700 (https://phabricator.wikimedia.org/T379282) [14:25:25] (03PS2) 10Kevin Bazira: EventStreamConfig: Add RRLA prediction_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133603 (https://phabricator.wikimedia.org/T326179) [14:25:42] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5224/co" [puppet] - 10https://gerrit.wikimedia.org/r/1134700 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [14:27:23] (03PS1) 10Clément Goubert: httpbb: replace jobrunner with mw-jobrunner [puppet] - 10https://gerrit.wikimedia.org/r/1134709 (https://phabricator.wikimedia.org/T354791) [14:27:23] (03CR) 10Kevin Bazira: EventStreamConfig: Add RRLA prediction_change stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133603 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [14:28:31] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install druid101[2-3] - https://phabricator.wikimedia.org/T387132#10717965 (10Jclark-ctr) a:03Jclark-ctr [14:28:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2056-production-search-codfw is not indexing - 
https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:28:52] 10ops-eqiad, 06SRE, 06Data-Platform-SRE, 06DC-Ops: Q3:rack/setup/install an-druid100[56] - https://phabricator.wikimedia.org/T387142#10717966 (10Jclark-ctr) a:03Jclark-ctr [14:28:57] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/5223/console" [puppet] - 10https://gerrit.wikimedia.org/r/1134699 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [14:29:09] !log bking@cumin2002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on cirrussearch[2055-2056].codfw.wmnet with reason: adding net-new role [14:30:05] (03CR) 10Kamila Součková: [C:03+1] httpbb: replace jobrunner with mw-jobrunner [puppet] - 10https://gerrit.wikimedia.org/r/1134709 (https://phabricator.wikimedia.org/T354791) (owner: 10Clément Goubert) [14:30:08] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10717971 (10Jclark-ctr) [14:30:09] (03CR) 10Ottomata: [C:03+1] EventStreamConfig: Add RRLA prediction_change stream (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133603 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [14:30:23] (03CR) 10Clément Goubert: [C:03+2] httpbb: replace jobrunner with mw-jobrunner [puppet] - 10https://gerrit.wikimedia.org/r/1134709 (https://phabricator.wikimedia.org/T354791) (owner: 10Clément Goubert) [14:31:04] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1169 [14:31:10] !log 
jclark@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host an-worker1169 [14:31:14] (03PS2) 10Kamila Součková: alertmanager: add task receivers for 4 teams [puppet] - 10https://gerrit.wikimedia.org/r/1134696 (https://phabricator.wikimedia.org/T385782) [14:31:17] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 10netops: Link down between cr3-ulsfo and cr4-ulsfo - https://phabricator.wikimedia.org/T390731#10717981 (10RobH) Case 01045114 opened just swapped out the info about a bit: > Support, We recently rolled some OS upgrades to our routers and du... [14:31:23] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134696 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková) [14:32:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10717991 (10Jclark-ctr) a:05Jclark-ctr→03BTullis Completed swapping drives and relocating 1169 from F6 to rack F8 [14:33:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10717997 (10Jclark-ctr) a:03VRiley-WMF [14:34:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudvirt10[68-76] - https://phabricator.wikimedia.org/T382492#10718001 (10Jclark-ctr) @VRiley-WMF procurement T379370 looks like you updated racking ticket Jan 30 2025 have these been Received in coupa / or can the ticke... [14:35:55] 10ops-eqiad, 06SRE, 06DC-Ops: fasw2-c1[a|b]-eqiad:ge-0/0/27 flapping while admin down - https://phabricator.wikimedia.org/T391257#10718036 (10VRiley-WMF) It looks like pay-1b1001 is currently connected to these ports. Would you like us to remove the SFPs? 
[14:37:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, April 08 UTC morning backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133603 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [14:37:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-jobrunner_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:37:57] (03PS3) 10Kamila Součková: alertmanager: add task receivers for 4 teams [puppet] - 10https://gerrit.wikimedia.org/r/1134696 (https://phabricator.wikimedia.org/T385782) [14:39:26] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10718084 (10Jelto) 05In progress→03Resolved Özg... 
[14:39:29] 10ops-eqiad, 06SRE, 06DC-Ops: fasw2-c1[a|b]-eqiad:ge-0/0/27 flapping while admin down - https://phabricator.wikimedia.org/T391257#10718086 (10Jclark-ctr) a:05Jclark-ctr→03VRiley-WMF [14:39:39] (03CR) 10Kamila Součková: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134696 (https://phabricator.wikimedia.org/T385782) (owner: 10Kamila Součková) [14:42:25] RESOLVED: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-jobrunner_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:42:26] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q2:rack/setup/install ganeti105[34].eqiad.wmnet - https://phabricator.wikimedia.org/T381576#10718091 (10Jclark-ctr) a:03VRiley-WMF [14:45:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10718101 (10Jclark-ctr) a:03VRiley-WMF These have been racked and in use is anything else needed for you for ticket? [14:53:11] (03CR) 10Majavah: [C:03+1] "relying on the fqdn containing the site feels hacky, but I can't really think of anything better atm (except maybe getting the data from p" [puppet] - 10https://gerrit.wikimedia.org/r/1134708 (owner: 10Filippo Giunchedi) [14:55:58] !log enabling unchecked_tombstone_compaction on sessionstore Cassandra — T390514 [14:56:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:37] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+1] "Thank you for the quick review!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1134708 (owner: 10Filippo Giunchedi) [14:56:40] (03CR) 10Filippo Giunchedi: [V:03+1 C:03+2] prometheus: scrape maintain_dbusers from its site only [puppet] - 10https://gerrit.wikimedia.org/r/1134708 (owner: 10Filippo Giunchedi) [14:56:41] 06SRE, 06SRE Observability, 07Kubernetes, 13Patch-For-Review: etcd: adapt etcd-backup.py for etcd 3.4 - https://phabricator.wikimedia.org/T385727#10718133 (10herron) 05Open→03Stalled [14:57:31] (03CR) 10MVernon: [C:03+2] Add two new ms-fe nodes [puppet] - 10https://gerrit.wikimedia.org/r/1134210 (https://phabricator.wikimedia.org/T388887) (owner: 10MVernon) [14:59:11] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1202 [14:59:12] !log jclark@cumin1002 END (FAIL) - Cookbook sre.network.configure-switch-interfaces (exit_code=99) for host an-worker1202 [15:00:01] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T391241#10718153 (10joanna_borun) p:05Triage→03Low [15:01:01] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1202 [15:01:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1202 [15:01:12] !log elukey@deploy1003 helmfile [aux-k8s-codfw] START helmfile.d/admin 'sync'. 
[15:01:57] !log mvernon@cumin1002 START - Cookbook sre.hosts.reboot-single for host ms-fe2015.codfw.wmnet [15:02:22] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1202.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:02:23] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install ms-fe201[56] - https://phabricator.wikimedia.org/T388887#10718159 (10ops-monitoring-bot) Host rebooted by mvernon@cumin1002 with reason: reboot before bringing into service [15:02:34] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T391241#10718162 (10elukey) 05Open→03Resolved a:03elukey Synced! [15:02:56] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1202.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:03:01] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: ProbeDown (instance ripe-atlas-codfw:0) - https://phabricator.wikimedia.org/T390676#10718167 (10ayounsi) @tappof is that still valid since you recent changes ? 
[15:03:45] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host ms-fe2016.codfw.wmnet [15:04:11] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1202.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:04:16] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 13Patch-For-Review: Q4:rack/setup/install ms-fe201[56] - https://phabricator.wikimedia.org/T388887#10718169 (10ops-monitoring-bot) Host rebooted by mvernon@cumin2002 with reason: reboot before bringing into service [15:04:35] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1202.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:04:54] 06SRE, 06Infrastructure-Foundations, 07LDAP: Extend LDAP group cross check - https://phabricator.wikimedia.org/T390817#10718170 (10joanna_borun) p:05Triage→03Low [15:04:56] 06SRE, 06Infrastructure-Foundations, 07LDAP: Extend LDAP group cross check - https://phabricator.wikimedia.org/T390817#10718171 (10SLyngshede-WMF) a:03SLyngshede-WMF [15:07:46] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T391241#10718186 (10elukey) 05Resolved→03Open There is an issue in codfw, the default images for ceph-related stuff are not ok. [15:07:59] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2015.codfw.wmnet [15:09:15] 07sre-alert-triage, 06Infrastructure-Foundations: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T391241#10718193 (10elukey) ` kube-system ceph-csi-rbd-nodeplugin-dd955 0/3 ErrImagePull 0 6m35s kube-syste...
[15:09:41] jouncebot: now and next [15:09:41] No deployments scheduled for the next 0 hour(s) and 20 minute(s) [15:09:45] jouncebot: next [15:09:46] In 0 hour(s) and 20 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250407T1530) [15:10:02] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ms-fe2016.codfw.wmnet [15:10:42] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:27] !log elukey@deploy1003 helmfile [aux-k8s-codfw] DONE helmfile.d/admin 'sync'. [15:12:13] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:16:35] !log mvernon@cumin2002 START - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies rolling restart_daemons on A:swift-fe-codfw [15:18:25] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [15:20:31] !log mvernon@cumin2002 END (PASS) - Cookbook sre.swift.roll-restart-reboot-swift-ms-proxies (exit_code=0) rolling restart_daemons on A:swift-fe-codfw [15:21:08] !log pool ms-fe2015 ms-fe2016 T388887 [15:21:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:11] T388887: Q4:rack/setup/install ms-fe201[56] - https://phabricator.wikimedia.org/T388887 [15:21:21] !log mvernon@cumin1002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe2015.codfw.wmnet [15:21:21] !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe2015.codfw.wmnet [15:21:22] !log 
mvernon@cumin1002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe2015.codfw.wmnet [15:21:22] !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe2015.codfw.wmnet [15:21:34] !log mvernon@cumin1002 conftool action : set/weight=40; selector: service=swift-fe,name=ms-fe2016.codfw.wmnet [15:21:40] !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: service=swift-fe,name=ms-fe2016.codfw.wmnet [15:21:46] !log mvernon@cumin1002 conftool action : set/weight=40; selector: service=nginx,name=ms-fe2016.codfw.wmnet [15:21:51] !log mvernon@cumin1002 conftool action : set/pooled=yes; selector: service=nginx,name=ms-fe2016.codfw.wmnet [15:22:30] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for an-worker1202 - jclark@cumin1002" [15:22:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added mgmt for an-worker1202 - jclark@cumin1002" [15:22:49] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:23:01] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1202 [15:23:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1202 [15:23:58] (03PS1) 10Volans: tox.ini: remove optimization for tox <4 [software/homer] - 10https://gerrit.wikimedia.org/r/1134712 [15:23:59] (03PS1) 10Volans: capirca: optimization refactor [software/homer] - 10https://gerrit.wikimedia.org/r/1134713 (https://phabricator.wikimedia.org/T250415) [15:24:00] (03PS1) 10Volans: homer: move NetboxData initialization [software/homer] - 10https://gerrit.wikimedia.org/r/1134714 (https://phabricator.wikimedia.org/T250415) [15:24:02] (03PS1) 10Volans: commit: refactor asking for approval [software/homer] - 
10https://gerrit.wikimedia.org/r/1134715 (https://phabricator.wikimedia.org/T250415) [15:24:03] (03PS1) 10Volans: commit: allow to approve/reject diffs globally [software/homer] - 10https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) [15:24:05] (03PS1) 10Volans: doc: update documentation configuration [software/homer] - 10https://gerrit.wikimedia.org/r/1134717 [15:24:10] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1202.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:25:31] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1202.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:27:02] (03CR) 10AOkoth: [C:03+1] miscweb: os-report: use puppetdb from external_services [deployment-charts] - 10https://gerrit.wikimedia.org/r/1131952 (https://phabricator.wikimedia.org/T350794) (owner: 10Jelto) [15:28:00] (03CR) 10Volans: "I've used the spicerack settings here and after we can remove the types from the docstrings." 
[software/homer] - 10https://gerrit.wikimedia.org/r/1134717 (owner: 10Volans) [15:28:19] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1202 [15:28:26] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1202 [15:28:38] (03PS1) 10Btullis: Deploy the updated ceph-csi container plugin to aux-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134718 (https://phabricator.wikimedia.org/T389184) [15:29:00] (03PS1) 10Herron: aux-k8s-codfw: disable ceph [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134719 (https://phabricator.wikimedia.org/T391241) [15:29:19] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1202.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:30:05] jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250407T1530). [15:30:56] 07sre-alert-triage, 06Infrastructure-Foundations, 13Patch-For-Review: Alert in need of triage: HelmfileAdminNGPendingChanges (instance deploy1003:9100) - https://phabricator.wikimedia.org/T391241#10718340 (10BTullis) This is odd. I had no idea that the ceph-csi-rnd plugin had been deployed to aux-k8s-codfw.... 
[15:35:09] (03PS1) 10Ahmon Dancy: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1134720 [15:35:42] FIRING: [2x] JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:36:30] (03CR) 10Ahmon Dancy: [C:03+2] Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1134720 (owner: 10Ahmon Dancy) [15:37:12] (03CR) 10Dzahn: [C:03+1] Apache config for arbcom_plwiki [puppet] - 10https://gerrit.wikimedia.org/r/1133995 (https://phabricator.wikimedia.org/T391009) (owner: 10Superpes15) [15:37:28] (03Merged) 10jenkins-bot: Merge remote-tracking branch 'origin/master' into train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1134720 (owner: 10Ahmon Dancy) [15:37:39] (03CR) 10Dzahn: [C:03+2] hiera: cleanup gitlab-runner docker gc settings [puppet] - 10https://gerrit.wikimedia.org/r/1133992 (https://phabricator.wikimedia.org/T390948) (owner: 10Dzahn) [15:37:55] 06SRE, 10Wikimedia-Mailing-lists: mailman/postorius: errors when changing subscription or when trying to unsubscribe - https://phabricator.wikimedia.org/T391260#10718379 (10Anoop) [15:39:33] (03CR) 10Dzahn: [C:03+1] trafficserver: switch querybuilder scholarly to wikikube [puppet] - 10https://gerrit.wikimedia.org/r/1134697 (https://phabricator.wikimedia.org/T350793) (owner: 10Jelto) [15:39:46] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1202.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:40:38] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1202.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell 
SCP reboot policy FORCED [15:41:09] (03CR) 10Elukey: [C:03+1] aux-k8s-codfw: disable ceph [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134719 (https://phabricator.wikimedia.org/T391241) (owner: 10Herron) [15:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:43:21] (03CR) 10Elukey: [C:03+2] Release version 4.0.4 [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1131666 (owner: 10Elukey) [15:43:26] (03CR) 10Ssingh: "Looks good but question from me:" [puppet] - 10https://gerrit.wikimedia.org/r/1134698 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [15:44:15] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host an-worker1202.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [15:44:22] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1202 [15:44:37] (03CR) 10MVernon: [C:03+1] "I wasn't sure about that, so I went looking in the ceph source. Yes, the keys are always 40 / 20 characters long, so I tightened @btullis@" [puppet] - 10https://gerrit.wikimedia.org/r/1133916 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [15:44:38] (03CR) 10Ssingh: [C:03+1] "I M ao" [puppet] - 10https://gerrit.wikimedia.org/r/1134699 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [15:44:41] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1202 [15:44:58] (03CR) 10Ssingh: [C:03+1] "I meant "I am OK with this approach"." [puppet] - 10https://gerrit.wikimedia.org/r/1134699 (https://phabricator.wikimedia.org/T379282) (owner: 10Majavah) [15:46:12] (03CR) 10Dzahn: [C:03+2] "this needs a follow-up. 
Function lookup() did not find a value for the name 'profile::etherpad::service_ensure'" [puppet] - 10https://gerrit.wikimedia.org/r/1133996 (https://phabricator.wikimedia.org/T390948) (owner: 10Dzahn) [15:48:20] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10718403 (10fnegri) [15:49:14] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1202 [15:49:21] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1202 [15:50:23] (03CR) 10Alexandros Kosiaris: [C:03+2] ats: Switch mw-wikifunctions back to original FQDN [puppet] - 10https://gerrit.wikimedia.org/r/1134281 (owner: 10Alexandros Kosiaris) [15:50:32] (03PS1) 10Ahmon Dancy: Updates for train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1134722 [15:51:36] (03CR) 10Ahmon Dancy: [C:03+2] Updates for train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1134722 (owner: 10Ahmon Dancy) [15:52:27] (03Merged) 10jenkins-bot: Updates for train-dev [mediawiki-config] (train-dev) - 10https://gerrit.wikimedia.org/r/1134722 (owner: 10Ahmon Dancy) [15:53:35] (03PS1) 10Dzahn: hieradata: add etherpad service_ensure key to devtools project level [puppet] - 10https://gerrit.wikimedia.org/r/1134723 (https://phabricator.wikimedia.org/T390948) [15:54:31] (03CR) 10Hoo man: [C:03+1] Fix EntitySchema propertyType on Test Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134691 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [15:55:02] (03CR) 10Hoo man: [C:03+1] "Good to go once announced." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134692 (https://phabricator.wikimedia.org/T371196) (owner: 10Lucas Werkmeister (WMDE)) [15:56:10] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [15:58:48] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:59:48] (03CR) 10Fabfur: "Good question... The two structures are identical (`profile::cache::haproxy::available_unified_certificates`) and the only part that is sp" [puppet] - 10https://gerrit.wikimedia.org/r/1134698 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [15:59:54] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host an-worker1202.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:01:00] (03PS1) 10Ssingh: package_builder: add bc package for OpenSSL build [puppet] - 10https://gerrit.wikimedia.org/r/1134726 (https://phabricator.wikimedia.org/T205378) [16:01:21] (03CR) 10Dzahn: [C:03+2] "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1134723" [puppet] - 10https://gerrit.wikimedia.org/r/1133996 (https://phabricator.wikimedia.org/T390948) (owner: 10Dzahn) [16:01:28] (03CR) 10Dzahn: [C:03+2] hieradata: add etherpad service_ensure key to devtools project level [puppet] - 10https://gerrit.wikimedia.org/r/1134723 (https://phabricator.wikimedia.org/T390948) (owner: 10Dzahn) [16:03:00] (03PS2) 10Ssingh: package_builder: add bc package for OpenSSL build [puppet] - 10https://gerrit.wikimedia.org/r/1134726 (https://phabricator.wikimedia.org/T205378) [16:05:18] (03CR) 10Herron: [C:03+2] aux-k8s-codfw: disable ceph [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134719 (https://phabricator.wikimedia.org/T391241) (owner: 10Herron) [16:07:24] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in relforge [16:07:25] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in relforge [16:07:26] 10ops-eqiad, 06DC-Ops, 
10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10718484 (10fnegri) @VRiley-WMF sorry for the lack of updates, I had to prioritize other things. This server is still alerting, and I didn't manage... [16:08:03] (03CR) 10Fabfur: [C:03+1] package_builder: add bc package for OpenSSL build [puppet] - 10https://gerrit.wikimedia.org/r/1134726 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [16:08:19] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1004* for test ban syntax - bking@cumin2002 - T391151 [16:08:20] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1004* for test ban syntax - bking@cumin2002 - T391151 [16:08:22] T391151: Ensure ban.py cookbook can ban not-yet-existing hosts - https://phabricator.wikimedia.org/T391151 [16:08:29] (03CR) 10Ssingh: [C:03+2] package_builder: add bc package for OpenSSL build [puppet] - 10https://gerrit.wikimedia.org/r/1134726 (https://phabricator.wikimedia.org/T205378) (owner: 10Ssingh) [16:09:07] (03PS1) 10Elukey: Updating docker-pkg to 4.0.4 [docker-images/docker-pkg/deploy] - 10https://gerrit.wikimedia.org/r/1134727 [16:10:00] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host an-worker1202.mgmt.eqiad.wmnet with chassis set policy FORCE_RESTART and with Dell SCP reboot policy FORCED [16:10:38] (03Merged) 10jenkins-bot: aux-k8s-codfw: disable ceph [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134719 (https://phabricator.wikimedia.org/T391241) (owner: 10Herron) [16:13:53] (03CR) 10Elukey: [C:03+2] services: use the data-gw staging endpoint in commons-impact-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134647 (owner: 10Elukey) [16:14:01] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - 
https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [16:15:03] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host an-worker1202 [16:15:11] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host an-worker1202 [16:17:16] !log mforns@deploy1003 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [16:17:28] !log mforns@deploy1003 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [16:21:51] (03PS1) 10Elukey: services: fix staging settings for commons-impact-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134728 [16:22:32] (03CR) 10Mforns: [C:03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134728 (owner: 10Elukey) [16:23:17] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache an-worker1202.eqiad.wmnet on all recursors [16:23:20] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-worker1202.eqiad.wmnet on all recursors [16:24:34] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache an-worker1202.eqiad.wmnet on all recursors [16:24:37] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) an-worker1202.eqiad.wmnet on all recursors [16:24:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [16:25:23] (03PS2) 10Fabfur: hiera: cleanup TLS on volatile storage custom files [puppet] - 10https://gerrit.wikimedia.org/r/1134698 (https://phabricator.wikimedia.org/T384227) [16:25:31] (03PS2) 10Elukey: services: fix staging settings for commons-impact-analytics 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1134728 [16:25:34] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1202.eqiad.wmnet with OS bullseye [16:25:44] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): an-worker1202 is in the private vlan instead of the analytics vlan - https://phabricator.wikimedia.org/T390048#10718574 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker... [16:25:58] 06SRE, 10Wikimedia-Mailing-lists, 10Catalyst (olin): Create a PatchDemo/Catalyst mailing list - https://phabricator.wikimedia.org/T388922#10718576 (10thcipriani) >>! In T388922#10667692, @jnuche wrote: >> Let's make a Patchdemo/Catalyst-specific list for users (Info) > > In that link it's mentioned we will... [16:26:15] 06SRE, 10Wikimedia-Mailing-lists, 10Catalyst (olin): Create a PatchDemo/Catalyst mailing list - https://phabricator.wikimedia.org/T388922#10718577 (10thcipriani) [16:26:21] (03CR) 10Mforns: [C:03+1] "LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134728 (owner: 10Elukey) [16:26:24] (03CR) 10Fabfur: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134698 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [16:27:04] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [16:28:14] (03CR) 10Elukey: [C:03+2] services: fix staging settings for commons-impact-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134728 (owner: 10Elukey) [16:29:22] !log mforns@deploy1003 helmfile [staging] START helmfile.d/services/commons-impact-analytics: apply [16:29:30] !log mforns@deploy1003 helmfile [staging] DONE helmfile.d/services/commons-impact-analytics: apply [16:29:58] 06SRE, 10Wikimedia-Mailing-lists, 10Catalyst (olin): Create a PatchDemo/Catalyst mailing list - https://phabricator.wikimedia.org/T388922#10718585 (10Ladsgroup) You want it on mailman 
(lists.wikimedia.org?) [16:30:10] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: relforge1003* for test ban syntax - bking@cumin2002 - T391151 [16:30:11] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: relforge1003* for test ban syntax - bking@cumin2002 - T391151 [16:30:12] T391151: Ensure ban.py cookbook can ban not-yet-existing hosts - https://phabricator.wikimedia.org/T391151 [16:30:49] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10718591 (10VRiley-WMF) I'm available for this activity at anytime. 15:00 UTC - 17:00 UTC works for me. [16:32:48] 10ops-eqiad, 06DC-Ops, 10cloud-services-team (FY2024/2025-Q3-Q4): Temperature Inlet Temp issue on clouddumps1001:9290 - https://phabricator.wikimedia.org/T383723#10718604 (10fnegri) Great! I talked with @VRiley-WMF and the plan is: 1. I'm gonna shut down the server tomorrow for about 1 hour, to check if ther... 
[16:33:18] !log Upload ncmonitor 1.3.4-1 to bookworm-wikimedia [16:33:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:13] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [16:42:29] (03CR) 10AikoChou: [C:03+1] "Thanks for working on this :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133603 (https://phabricator.wikimedia.org/T326179) (owner: 10Kevin Bazira) [16:47:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [16:48:06] (03PS1) 10Slyngshede: data.yaml Revoke access for user [puppet] - 10https://gerrit.wikimedia.org/r/1134734 [16:48:44] (03CR) 10CI reject: [V:04-1] data.yaml Revoke access for user [puppet] - 10https://gerrit.wikimedia.org/r/1134734 (owner: 10Slyngshede) [16:49:28] (03PS2) 10Slyngshede: data.yaml Revoke access for user [puppet] - 10https://gerrit.wikimedia.org/r/1134734 [16:52:07] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1202.eqiad.wmnet with OS bullseye [16:52:15] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): an-worker1202 is in the private vlan instead of the analytics vlan - https://phabricator.wikimedia.org/T390048#10718671 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1202... 
[16:52:28] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1202.eqiad.wmnet with OS bullseye [16:52:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): an-worker1202 is in the private vlan instead of the analytics vlan - https://phabricator.wikimedia.org/T390048#10718672 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker... [16:53:34] (03CR) 10JHathaway: [C:03+1] "one nit, but looks fine" [puppet] - 10https://gerrit.wikimedia.org/r/1134734 (owner: 10Slyngshede) [16:54:53] !log slyngshede@cumin1002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Xiaoxiao out of all services on: 2396 hosts [16:55:18] (03CR) 10Slyngshede: data.yaml Revoke access for user (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1134734 (owner: 10Slyngshede) [16:55:55] (03CR) 10Slyngshede: [C:03+2] data.yaml Revoke access for user [puppet] - 10https://gerrit.wikimedia.org/r/1134734 (owner: 10Slyngshede) [16:59:11] !log dancy@deploy1003 Installing scap version "4.151.0" for 190 host(s) [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250407T1700) [17:00:04] ryankemper: Time to do the Wikidata Query Service weekly deploy deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250407T1700). [17:00:40] (03CR) 10Ssingh: [C:03+1] "Looks good. We can perhaps set use_tls_tmpfiles to true by default and get rid of the override but it's not a big deal." 
[puppet] - 10https://gerrit.wikimedia.org/r/1134698 (https://phabricator.wikimedia.org/T384227) (owner: 10Fabfur) [17:04:01] !log dancy@deploy1003 Installation of scap version "4.151.0" completed for 190 hosts [17:04:26] !log Disabling puppet on A:cp to roll out removal of varnish 6/7 template switching (T378737) [17:04:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:28] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [17:05:04] (03CR) 10BCornwall: [V:03+1 C:03+2] varnish: Remove support for below version 7 [puppet] - 10https://gerrit.wikimedia.org/r/1132765 (https://phabricator.wikimedia.org/T378737) (owner: 10BCornwall) [17:13:19] (03PS1) 10Dzahn: hieradata: add phab shell user groups on phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1134736 (https://phabricator.wikimedia.org/T390034) [17:14:16] (03PS2) 10Dzahn: hieradata: add phab shell user groups on phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1134736 (https://phabricator.wikimedia.org/T390034) [17:14:50] (03CR) 10Dzahn: [C:03+2] hieradata: add phab shell user groups on phab1005 [puppet] - 10https://gerrit.wikimedia.org/r/1134736 (https://phabricator.wikimedia.org/T390034) (owner: 10Dzahn) [17:17:10] !log Re-enabling Puppet on A:cp (T378737) [17:17:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:17:13] T378737: Upgrade Varnish from 6.0 to 7.1 - https://phabricator.wikimedia.org/T378737 [17:22:09] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1165.eqiad.wmnet with reason: Maintenance [17:22:27] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:22:35] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T391056)', diff saved to https://phabricator.wikimedia.org/P74619 and 
previous config saved to /var/cache/conftool/dbconfig/20250407-172234-fceratto.json [17:22:37] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [17:23:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T391056)', diff saved to https://phabricator.wikimedia.org/P74620 and previous config saved to /var/cache/conftool/dbconfig/20250407-172343-fceratto.json [17:26:43] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7001.magru.wmnet [17:27:23] !log brett@puppetserver1001 conftool action : set/pooled=no; selector: name=cp7002.magru.wmnet [17:29:33] FIRING: KubernetesCalicoDown: wikikube-worker2142.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-worker2142.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [17:30:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10718758 (10phaultfinder) [17:38:51] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P74621 and previous config saved to /var/cache/conftool/dbconfig/20250407-173851-fceratto.json [17:44:27] !log Remove libvmod-netmapper, libvmod-querysort, varnish-re2, varnish, varnishkafka, varnish-modules from bullseye-wikimedia component/varnish-staging [17:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:50:43] !log jhathaway@cumin1002 DONE (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Xiaoxiao out of all services on: 2397 hosts [17:51:16] 10SRE-swift-storage, 10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already 
sent by includes/libs/http/MultiHttpClient.php - https://phabricator.wikimedia.org/T369186#10718786 (10HCoplin-WMF) Wondering if this is still happ... [17:53:58] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P74622 and previous config saved to /var/cache/conftool/dbconfig/20250407-175358-fceratto.json [17:59:17] !log Upload varnishkafka 1.2.0-2 to bullseye-wikimedia (T389605) [17:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:31] (03PS33) 10Federico Ceratto: sre.mysql.upgrade: add depool/pool logic [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [18:03:31] (03CR) 10Federico Ceratto: "Updated based on the feedback and rebased over the main branch." [cookbooks] - 10https://gerrit.wikimedia.org/r/1080718 (https://phabricator.wikimedia.org/T368881) (owner: 10Arnaudb) [18:05:24] 10SRE-swift-storage, 10MediaWiki-libs-HTTP, 06MW-Interfaces-Team, 07Wikimedia-production-error: PHP Warning: Cannot modify header information - headers already sent by includes/libs/http/MultiHttpClient.php - https://phabricator.wikimedia.org/T369186#10718852 (10Tgr) Hasn't reoccurred since then. The fact... 
[18:07:15] (03PS1) 10AOkoth: site: revert releases2003 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1134740 (https://phabricator.wikimedia.org/T384595) [18:07:25] 06SRE, 06Infrastructure-Foundations, 07LDAP: Extend LDAP group cross check against names in data.yaml - https://phabricator.wikimedia.org/T390817#10718859 (10Aklapper) [18:07:39] (03CR) 10CI reject: [V:04-1] site: revert releases2003 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1134740 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth) [18:08:58] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1202.eqiad.wmnet with OS bullseye [18:09:06] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T391056)', diff saved to https://phabricator.wikimedia.org/P74623 and previous config saved to /var/cache/conftool/dbconfig/20250407-180905-fceratto.json [18:09:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): an-worker1202 is in the private vlan instead of the analytics vlan - https://phabricator.wikimedia.org/T390048#10718862 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1202... 
[18:09:08] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [18:09:20] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1168.eqiad.wmnet with reason: Maintenance [18:09:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T391056)', diff saved to https://phabricator.wikimedia.org/P74624 and previous config saved to /var/cache/conftool/dbconfig/20250407-180927-fceratto.json [18:09:37] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1202.eqiad.wmnet with OS bullseye [18:09:43] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): an-worker1202 is in the private vlan instead of the analytics vlan - https://phabricator.wikimedia.org/T390048#10718867 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host an-worker... [18:10:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T391056)', diff saved to https://phabricator.wikimedia.org/P74625 and previous config saved to /var/cache/conftool/dbconfig/20250407-181035-fceratto.json [18:11:38] (03PS2) 10AOkoth: site: revert releases2003 to insetup [puppet] - 10https://gerrit.wikimedia.org/r/1134740 (https://phabricator.wikimedia.org/T384595) [18:12:50] (03PS1) 10Sbisson: CX: Redirect to target wiki if needed, when CX cookie is set [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134741 (https://phabricator.wikimedia.org/T390934) [18:13:27] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 07 UTC late backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item" [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134741 (https://phabricator.wikimedia.org/T390934) (owner: 10Sbisson) [18:14:13] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE 
(2025.03.22 - 2025.04.11): an-worker1202 is in the private vlan instead of the analytics vlan - https://phabricator.wikimedia.org/T390048#10718888 (10Jclark-ctr) @BTullis i have fixed it and removed it to the analytics vlan but having issues with it passin... [18:14:42] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/output/1134740/5225/" [puppet] - 10https://gerrit.wikimedia.org/r/1134740 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth) [18:20:27] (03PS1) 10BCornwall: varnish: Check for 'busy' in vcl.list output [puppet] - 10https://gerrit.wikimedia.org/r/1134742 [18:20:48] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10718916 (10phaultfinder) [18:22:34] (03CR) 10Ssingh: [C:03+1] varnish: Check for 'busy' in vcl.list output [puppet] - 10https://gerrit.wikimedia.org/r/1134742 (owner: 10BCornwall) [18:23:15] (03CR) 10BCornwall: [C:03+2] varnish: Check for 'busy' in vcl.list output [puppet] - 10https://gerrit.wikimedia.org/r/1134742 (owner: 10BCornwall) [18:23:28] (03PS5) 10SBassett: OATHAuth: Mark checkuser and suppress as requiring 2FA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133245 (https://phabricator.wikimedia.org/T150898) [18:24:05] (03PS6) 10SBassett: OATHAuth: Mark checkuser and suppress as requiring 2FA [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1133245 (https://phabricator.wikimedia.org/T150898) [18:25:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P74627 and previous config saved to /var/cache/conftool/dbconfig/20250407-182542-fceratto.json [18:26:26] jouncebot now [18:26:26] No deployments scheduled for the next 1 hour(s) and 33 minute(s) [18:26:31] (03PS2) 10Volans: commit: refactor asking for approval [software/homer] - 10https://gerrit.wikimedia.org/r/1134715 (https://phabricator.wikimedia.org/T250415) [18:26:31] (03PS2) 
10Volans: commit: allow to approve/reject diffs globally [software/homer] - 10https://gerrit.wikimedia.org/r/1134716 (https://phabricator.wikimedia.org/T250415) [18:26:32] (03PS2) 10Volans: doc: update documentation configuration [software/homer] - 10https://gerrit.wikimedia.org/r/1134717 [18:26:53] !log dancy@deploy1003 Started scap sync-world: testing [18:30:06] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7001.magru.wmnet [18:31:55] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10718936 (10VRiley-WMF) [18:32:00] 10ops-eqiad, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Q2:rack/setup E8/F8 new leaf switches - https://phabricator.wikimedia.org/T382017#10718937 (10VRiley-WMF) 05Open→03Resolved [18:32:28] !log dancy@deploy1003 Finished scap sync-world: testing (duration: 05m 35s) [18:38:07] (03CR) 10CI reject: [V:04-1] commit: refactor asking for approval [software/homer] - 10https://gerrit.wikimedia.org/r/1134715 (https://phabricator.wikimedia.org/T250415) (owner: 10Volans) [18:40:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P74628 and previous config saved to /var/cache/conftool/dbconfig/20250407-184049-fceratto.json [18:42:55] (03CR) 10Dzahn: "so reimage and not a new VM after all?" 
[puppet] - 10https://gerrit.wikimedia.org/r/1134740 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth) [18:43:37] !log brett@puppetserver1001 conftool action : set/pooled=yes; selector: name=cp7002.magru.wmnet [18:49:54] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on P{cp700[3-8].magru.wmnet} and A:cp [18:51:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [18:55:03] !log brett@cumin2002 START - Cookbook sre.cdn.roll-upgrade-varnish rolling upgrade of Varnish on A:cp-upload_magru [18:55:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T391056)', diff saved to https://phabricator.wikimedia.org/P74629 and previous config saved to /var/cache/conftool/dbconfig/20250407-185556-fceratto.json [18:56:00] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [18:56:12] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1180.eqiad.wmnet with reason: Maintenance [18:56:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T391056)', diff saved to https://phabricator.wikimedia.org/P74630 and previous config saved to /var/cache/conftool/dbconfig/20250407-185619-fceratto.json [18:56:57] 06SRE-OnFire, 06Release-Engineering-Team, 10Scap, 06serviceops, 10Sustainability (Incident Followup): Should scap be able to update helmfile-defaults when -Dbuild_mw_container_image:False ? 
- https://phabricator.wikimedia.org/T390531#10719044 (10dancy) [18:58:28] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T391056)', diff saved to https://phabricator.wikimedia.org/P74631 and previous config saved to /var/cache/conftool/dbconfig/20250407-185828-fceratto.json [19:00:33] (03CR) 10AOkoth: "I was thinking of trying both approaches and compare. I'll create new VMs for aphlict." [puppet] - 10https://gerrit.wikimedia.org/r/1134740 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth) [19:01:12] (03Abandoned) 10Gergő Tisza: CentralAuth: lower timeout for token validation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128913 (owner: 10Giuseppe Lavagetto) [19:06:01] (03CR) 10Dzahn: [C:03+1] "alright, +1 then. keep in mind it's possible though the role won't work on bookworm and then the question becomes how long we are ok witho" [puppet] - 10https://gerrit.wikimedia.org/r/1134740 (https://phabricator.wikimedia.org/T384595) (owner: 10AOkoth) [19:06:01] !log extending vg0/srv logical volume, sesionstore2004 — T390514 [19:06:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:07:20] FIRING: CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:08:20] FIRING: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:08:38] FIRING: 
CirrusSearchThreadPoolRejectionsTooHigh: elastic1096-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [19:09:20] FIRING: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [19:12:13] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:12:20] FIRING: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:12:35] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1202.eqiad.wmnet with OS bullseye [19:12:40] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): an-worker1202 is in the private vlan instead of the analytics vlan - https://phabricator.wikimedia.org/T390048#10719076 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host an-worker1202... 
[19:13:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P74632 and previous config saved to /var/cache/conftool/dbconfig/20250407-191335-fceratto.json [19:14:20] RESOLVED: [2x] CirrusSearchMoreLikeLatencyTooHigh: CirrusSearch more_like 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchMoreLikeLatencyTooHigh [19:18:38] RESOLVED: CirrusSearchThreadPoolRejectionsTooHigh: elastic1096-production-search-eqiad is rejecting excessive amounts of queries due to a full thread pool - https://w.wiki/DTaY - https://grafana.wikimedia.org/goto/aoZBw8pNR?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchThreadPoolRejectionsTooHigh [19:19:12] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1096* for ban node to stop high rejection rates - bking@cumin2002 [19:19:15] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1096* for ban node to stop high rejection rates - bking@cumin2002 [19:22:20] RESOLVED: [2x] CirrusSearchCompletionLatencyTooHigh: CirrusSearch comp_suggest 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchCompletionLatencyTooHigh [19:23:20] RESOLVED: [2x] CirrusSearchFullTextLatencyTooHigh: CirrusSearch full_text 95th percentiles latency is too high (mw@codfw to eqiad) - https://wikitech.wikimedia.org/wiki/Search#Health/Activity_Monitoring - 
https://grafana.wikimedia.org/d/dc04b9f2-b8d5-4ab6-9482-5d9a75728951/elasticsearch-percentiles?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchFullTextLatencyTooHigh [19:28:43] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P74633 and previous config saved to /var/cache/conftool/dbconfig/20250407-192842-fceratto.json [19:35:42] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:38:07] (03PS1) 10Bking: cirrussearch: update conftool data with new hostnames (row A) [puppet] - 10https://gerrit.wikimedia.org/r/1134755 (https://phabricator.wikimedia.org/T388610) [19:38:25] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134755 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [19:39:39] (03CR) 10Bking: [C:03+2] cirrussearch: Add cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/1134078 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [19:41:49] !log extending vg0/srv logical volume, sesionstore2005 — T390514 [19:41:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:43:50] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T391056)', diff saved to https://phabricator.wikimedia.org/P74634 and previous config saved to /var/cache/conftool/dbconfig/20250407-194350-fceratto.json [19:43:53] T391056: Drop afl_patrolled_by from abuse_filter_log in production - 
https://phabricator.wikimedia.org/T391056 [19:44:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1187.eqiad.wmnet with reason: Maintenance [19:44:13] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T391056)', diff saved to https://phabricator.wikimedia.org/P74635 and previous config saved to /var/cache/conftool/dbconfig/20250407-194412-fceratto.json [19:44:15] !log extending vg0/srv logical volume, sesionstore2006 — T390514 [19:44:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:21] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T391056)', diff saved to https://phabricator.wikimedia.org/P74636 and previous config saved to /var/cache/conftool/dbconfig/20250407-194621-fceratto.json [19:48:58] !log extending vg0/srv logical volume, sessionstore100[4-6].eqiad.wmnet — T390514 [19:48:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [19:59:50] stephanebisson: OK for me to deploy, or did you want to? [20:00:05] RoanKattouw, Urbanecm, TheresNoTime, and kindrobot: Your horoscope predicts another UTC late backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250407T2000). [20:00:05] James_F and stephanebisson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:06] James_F go for it [20:00:16] (03CR) 10Jforrester: [C:03+2] CX: Redirect to target wiki if needed, when CX cookie is set [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134741 (https://phabricator.wikimedia.org/T390934) (owner: 10Sbisson) [20:00:28] Cool. I'll sneak out my config ones whilst we wait for it to merge. [20:00:40] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: 10Jforrester) [20:00:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jforrester@deploy1003 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134057 (owner: 10Jforrester) [20:01:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P74637 and previous config saved to /var/cache/conftool/dbconfig/20250407-200128-fceratto.json [20:04:58] (03Merged) 10jenkins-bot: search-redirect: Handle $_GET potential vulnerability scanning [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1128050 (https://phabricator.wikimedia.org/T389019) (owner: 10Jforrester) [20:05:01] (03Merged) 10jenkins-bot: wikifunctionswiki: Make 'native' mode the default for Maths [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134057 (owner: 10Jforrester) [20:05:07] (03Merged) 10jenkins-bot: CX: Redirect to target wiki if needed, when CX cookie is set [extensions/ContentTranslation] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134741 (https://phabricator.wikimedia.org/T390934) (owner: 10Sbisson) [20:05:19] !log jforrester@deploy1003 Started scap sync-world: Backport for [[gerrit:1128050|search-redirect: Handle $_GET potential vulnerability scanning (T389019)]], [[gerrit:1134057|wikifunctionswiki: Make 'native' mode the default for Maths]] [20:05:22] T389019: Argument #2 must be of type string, array 
given in /srv/mediawiki/docroot/wwwportal/w/search-redirect.php - https://phabricator.wikimedia.org/T389019 [20:05:35] Aha, nice, we're doing it all together, very efficient. [20:05:43] * James_F clearly shouldn't have doubted CI. [20:07:44] FIRING: RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:09:36] !log jforrester@deploy1003 jforrester: Backport for [[gerrit:1128050|search-redirect: Handle $_GET potential vulnerability scanning (T389019)]], [[gerrit:1134057|wikifunctionswiki: Make 'native' mode the default for Maths]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:10:39] stephanebisson: Please test and confirm it's OK to continue. [20:10:58] James_F On it. I'll need a few minutes [20:11:34] Of course. [20:12:07] James_F I'm done. All good [20:12:13] Awesome. 
[20:12:15] !log jforrester@deploy1003 jforrester: Continuing with sync [20:12:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:16:36] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P74638 and previous config saved to /var/cache/conftool/dbconfig/20250407-201635-fceratto.json [20:17:13] FIRING: [5x] PuppetCertificateAboutToExpire: Puppet CA certificate ganeti01.svc.codfw.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [20:17:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [20:19:26] !log jforrester@deploy1003 Finished scap sync-world: Backport for [[gerrit:1128050|search-redirect: Handle $_GET potential vulnerability scanning (T389019)]], [[gerrit:1134057|wikifunctionswiki: Make 'native' mode the default for Maths]] (duration: 14m 06s) [20:19:29] T389019: Argument #2 must be of type string, array given in /srv/mediawiki/docroot/wwwportal/w/search-redirect.php - https://phabricator.wikimedia.org/T389019 [20:20:09] Okie-dokie, all done. [20:20:22] !log Backport window complete. [20:20:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:24] Thank you sir! 
[20:24:39] FIRING: CirrusSearchTitleSuggestIndexTooOld: Some search indices that power autocomplete have not been updated recently - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#CirrusSearch_titlesuggest_index_is_too_old - TODO - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchTitleSuggestIndexTooOld [20:31:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T391056)', diff saved to https://phabricator.wikimedia.org/P74640 and previous config saved to /var/cache/conftool/dbconfig/20250407-203142-fceratto.json [20:31:45] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [20:31:58] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1201.eqiad.wmnet with reason: Maintenance [20:32:04] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [20:32:05] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T391056)', diff saved to https://phabricator.wikimedia.org/P74641 and previous config saved to /var/cache/conftool/dbconfig/20250407-203205-fceratto.json [20:33:14] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T391056)', diff saved to https://phabricator.wikimedia.org/P74642 and previous config saved to /var/cache/conftool/dbconfig/20250407-203313-fceratto.json [20:37:13] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [20:37:39] FIRING: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance elastic1096-production-search-eqiad is not indexing - 
https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [20:39:01] !log bking@cumin2002 START - Cookbook sre.hosts.reboot-single for host elastic1096.eqiad.wmnet [20:45:17] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host elastic1096.eqiad.wmnet [20:48:22] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P74643 and previous config saved to /var/cache/conftool/dbconfig/20250407-204821-fceratto.json [20:49:39] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Unbanning all hosts in search_eqiad [20:49:44] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Unbanning all hosts in search_eqiad [20:52:04] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [20:55:03] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2055.codfw.wmnet [20:55:12] !log bking@cumin2002 conftool action : set/weight=10:pooled=yes; selector: name=cirrussearch2056.codfw.wmnet [20:57:24] (03PS1) 10Bking: cirrussearch: Add row A hosts to new cirrussearch role [puppet] - 10https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) [21:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250407T2100). 
[21:03:23] (03PS2) 10Bking: cirrussearch: update conftool data with new hostnames (row A) [puppet] - 10https://gerrit.wikimedia.org/r/1134755 (https://phabricator.wikimedia.org/T388610) [21:03:29] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P74644 and previous config saved to /var/cache/conftool/dbconfig/20250407-210328-fceratto.json [21:06:45] (03CR) 10Ryan Kemper: [C:03+1] cirrussearch: Add row A hosts to new cirrussearch role [puppet] - 10https://gerrit.wikimedia.org/r/1134761 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [21:43:08] 06SRE-OnFire, 10Incident Tooling: Incident documents are less visible with Corto - https://phabricator.wikimedia.org/T390126#10719499 (10jhathaway) reached out to ITS in a follow-up task: https://wikimediainternal.zendesk.com/hc/en-us/requests/111894 [21:53:44] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231', diff saved to https://phabricator.wikimedia.org/P74650 and previous config saved to /var/cache/conftool/dbconfig/20250407-215342-fceratto.json [22:03:32] (03PS1) 10Bking: cirrussearch: Add regex data for cirrussearch hosts [puppet] - 10https://gerrit.wikimedia.org/r/1134765 (https://phabricator.wikimedia.org/T388610) [22:03:57] (03CR) 10CI reject: [V:04-1] cirrussearch: Add regex data for cirrussearch hosts [puppet] - 10https://gerrit.wikimedia.org/r/1134765 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [22:08:52] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1231 (T391056)', diff saved to https://phabricator.wikimedia.org/P74651 and previous config saved to /var/cache/conftool/dbconfig/20250407-220851-fceratto.json [22:08:55] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [22:09:07] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 
dbstore1009.eqiad.wmnet with reason: Maintenance [22:12:18] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2151.codfw.wmnet with reason: Maintenance [22:12:24] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2151 (T391056)', diff saved to https://phabricator.wikimedia.org/P74652 and previous config saved to /var/cache/conftool/dbconfig/20250407-221224-fceratto.json [22:15:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10719576 (10BTullis) >>! In T390169#10717991, @Jclark-ctr wrote: > Completed swapping drives and relocating 1169 from F6... [22:16:59] (03PS2) 10Bking: cirrussearch: Add regex data for cirrussearch hosts [puppet] - 10https://gerrit.wikimedia.org/r/1134765 (https://phabricator.wikimedia.org/T388610) [22:18:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T391056)', diff saved to https://phabricator.wikimedia.org/P74653 and previous config saved to /var/cache/conftool/dbconfig/20250407-221812-fceratto.json [22:18:15] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [22:18:38] (03CR) 10Bking: [C:03+1] search: allow any cirrussearch host to join cluster [puppet] - 10https://gerrit.wikimedia.org/r/1134764 (owner: 10Ryan Kemper) [22:20:38] !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1170.eqiad.wmnet [22:22:00] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10719589 (10BTullis) Added the 12 new RAID0 volumes on an-worker1170 and an-worker1171. ` btullis@an-worker1170:~$ sudo... 
[22:24:40] !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1170.eqiad.wmnet [22:24:51] !log btullis@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts an-worker1171.eqiad.wmnet [22:26:29] !log btullis@cumin1002 END (PASS) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=0) for hosts an-worker1171.eqiad.wmnet [22:26:37] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1170.eqiad.wmnet [22:26:54] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10719597 (10ops-monitoring-bot) Host rebooted by btullis@cumin1002 with reason: Reboot post HDD replacement [22:30:54] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1170.eqiad.wmnet [22:32:35] !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-upgrade-varnish (exit_code=0) rolling upgrade of Varnish on A:cp-upload_magru [22:33:20] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P74654 and previous config saved to /var/cache/conftool/dbconfig/20250407-223319-fceratto.json [22:35:05] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1171.eqiad.wmnet [22:35:24] 10ops-eqiad, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11): Upgrade an-worker hard drives from 4TB to 8TB (group 1 - rack F7) - https://phabricator.wikimedia.org/T390168#10719614 (10ops-monitoring-bot) Host rebooted by btullis@cumin1002 with reason: Reboot post HDD replacement [22:41:01] (03PS1) 10Btullis: Bring an-worker117[0-1] back into service [puppet] - 10https://gerrit.wikimedia.org/r/1134768 (https://phabricator.wikimedia.org/T390169) [22:44:17] (03CR) 10Btullis: [C:03+2] Bring an-worker117[0-1] back into service [puppet] - 
10https://gerrit.wikimedia.org/r/1134768 (https://phabricator.wikimedia.org/T390169) (owner: 10Btullis) [22:48:27] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151', diff saved to https://phabricator.wikimedia.org/P74655 and previous config saved to /var/cache/conftool/dbconfig/20250407-224827-fceratto.json [22:51:01] (03PS1) 10Btullis: Temporarily put an-worker1202 back into insetup mode [puppet] - 10https://gerrit.wikimedia.org/r/1134769 (https://phabricator.wikimedia.org/T390048) [22:52:11] (03CR) 10Btullis: [C:03+2] Temporarily put an-worker1202 back into insetup mode [puppet] - 10https://gerrit.wikimedia.org/r/1134769 (https://phabricator.wikimedia.org/T390048) (owner: 10Btullis) [22:57:33] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1171.eqiad.wmnet [23:00:04] Deploy window Web Team deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250407T2300) [23:03:34] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2151 (T391056)', diff saved to https://phabricator.wikimedia.org/P74656 and previous config saved to /var/cache/conftool/dbconfig/20250407-230333-fceratto.json [23:03:37] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [23:03:50] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2158.codfw.wmnet with reason: Maintenance [23:03:52] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-worker1202.eqiad.wmnet with OS bullseye [23:04:05] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on db2187.codfw.wmnet with reason: Maintenance [23:04:12] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2158 (T391056)', diff saved to https://phabricator.wikimedia.org/P74657 and previous config saved to 
/var/cache/conftool/dbconfig/20250407-230411-fceratto.json [23:09:57] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T391056)', diff saved to https://phabricator.wikimedia.org/P74658 and previous config saved to /var/cache/conftool/dbconfig/20250407-230956-fceratto.json [23:09:59] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [23:12:13] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:18:30] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-worker1202.eqiad.wmnet with reason: host reimage [23:18:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10Data-Platform-SRE (2025.03.22 - 2025.04.11), 13Patch-For-Review: Upgrade an-worker hard drives from 4TB to 8TB (group 2 - rack F6) - https://phabricator.wikimedia.org/T390169#10719677 (10BTullis) I have put an-worker1170 and an-worker1171 back into service. 
[23:21:11] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-worker1202.eqiad.wmnet with reason: host reimage [23:25:04] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P74659 and previous config saved to /var/cache/conftool/dbconfig/20250407-232503-fceratto.json [23:35:42] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:37:54] (03PS1) 10Samtar: InitialiseSettings: wgTemplateDataEnableDiscovery on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134771 (https://phabricator.wikimedia.org/T377975) [23:40:02] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1134772 [23:40:03] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1134772 (owner: 10TrainBranchBot) [23:40:11] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158', diff saved to https://phabricator.wikimedia.org/P74660 and previous config saved to /var/cache/conftool/dbconfig/20250407-234011-fceratto.json [23:40:39] (03CR) 10Samtar: [C:04-2] "Hold for recent TemplateData merges to ride the train" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134771 (https://phabricator.wikimedia.org/T377975) (owner: 10Samtar) [23:42:13] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:44:17] !log btullis@cumin1002 START - Cookbook 
sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002" [23:44:20] (03CR) 10Samwilson: [C:03+1] InitialiseSettings: wgTemplateDataEnableDiscovery on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134771 (https://phabricator.wikimedia.org/T377975) (owner: 10Samtar) [23:45:50] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q2:rack/setup/install cloudcontrol1011 - https://phabricator.wikimedia.org/T380499#10719731 (10RobH) [23:51:03] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1134772 (owner: 10TrainBranchBot) [23:55:19] !log fceratto@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2158 (T391056)', diff saved to https://phabricator.wikimedia.org/P74661 and previous config saved to /var/cache/conftool/dbconfig/20250407-235518-fceratto.json [23:55:22] T391056: Drop afl_patrolled_by from abuse_filter_log in production - https://phabricator.wikimedia.org/T391056 [23:55:34] !log fceratto@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2169.codfw.wmnet with reason: Maintenance [23:55:42] !log fceratto@cumin1002 dbctl commit (dc=all): 'Depooling db2169 (T391056)', diff saved to https://phabricator.wikimedia.org/P74662 and previous config saved to /var/cache/conftool/dbconfig/20250407-235541-fceratto.json