[00:05:39] RESOLVED: CirrusSearchNodeIndexingNotIncreasing: Elasticsearch instance cirrussearch2056-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[00:10:15] (PS1) TrainBranchBot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1134096
[00:10:15] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1134096 (owner: TrainBranchBot)
[00:10:48] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[00:11:12] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[00:14:41] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[00:15:42] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10711236 (phaultfinder)
[00:23:44] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[00:23:57] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[00:27:49] FIRING: HelmReleaseBadStatus: Helm release airflow-test-k8s/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-test-k8s - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[00:31:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[00:34:15] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[00:37:12] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[00:37:17] FIRING: [8x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:37:49] RESOLVED: HelmReleaseBadStatus: Helm release airflow-test-k8s/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=airflow-test-k8s - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[00:39:44] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10711261 (phaultfinder)
[00:41:20] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[00:47:03] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[00:51:26] (PS9) Superpes15: [pswiki] Change the logo and wordmark/tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/1031963 (https://phabricator.wikimedia.org/T360851)
[00:51:39] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[01:04:18] (Merged) jenkins-bot: Branch commit for wmf/next [core] (wmf/next) - https://gerrit.wikimedia.org/r/1134096 (owner: TrainBranchBot)
[01:07:03] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[01:28:49] (CR) Superpes15: update wikimaniawiki perms configurations: (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[01:29:42] ops-eqiad, SRE, DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10711344 (phaultfinder)
[01:31:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[01:59:38] (PS5) Robertsky: update wikimaniawiki perms configurations: [mediawiki-config] - https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729)
[02:01:09] (CR) Robertsky: update wikimaniawiki perms configurations: (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[02:18:05] (CR) Superpes15: update wikimaniawiki perms configurations: (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1131119 (https://phabricator.wikimedia.org/T389729) (owner: Robertsky)
[02:25:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:12:12] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:42:12] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:31:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[04:37:12] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[04:39:01] FIRING: [8x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[04:52:03] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[05:02:58] !log on mwmaint1002 ran cleanupBlocks.php on all wikis
[05:02:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:41] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on db1154.eqiad.wmnet with reason: Maintenance in sanitarium
[05:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[05:07:16] !log marostegui@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on db2186.codfw.wmnet with reason: Maintenance in sanitarium
[05:12:03] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[05:31:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[05:56:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250404T0600)
[06:25:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:44:08] !log aqu@deploy1003 Started deploy [airflow-dags/analytics@d6ad899]: Update artifacts for analytics
[06:44:44] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics@d6ad899]: Update artifacts for analytics (duration: 00m 35s)
[06:45:22] !log aqu@deploy1003 Started deploy [airflow-dags/analytics_test@d6ad899]: Update artifacts for analytics_test
[06:45:37] !log aqu@deploy1003 Finished deploy [airflow-dags/analytics_test@d6ad899]: Update artifacts for analytics_test (duration: 00m 15s)
[07:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250404T0700)
[07:12:12] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:42:12] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:30:15] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[08:31:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[08:37:12] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[08:40:23] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[08:42:12] FIRING: [8x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:57:03] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[09:17:03] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[09:29:45] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:31:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[09:39:53] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:44:49] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:45:34] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply
[09:49:44] FIRING: KubernetesDeploymentUnavailableReplicas: ...
[09:49:44] Deployment function-orchestrator-main-orchestrator in wikifunctions at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=wikifunctions&var-deployment=function-orchestrator-main-orchestrator - ...
[09:49:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas
[09:57:19] !log installing vim security updates
[09:57:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:08] I would like an emergency deployment backporting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/1134174 to wmf.23 once it makes it through gate-and-submit (context is T390949) – is that okay for SREs? (cc thcipriani, dancy, andre; I can do the actual deployment myself)
[09:58:09] T390949: Dependency jquery.wikibase.linkitem failed to load on unconnected pages ("Add interlanguage links" broken everywhere) - https://phabricator.wikimedia.org/T390949
[09:58:32] looking at gate-and-submit, I’m guessing this would be in ca. 30-60 minutes from now
[09:59:11] (I don’t think it’s urgent enough to forcibly bump it up in the queue or backport it before it’s merged on master, I’d just like to deploy it before Monday)
[10:00:28] Fine with me (RelEng) if there are no other deployment collisions
[10:01:17] thx :)
[10:02:36] !log mvernon@cumin1002 DONE (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on ms-be1070.eqiad.wmnet with reason: vacuum overlarge container dbs
[10:02:54] !log bulk-VACUUM of container dbs ms-be1070 T377827
[10:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:56] T377827: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827
[10:16:38] Lucas_WMDE: go ahead
[10:25:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:27:35] mmhh wikibugs is gone :(
[10:29:23] I’ll go ahead with my emergency deploy, shout if i should stop (you still have some time while the gate-and-submit runs ^^)
[10:30:36] taavi or Reedy would you mind poking wikibugs ? thank you
[10:30:51] or maybe someone else can too? I'm going off https://toolsadmin.wikimedia.org/tools/id/wikibugs
[10:31:55] I tried to restart it
[10:32:13] it doesn’t seem to have rejoined yet…
[10:32:20] thank you Lucas_WMDE ! yeah I see it is online at least
[10:35:03] huh, but it rejoined #wikimedia-dev
[10:35:06] oh wait, I think I remember this
[10:35:13] it only rejoins the channel once there’s something to comment
[10:35:16] so it’ll rejoin this one if I…
[10:35:23] (CR) Lucas Werkmeister (WMDE): "test comment please ignore" [extensions/Wikibase] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1134190 (https://phabricator.wikimedia.org/T390949) (owner: Lucas Werkmeister (WMDE))
[10:35:26] ^ do that :)
[10:35:30] lol well done
[10:35:51] same behavior as jinxer-wm FWIW, joins on demand
[10:38:59] !log mvernon@cumin1002 START - Cookbook sre.hosts.remove-downtime for ms-be1070.eqiad.wmnet
[10:38:59] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for ms-be1070.eqiad.wmnet
[10:39:38] SRE, SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10712255 (MatthewVernon) ms-be1070 bulk-VACUUMd on sdb3.
[10:40:20] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1134188 (https://phabricator.wikimedia.org/T391083) (owner: Filippo Giunchedi)
[10:40:47] (CR) Filippo Giunchedi: [C:+2] wmflib: postgresql_version add trixie [puppet] - https://gerrit.wikimedia.org/r/1134188 (https://phabricator.wikimedia.org/T391083) (owner: Filippo Giunchedi)
[10:40:57] (PS2) Filippo Giunchedi: wmflib: postgresql_version add trixie [puppet] - https://gerrit.wikimedia.org/r/1134188 (https://phabricator.wikimedia.org/T391083)
[10:41:26] (CR) Filippo Giunchedi: [V:+2 C:+2] wmflib: postgresql_version add trixie [puppet] - https://gerrit.wikimedia.org/r/1134188 (https://phabricator.wikimedia.org/T391083) (owner: Filippo Giunchedi)
[10:41:34] SRE, SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10712263 (Ladsgroup) Thanks! it went from 94% to 78%: {F58979334} (If you write instructions down somewhere, I'll do it myself next...
[10:42:05] (CR) Muehlenhoff: [C:+1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/1134187 (https://phabricator.wikimedia.org/T391083) (owner: Filippo Giunchedi)
[10:42:49] (CR) Tiziano Fogli: [C:+2] ripe atlas anchors: icmp to http check [puppet] - https://gerrit.wikimedia.org/r/1127552 (https://phabricator.wikimedia.org/T388419) (owner: Tiziano Fogli)
[10:44:13] (CR) Ilias Sarantopoulos: [C:+2] ml-services: fix edit-check blubber image [deployment-charts] - https://gerrit.wikimedia.org/r/1133952 (owner: Ilias Sarantopoulos)
[10:45:39] (Merged) jenkins-bot: Add Item and CustomItem classes as properties to `$.ui.ooMenu` [extensions/Wikibase] (wmf/1.44.0-wmf.23) - https://gerrit.wikimedia.org/r/1134190 (https://phabricator.wikimedia.org/T390949) (owner: Lucas Werkmeister (WMDE))
[10:45:41] (CR) Filippo Giunchedi: [C:+2] uwsgi: trixie support [puppet] - https://gerrit.wikimedia.org/r/1134187 (https://phabricator.wikimedia.org/T391083) (owner: Filippo Giunchedi)
[10:45:43] (Merged) jenkins-bot: ml-services: fix edit-check blubber image [deployment-charts] - https://gerrit.wikimedia.org/r/1133952 (owner: Ilias Sarantopoulos)
[10:46:25] !log lucaswerkmeister-wmde@deploy1003 Started scap sync-world: Backport for [[gerrit:1134190|Add Item and CustomItem classes as properties to `$.ui.ooMenu` (T390949)]]
[10:46:28] T390949: Dependency jquery.wikibase.linkitem failed to load on unconnected pages ("Add interlanguage links" broken everywhere) - https://phabricator.wikimedia.org/T390949
[10:47:39] (CR) Vgutierrez: ssl_ciphersuite: drop stretch support (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1134179 (owner: Filippo Giunchedi)
[10:51:24] (PS3) Vgutierrez: liberica: Allow configuring UDP services [puppet] - https://gerrit.wikimedia.org/r/1128892 (https://phabricator.wikimedia.org/T389210)
[10:51:24] (PS3) Vgutierrez: wmflib,liberica: Add support for DNS healthchecks [puppet] - https://gerrit.wikimedia.org/r/1129326 (https://phabricator.wikimedia.org/T389211)
[10:51:24] (PS1) Vgutierrez: wmflib,liberica: Add support for NTP healthchecks [puppet] - https://gerrit.wikimedia.org/r/1134197 (https://phabricator.wikimedia.org/T389212)
[10:54:03] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Backport for [[gerrit:1134190|Add Item and CustomItem classes as properties to `$.ui.ooMenu` (T390949)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[10:54:06] T390949: Dependency jquery.wikibase.linkitem failed to load on unconnected pages ("Add interlanguage links" broken everywhere) - https://phabricator.wikimedia.org/T390949
[10:54:20] SRE, SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10712297 (MatthewVernon) I don't mind doing it myself, but I've written up the slightly hacky process I use: https://wikitech.wikimed...
[10:54:30] !log lucaswerkmeister-wmde@deploy1003 lucaswerkmeister-wmde: Continuing with sync
[10:56:05] (CR) Kamila Součková: [C:+1] helmfile: remove videoscaler references, replace jobrunner with mw-jobrunner [deployment-charts] - https://gerrit.wikimedia.org/r/1134185 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[10:57:15] (CR) Vgutierrez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1134197 (https://phabricator.wikimedia.org/T389212) (owner: Vgutierrez)
[11:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250404T0700)
[11:00:05] jelto, arnoldokoth, and mutante: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for GitLab version upgrades. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20250404T1100).
[11:01:29] !log lucaswerkmeister-wmde@deploy1003 Finished scap sync-world: Backport for [[gerrit:1134190|Add Item and CustomItem classes as properties to `$.ui.ooMenu` (T390949)]] (duration: 15m 04s)
[11:01:32] T390949: Dependency jquery.wikibase.linkitem failed to load on unconnected pages ("Add interlanguage links" broken everywhere) - https://phabricator.wikimedia.org/T390949
[11:04:01] FIRING: [8x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:04:16] * Lucas_WMDE done deploying
[11:07:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.621s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:12:12] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:12:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: codfw mw-parsoid releases routed via main (k8s) 1.151s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[11:15:03] (PS1) Btullis: Configure the ceph-csi-rbd storageclass to retain PVs [deployment-charts] - https://gerrit.wikimedia.org/r/1134200 (https://phabricator.wikimedia.org/T391087)
[11:19:47] (CR) Hnowlan: [C:+2] helmfile: remove videoscaler references, replace jobrunner with mw-jobrunner [deployment-charts] - https://gerrit.wikimedia.org/r/1134185 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[11:22:10] (Merged) jenkins-bot: helmfile: remove videoscaler references, replace jobrunner with mw-jobrunner [deployment-charts] - https://gerrit.wikimedia.org/r/1134185 (https://phabricator.wikimedia.org/T354791) (owner: Hnowlan)
[11:22:12] FIRING: [4x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:23:41] (PS2) Btullis: Configure the ceph-csi-rbd storageclass to retain PVs [deployment-charts] - https://gerrit.wikimedia.org/r/1134200 (https://phabricator.wikimedia.org/T391087)
[11:24:01] RESOLVED: [4x] ProbeDown: Service ripe-atlas-codfw:0 has failed probes (icmp_ripe_atlas_codfw_ip4) - https://wikitech.wikimedia.org/wiki/Network_monitoring#ProbeDown - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:30:25] RESOLVED: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:30:57] (PS3) Btullis: Configure the ceph-csi-rbd storageclass to retain PVs [deployment-charts] - https://gerrit.wikimedia.org/r/1134200 (https://phabricator.wikimedia.org/T391087)
[11:39:00] SRE, SRE-swift-storage: Disk near-full warnings on ms swift backends for container filesystems due to some bloated sqlite files - https://phabricator.wikimedia.org/T377827#10712461 (Ladsgroup) Thanks! I make sure to keep an eye on them and act if needed.
[11:42:12] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:55:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 11.8% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[11:57:15] FIRING: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[11:57:57] SRE, Wikimedia-Incident: Original exception: [c978e1ae-1ac4-4b9d-83cf-077c7a3e7609] 2025-04-04 11:55:32: Fatal exception of type "Wikimedia\Rdbms\DBUnexpectedError" - https://phabricator.wikimedia.org/T391099#10712574 (Iniquity)
[11:58:16] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 18.19s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:00:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 2.727% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:02:15] RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[12:03:16] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 18.49s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[12:09:43] (CR) Alexandros Kosiaris: [C:+2] service: Cleanup of wikifunctions [puppet] - https://gerrit.wikimedia.org/r/1133940 (https://phabricator.wikimedia.org/T384944) (owner: Alexandros Kosiaris)
[12:09:55] (PS5) Alexandros Kosiaris: service: Cleanup of wikifunctions [puppet] - https://gerrit.wikimedia.org/r/1133940 (https://phabricator.wikimedia.org/T384944)
[12:10:40] (CR) Alexandros Kosiaris: [C:+2] service: Cleanup of wikifunctions [puppet] - https://gerrit.wikimedia.org/r/1133940 (https://phabricator.wikimedia.org/T384944) (owner: Alexandros Kosiaris)
[12:14:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 9.224% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:16:15] FIRING: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[12:18:08] (PS1) Dreamy Jazz: Remove wgCheckUserCentralIndexRangesToExclude definition [mediawiki-config] - https://gerrit.wikimedia.org/r/1134203 (https://phabricator.wikimedia.org/T389055)
[12:19:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 20.36% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[12:21:15] RESOLVED: [4x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[12:29:52] (PS2) Alexandros Kosiaris: wikifunctions: sextant update function-orchestrator [deployment-charts] - https://gerrit.wikimedia.org/r/1134170
[12:29:52] (PS2) Alexandros Kosiaris: wikifunctions: sextant update function-evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/1134172
[12:29:52] (PS1) Alexandros Kosiaris: mesh.configuration: Add sni_rewrites_host_header toggle [deployment-charts] - https://gerrit.wikimedia.org/r/1134205
[12:29:53] (PS1) Alexandros Kosiaris: mesh::configuration: Add sni_rewrites_host_header [deployment-charts] - https://gerrit.wikimedia.org/r/1134206
[12:31:09] (CR) CI reject: [V:-1] wikifunctions: sextant update function-orchestrator [deployment-charts] - https://gerrit.wikimedia.org/r/1134170 (owner: Alexandros Kosiaris)
[12:31:16] (CR) CI reject: [V:-1] wikifunctions: sextant update function-evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/1134172 (owner: Alexandros Kosiaris)
[12:31:24] (CR) CI reject: [V:-1] mesh::configuration: Add sni_rewrites_host_header [deployment-charts] - https://gerrit.wikimedia.org/r/1134206 (owner: Alexandros Kosiaris)
[12:31:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable
[12:34:46] (PS2) Alexandros Kosiaris: mesh.configuration: Add sni_rewrites_host_header toggle (c/p) [deployment-charts] - https://gerrit.wikimedia.org/r/1134205
[12:34:46] (PS2) Alexandros Kosiaris: mesh::configuration: Add sni_rewrites_host_header toggle [deployment-charts] - https://gerrit.wikimedia.org/r/1134206
[12:34:46] (PS3) Alexandros Kosiaris: wikifunctions: sextant update function-orchestrator [deployment-charts] - https://gerrit.wikimedia.org/r/1134170
[12:34:47] (PS3) Alexandros Kosiaris: wikifunctions: sextant update function-evaluator [deployment-charts] - https://gerrit.wikimedia.org/r/1134172
[12:37:12] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh
[12:38:12] (PS1) MVernon: Add apus-fe2003 to hiera and conftool [puppet] - https://gerrit.wikimedia.org/r/1134208 (https://phabricator.wikimedia.org/T390578)
[12:42:53] (PS6) Arturo Borrero Gonzalez: openstack: networktests: support IPv6 and IPv4-only networks [puppet] - https://gerrit.wikimedia.org/r/1097440 (https://phabricator.wikimedia.org/T380728)
[12:43:00] (CR) Arturo Borrero Gonzalez: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1097440 (https://phabricator.wikimedia.org/T380728) (owner: Arturo Borrero Gonzalez)
[12:44:09] (PS1) MVernon: Add two new ms-fe nodes [puppet] - https://gerrit.wikimedia.org/r/1134210 (https://phabricator.wikimedia.org/T388887)
[12:58:02] SRE, SRE-Unowned, serviceops-radar, wikitech.wikimedia.org, Patch-For-Review: Redesign wikitech-static - https://phabricator.wikimedia.org/T376400#10712758 (Andrew) Hello @Volans ! I think I've addressed almost all of your bullet points above -- do you mind retesting? I needed to rebuild the...
[12:58:50] (PS1) Alexandros Kosiaris: Add sni_rewrites_host_header [puppet] - https://gerrit.wikimedia.org/r/1134217
[13:02:03] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData
[13:02:27] (CR) Dreamy Jazz: [C:-2] "Shouldn't be merged until after Ida423d1a2ae8873b89b2611e0e631816e6b39365 is merged and then until the next train split has happened.
This" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134203 (https://phabricator.wikimedia.org/T389055) (owner: 10Dreamy Jazz) [13:05:03] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:05:06] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:07:51] (03PS1) 10MVernon: Thanos: add new thanos-fe200[5-7] nodes [puppet] - 10https://gerrit.wikimedia.org/r/1134221 (https://phabricator.wikimedia.org/T389634) [13:10:20] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10712801 (10MatthewVernon) Yeah, I doubt the size of disk is critical here (as long as we end up with an 8T disk back in when we're done... [13:10:49] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:11:46] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [13:12:09] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134217 (owner: 10Alexandros Kosiaris) [13:12:12] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:12:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 6.32% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:13:15] FIRING: [6x] 
MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:13:57] FIRING: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:14:47] here [13:15:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 11.68s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:17:12] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:17:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-parsoid releases routed via main at eqiad: 8.772% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-container_name=All&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [13:18:15] RESOLVED: [8x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [13:18:33] (03CR) 
10MVernon: [C:03+1] "This looks sensible to me, thanks. I've pinged the Ceph IRC channel in case anyone else wants to provide input." [puppet] - 10https://gerrit.wikimedia.org/r/1133916 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [13:18:42] time to ban! :) [13:18:57] RESOLVED: ProbeDown: Service mw-parsoid:4452 has failed probes (http_mw-parsoid_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-parsoid:4452 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:20:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-parsoid releases routed via main (k8s) 11.68s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-parsoid&var-release=main - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [13:20:55] (03CR) 10Phedenskog: "I see now, when I tested I just did one of the queries. This makes sense, sorry!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1133152 (https://phabricator.wikimedia.org/T325283) (owner: 10Tiziano Fogli) [13:22:03] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [13:31:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [13:34:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10712908 (10phaultfinder) [13:36:20] (03CR) 10Ladsgroup: [C:03+1] Add two new ms-fe nodes [puppet] - 10https://gerrit.wikimedia.org/r/1134210 (https://phabricator.wikimedia.org/T388887) (owner: 10MVernon) [13:45:33] (03PS2) 10Alexandros Kosiaris: Add sni_rewrites_host_header [puppet] - 10https://gerrit.wikimedia.org/r/1134217 [13:49:44] FIRING: KubernetesDeploymentUnavailableReplicas: ... [13:49:44] Deployment function-orchestrator-main-orchestrator in wikifunctions at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=wikifunctions&var-deployment=function-orchestrator-main-orchestrator - ... 
[13:49:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [13:52:20] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134217 (owner: 10Alexandros Kosiaris) [13:52:21] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134217 (owner: 10Alexandros Kosiaris) [13:57:48] !bash hnowlan> when in doubt, assume some AI shit [13:57:48] Amir1: Stored quip at https://bash.toolforge.org/quip/xrYXAZYBffdvpiTr7yar [13:58:26] b& [14:01:16] * Lucas_WMDE edits in the missing < [14:07:27] (03CR) 10JHathaway: [C:03+1] ruby: move to .exist? [puppet] - 10https://gerrit.wikimedia.org/r/1134189 (https://phabricator.wikimedia.org/T391083) (owner: 10Filippo Giunchedi) [14:10:23] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10712996 (10Vgutierrez) Hi @VRiley-WMF, after discussing this with @cmooney it looks like we need to swap lvs1016 and lvs1017 for a while so we can install the Mellanox card in lvs1017 an... [14:13:20] (03PS1) 10Alexandros Kosiaris: DNM: dummy PCC check patch [puppet] - 10https://gerrit.wikimedia.org/r/1134230 [14:17:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10713009 (10cmooney) >>! In T387145#10712995, @Vgutierrez wrote: > Hi @VRiley-WMF, after discussing this with @cmooney it looks like we need to swap lvs1016 and lvs1017 for a while so we... 
[14:17:20] (03CR) 10JHathaway: [C:03+2] efi: add efi fact to facter [puppet] - 10https://gerrit.wikimedia.org/r/1133491 (https://phabricator.wikimedia.org/T389217) (owner: 10JHathaway) [14:20:31] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1134189 (https://phabricator.wikimedia.org/T391083) (owner: 10Filippo Giunchedi) [14:21:39] FIRING: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [14:23:37] 10ops-codfw, 06SRE, 10SRE-swift-storage, 06DC-Ops, 06Infrastructure-Foundations: Perform fake disk swap on ms-be2088 as test - https://phabricator.wikimedia.org/T384003#10713025 (10Jhancock.wm) We can try this today then, I have plenty of 4TB disks on hand that we can try. [14:24:13] (03PS1) 10Cathal Mooney: Cloudsw: adjust routing-policies to reflect change to IBGP [homer/public] - 10https://gerrit.wikimedia.org/r/1134234 (https://phabricator.wikimedia.org/T389958) [14:24:16] (03CR) 10Alexandros Kosiaris: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134230 (owner: 10Alexandros Kosiaris) [14:26:39] RESOLVED: [2x] TransitBGPDown: Transit BGP session down between cr2-magru and Hurricane Electric (187.16.221.197) - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DTransitBGPDown [14:31:18] 06SRE, 06Infrastructure-Foundations, 10Puppet-Core: Add Puppet fact to determine the boot method - https://phabricator.wikimedia.org/T389217#10713052 (10jhathaway) 05Open→03Resolved a:03jhathaway fact merged [14:31:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - 
https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [14:36:14] (03PS3) 10Alexandros Kosiaris: mesh::configuration: Add sni_rewrites_host_header toggle [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134206 [14:36:14] (03PS4) 10Alexandros Kosiaris: wikifunctions: sextant update function-orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134170 [14:36:14] (03PS4) 10Alexandros Kosiaris: wikifunctions: sextant update function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134172 [14:37:32] (03CR) 10CI reject: [V:04-1] mesh::configuration: Add sni_rewrites_host_header toggle [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134206 (owner: 10Alexandros Kosiaris) [14:37:37] (03CR) 10CI reject: [V:04-1] wikifunctions: sextant update function-orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134170 (owner: 10Alexandros Kosiaris) [14:37:44] (03CR) 10CI reject: [V:04-1] wikifunctions: sextant update function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134172 (owner: 10Alexandros Kosiaris) [14:38:11] 10ops-eqiad, 06SRE, 06DC-Ops, 06Traffic: Q3:test NIC for lvs1017 or lvs1018 - https://phabricator.wikimedia.org/T387145#10713076 (10Vgutierrez) reimaging them is fine by me [14:39:28] (03PS1) 10Eevans: sessionstore alert lint problem [alerts] - 10https://gerrit.wikimedia.org/r/1134241 [14:40:33] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team, 13Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10713099 (10thcipriani) >>! In T3... 
[14:41:07] (03CR) 10CI reject: [V:04-1] sessionstore alert lint problem [alerts] - 10https://gerrit.wikimedia.org/r/1134241 (owner: 10Eevans) [14:43:06] (03PS4) 10Alexandros Kosiaris: mesh::configuration: Add sni_rewrites_host_header toggle [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134206 [14:43:06] (03PS5) 10Alexandros Kosiaris: wikifunctions: sextant update function-orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134170 [14:43:06] (03PS5) 10Alexandros Kosiaris: wikifunctions: sextant update function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134172 [14:43:07] (03PS1) 10Scott French: mw-(api-ext|web): additional post-migration cleanup [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134100 (https://phabricator.wikimedia.org/T383845) [14:43:14] (03PS1) 10Scott French: shellbox*: normalize migration releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134098 (https://phabricator.wikimedia.org/T377038) [14:43:18] !log Extending root vg on mwmaint1002 by 20GB [14:43:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:26] (03PS1) 10Tiziano Fogli: ripe atlas anchors: change hiera device name [puppet] - 10https://gerrit.wikimedia.org/r/1134235 (https://phabricator.wikimedia.org/T388419) [14:45:50] !log Deploying refinery for T389162 [14:45:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:53] T389162: [Data Quality] Add ability to add tags to alerts - https://phabricator.wikimedia.org/T389162 [14:46:03] (03CR) 10Alexandros Kosiaris: [C:03+2] mesh.configuration: Add sni_rewrites_host_header toggle (c/p) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134205 (owner: 10Alexandros Kosiaris) [14:46:05] (03CR) 10Alexandros Kosiaris: [C:03+2] mesh::configuration: Add sni_rewrites_host_header toggle [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134206 (owner: 10Alexandros Kosiaris) [14:46:09] (03CR) 10Alexandros 
Kosiaris: [C:03+2] wikifunctions: sextant update function-orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134170 (owner: 10Alexandros Kosiaris) [14:46:12] (03CR) 10Alexandros Kosiaris: [C:03+2] wikifunctions: sextant update function-evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134172 (owner: 10Alexandros Kosiaris) [14:46:29] !log tchin@deploy1003 Started deploy [analytics/refinery@c4ab9ef] (hadoop-test): TEST [analytics/refinery@c4ab9efd] [14:46:32] (03Abandoned) 10Eevans: sessionstore alert lint problem [alerts] - 10https://gerrit.wikimedia.org/r/1134241 (owner: 10Eevans) [14:47:33] (03Merged) 10jenkins-bot: mesh.configuration: Add sni_rewrites_host_header toggle (c/p) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134205 (owner: 10Alexandros Kosiaris) [14:47:33] (03PS1) 10Fabfur: external_cloud_vendors: Added Google SpeciaCaseCrawlers list [puppet] - 10https://gerrit.wikimedia.org/r/1134243 (https://phabricator.wikimedia.org/T391108) [14:47:38] (03Merged) 10jenkins-bot: mesh::configuration: Add sni_rewrites_host_header toggle [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134206 (owner: 10Alexandros Kosiaris) [14:47:46] (03Abandoned) 10Alexandros Kosiaris: DNM: dummy PCC check patch [puppet] - 10https://gerrit.wikimedia.org/r/1134230 (owner: 10Alexandros Kosiaris) [14:47:51] (03CR) 10Alexandros Kosiaris: [C:03+2] Add sni_rewrites_host_header [puppet] - 10https://gerrit.wikimedia.org/r/1134217 (owner: 10Alexandros Kosiaris) [14:47:54] (03Merged) 10jenkins-bot: wikifunctions: sextant update function-orchestrator [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134170 (owner: 10Alexandros Kosiaris) [14:48:05] (03CR) 10CI reject: [V:04-1] external_cloud_vendors: Added Google SpeciaCaseCrawlers list [puppet] - 10https://gerrit.wikimedia.org/r/1134243 (https://phabricator.wikimedia.org/T391108) (owner: 10Fabfur) [14:48:14] (03Merged) 10jenkins-bot: wikifunctions: sextant update function-evaluator 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1134172 (owner: 10Alexandros Kosiaris) [14:48:52] (03CR) 10Clément Goubert: [C:03+1] mw-(api-ext|web): additional post-migration cleanup [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134100 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [14:49:30] !log tchin@deploy1003 Finished deploy [analytics/refinery@c4ab9ef] (hadoop-test): TEST [analytics/refinery@c4ab9efd] (duration: 03m 01s) [14:49:57] (03CR) 10Clément Goubert: [C:03+1] shellbox*: normalize migration releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134098 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [14:50:00] (03PS2) 10Tiziano Fogli: auth_metrics: add recording rules for grafana widgets [puppet] - 10https://gerrit.wikimedia.org/r/1133924 (https://phabricator.wikimedia.org/T390672) [14:50:49] !log tchin@deploy1003 Started deploy [analytics/refinery@c4ab9ef]: [analytics/refinery@c4ab9efd] [14:52:08] (03PS2) 10Fabfur: external_cloud_vendors: Added Google SpeciaCaseCrawlers list [puppet] - 10https://gerrit.wikimedia.org/r/1134243 (https://phabricator.wikimedia.org/T391108) [14:52:28] (03CR) 10Tiziano Fogli: [C:03+2] auth_metrics: add recording rules for grafana widgets [puppet] - 10https://gerrit.wikimedia.org/r/1133924 (https://phabricator.wikimedia.org/T390672) (owner: 10Tiziano Fogli) [14:53:44] !log tchin@deploy1003 Finished deploy [analytics/refinery@c4ab9ef]: [analytics/refinery@c4ab9efd] (duration: 02m 54s) [14:54:09] !log tchin@deploy1003 Started deploy [analytics/refinery@c4ab9ef] (thin): THIN [analytics/refinery@c4ab9efd] [14:54:29] (03CR) 10Arturo Borrero Gonzalez: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1097440 (https://phabricator.wikimedia.org/T380728) (owner: 10Arturo Borrero Gonzalez) [14:54:56] (03CR) 10Scott French: "Thanks for the reviews!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1134100 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [14:55:05] (03CR) 10Scott French: [C:03+2] mw-(api-ext|web): additional post-migration cleanup [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134100 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [14:55:08] !log tchin@deploy1003 Finished deploy [analytics/refinery@c4ab9ef] (thin): THIN [analytics/refinery@c4ab9efd] (duration: 00m 59s) [14:55:11] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM, thanks." [homer/public] - 10https://gerrit.wikimedia.org/r/1134234 (https://phabricator.wikimedia.org/T389958) (owner: 10Cathal Mooney) [14:55:18] (03PS1) 10Eevans: sessionstore alert lint errors [alerts] - 10https://gerrit.wikimedia.org/r/1134247 [14:56:48] (03Merged) 10jenkins-bot: mw-(api-ext|web): additional post-migration cleanup [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134100 (https://phabricator.wikimedia.org/T383845) (owner: 10Scott French) [14:59:28] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:59:47] (03PS2) 10Eevans: sessionstore alert lint errors [alerts] - 10https://gerrit.wikimedia.org/r/1134247 [15:00:07] (03CR) 10Scott French: [C:03+2] shellbox*: normalize migration releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134098 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [15:00:20] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:00:24] (03CR) 10Filippo Giunchedi: [C:03+1] ripe atlas anchors: change hiera device name [puppet] - 10https://gerrit.wikimedia.org/r/1134235 (https://phabricator.wikimedia.org/T388419) (owner: 10Tiziano Fogli) [15:00:43] (03CR) 10Scott French: [C:03+1] "Whoops, thank you!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1134247 (owner: 10Eevans) [15:01:48] (03Merged) 10jenkins-bot: shellbox*: normalize migration releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134098 (https://phabricator.wikimedia.org/T377038) (owner: 10Scott French) [15:03:14] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:03:34] !log swfrench@deploy1003 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:03:51] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:03:51] !log swfrench@deploy1003 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:04:14] (03CR) 10Arturo Borrero Gonzalez: [C:03+2] openstack: networktests: support IPv6 and IPv4-only networks [puppet] - 10https://gerrit.wikimedia.org/r/1097440 (https://phabricator.wikimedia.org/T380728) (owner: 10Arturo Borrero Gonzalez) [15:04:38] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:04:44] RESOLVED: KubernetesDeploymentUnavailableReplicas: ... [15:04:44] Deployment function-orchestrator-main-orchestrator in wikifunctions at codfw has persistently unavailable replicas - https://wikitech.wikimedia.org/wiki/Kubernetes/Troubleshooting#Troubleshooting_a_deployment - https://grafana.wikimedia.org/d/a260da06-259a-4ee4-9540-5cab01a246c8/kubernetes-deployment-details?var-site=codfw&var-cluster=k8s&var-namespace=wikifunctions&var-deployment=function-orchestrator-main-orchestrator - ... 
[15:04:44] https://alerts.wikimedia.org/?q=alertname%3DKubernetesDeploymentUnavailableReplicas [15:05:14] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:06:42] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:11:21] !log tchin@deploy1003 Started deploy [airflow-dags/analytics@bece0a7]: (no justification provided) [15:11:53] !log tchin@deploy1003 Finished deploy [airflow-dags/analytics@bece0a7]: (no justification provided) (duration: 00m 34s) [15:12:01] (03CR) 10MVernon: [C:03+1] "👍" [alerts] - 10https://gerrit.wikimedia.org/r/1134247 (owner: 10Eevans) [15:12:12] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:16:49] (03PS1) 10Alexandros Kosiaris: mw-wikifunctions: Remove default from gatewayHosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134251 (https://phabricator.wikimedia.org/T384944) [15:24:54] (03PS3) 10Umherirrender: Improve function and property documentation for php code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) [15:28:33] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests, 06Machine-Learning-Team, 13Patch-For-Review: Requesting access to analytics-privatedata-users & Kerberos identity & deployment POSIX group & ml-team-admins for Ozge Karakaya - https://phabricator.wikimedia.org/T390855#10713369 (10thcipriani) [15:29:37] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - 
https://phabricator.wikimedia.org/T390778#10713384 (10phaultfinder) [15:31:56] (03PS4) 10Umherirrender: Improve function and property documentation for php code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) [15:32:00] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134106 (https://phabricator.wikimedia.org/T390420) (owner: 10C. Scott Ananian) [15:32:54] (03CR) 10Umherirrender: Improve function and property documentation for php code (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender) [15:32:55] (03CR) 10CI reject: [V:04-1] Improve function and property documentation for php code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) (owner: 10Umherirrender) [15:34:13] (03PS5) 10Umherirrender: Improve function and property documentation for php code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1130201 (https://phabricator.wikimedia.org/T171115) [15:36:27] (03CR) 10Eevans: [C:03+2] sessionstore alert lint errors [alerts] - 10https://gerrit.wikimedia.org/r/1134247 (owner: 10Eevans) [15:36:42] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:41:58] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:42:01] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:42:12] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:42:39] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1134107 (https://phabricator.wikimedia.org/T376048) (owner: 10C. Scott Ananian) [15:43:20] (03CR) 10Bking: "Cool, do y'all mind if we wait until after the OpenSearch migration in T388610? We are renaming the hosts from elastic* to cirrussearch* a" [puppet] - 10https://gerrit.wikimedia.org/r/1130162 (https://phabricator.wikimedia.org/T387569) (owner: 10Bking) [15:43:37] o/ I will deploy a private MediaWiki change [15:46:03] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:46:04] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:46:11] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:46:44] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:48:58] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:49:01] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [15:58:41] (03CR) 10Dzahn: [C:03+2] Phabricator: Update recipients of quarterly metrics mail [puppet] - 10https://gerrit.wikimedia.org/r/1134176 (owner: 10Aklapper) [16:01:09] 06SRE, 06Infrastructure-Foundations, 10netops, 07IPv6, 13Patch-For-Review: WMCS Eqiad: Enable IPv6 in cloud vrf on switches - https://phabricator.wikimedia.org/T389958#10713522 (10cmooney) [16:04:20] (03CR) 10Scott French: [C:03+1] Create insetup role for ServiceOps with 
nftables and rename existing one [puppet] - 10https://gerrit.wikimedia.org/r/1133927 (https://phabricator.wikimedia.org/T389825) (owner: 10Muehlenhoff) [16:09:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-a4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390787#10713554 (10phaultfinder) [16:17:18] (03CR) 10Dzahn: [C:03+2] hiera: cleanup some gerrit and etherpad hiera values [puppet] - 10https://gerrit.wikimedia.org/r/1133996 (https://phabricator.wikimedia.org/T390948) (owner: 10Dzahn) [16:19:55] finished [16:20:35] (03CR) 10Dzahn: hiera: cleanup gitlab-runner docker gc settings (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1133992 (https://phabricator.wikimedia.org/T390948) (owner: 10Dzahn) [16:21:04] (03PS2) 10Dzahn: hiera: cleanup gitlab-runner docker gc settings [puppet] - 10https://gerrit.wikimedia.org/r/1133992 (https://phabricator.wikimedia.org/T390948) [16:22:08] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:22:58] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:30:42] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10713648 (10phaultfinder) [16:32:25] FIRING: SystemdUnitFailed: httpbb_kubernetes_mw-jobrunner_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:35:43] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:37:12] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - 
https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [16:37:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-jobrunner_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:45:39] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:45:57] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:46:01] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:49:13] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:49:16] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:53:54] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:53:57] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:57:24] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:57:27] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:59:22] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [16:59:25] !log btullis@deploy1003 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:00:04] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:01:04] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: 
apply [17:03:02] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:04:44] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:07:03] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [17:09:24] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:09:46] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:10:26] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:10:29] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:10:38] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:11:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:12:09] !log amastilovic@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/airflow-test-k8s: apply [17:16:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [17:27:03] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [17:31:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad 
is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:34:50] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1185 - https://phabricator.wikimedia.org/T391049#10713942 (10VRiley-WMF) a:03VRiley-WMF [17:49:57] (03PS1) 10Ebernhardson: mjolnir: temp remove msearch daemon from codfw [puppet] - 10https://gerrit.wikimedia.org/r/1134268 [17:50:56] (03PS2) 10Ebernhardson: mjolnir: temp remove msearch daemon from codfw [puppet] - 10https://gerrit.wikimedia.org/r/1134268 [17:51:22] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134268 (owner: 10Ebernhardson) [17:56:49] (03PS3) 10Ebernhardson: mjolnir: temp remove msearch daemon from codfw [puppet] - 10https://gerrit.wikimedia.org/r/1134268 [17:56:59] (03CR) 10Ebernhardson: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134268 (owner: 10Ebernhardson) [18:01:53] (03PS4) 10Ebernhardson: mjolnir: temp remove msearch daemon from codfw [puppet] - 10https://gerrit.wikimedia.org/r/1134268 [18:16:19] 06SRE: NDA request coverage for KFrancis's PTO - https://phabricator.wikimedia.org/T391032#10714046 (10Dzahn) 05Open→03In progress Let's just keep this open and close it on April 21. That's the simplest way because it shows up on the dashboards for access request that way. Optionally we can do the mail alia... [18:16:30] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: Degraded RAID on db1185 - https://phabricator.wikimedia.org/T391049#10714048 (10VRiley-WMF) Opened up an SR for this from dell and ordered a drive. 
Dell SR 208070871 [18:17:04] 06SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: NDA request coverage for KFrancis's PTO - https://phabricator.wikimedia.org/T391032#10714052 (10Dzahn) [18:34:24] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2045.codfw.wmnet with OS bookworm [18:34:31] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10714104 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm [18:42:30] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4: install ssds into ganeti20[45-50] - https://phabricator.wikimedia.org/T390320#10714137 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [18:44:06] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti2046.codfw.wmnet with OS bookworm [18:44:14] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10714144 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host ganeti2046.codfw.wmnet with OS bookworm [18:46:21] (03CR) 10Bking: [C:03+2] mjolnir: temp remove msearch daemon from codfw [puppet] - 10https://gerrit.wikimedia.org/r/1134268 (owner: 10Ebernhardson) [18:48:49] (03CR) 10Alexandros Kosiaris: [C:03+2] mw-wikifunctions: Remove default from gatewayHosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134251 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [18:49:52] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host ganeti2045.codfw.wmnet with OS bookworm [18:49:58] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q3:rack/setup/install ganeti20[45-50] - https://phabricator.wikimedia.org/T384838#10714154 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host ganeti2045.codfw.wmnet with OS bookworm executed with err... [18:50:20] (03Merged) 10jenkins-bot: mw-wikifunctions: Remove default from gatewayHosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134251 (https://phabricator.wikimedia.org/T384944) (owner: 10Alexandros Kosiaris) [18:55:42] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:56:55] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/mw-wikifunctions: apply [18:57:13] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/mw-wikifunctions: apply [18:57:30] (03CR) 10Herron: [C:03+1] wdqs-update-lag: don't count wdqs-categories lag [puppet] - 10https://gerrit.wikimedia.org/r/1133554 (owner: 10Ryan Kemper) [18:58:00] (03CR) 10Herron: [C:03+1] prometheus: cleanup k8s instances from prometheus100[56] [puppet] - 10https://gerrit.wikimedia.org/r/1133910 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [18:58:11] (03CR) 10Herron: [C:03+1] prometheus: cleanup k8s instances from prometheus200[56] [puppet] - 10https://gerrit.wikimedia.org/r/1133909 (https://phabricator.wikimedia.org/T383232) (owner: 10Filippo Giunchedi) [19:03:21] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/mw-wikifunctions: apply [19:03:29] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/mw-wikifunctions: apply [19:09:41] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10714174 (10phaultfinder) [19:12:12] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:15:15] (03PS1) 10Alexandros Kosiaris: ats: Switch mw-wikifunctions back to original FQDN [puppet] - 10https://gerrit.wikimedia.org/r/1134281 [19:16:49] (03PS1) 10Alexandros Kosiaris: Remove mw-wikifunctions-ingress RRs [dns] - 10https://gerrit.wikimedia.org/r/1134282 [19:32:48] (03PS1) 10Btullis: Reduce the verbosity of pgbouncer logs in airflow deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134283 (https://phabricator.wikimedia.org/T362788) [19:34:05] (03CR) 10Ladsgroup: [C:03+1] "Thanks for the quick fix!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134283 (https://phabricator.wikimedia.org/T362788) (owner: 10Btullis) [19:34:43] (03Abandoned) 10Bking: cirrussearch: Enable new role with existing alias [puppet] - 10https://gerrit.wikimedia.org/r/1133230 (https://phabricator.wikimedia.org/T388610) (owner: 10Bking) [19:35:50] (03PS2) 10Btullis: Reduce the verbosity of pgbouncer logs in airflow deployments [deployment-charts] - 10https://gerrit.wikimedia.org/r/1134283 (https://phabricator.wikimedia.org/T362788) [19:37:46] (03PS1) 10Ebernhardson: Bump ltr plugin to 1.5.4-wmf1-os1.3.20 [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1134285 [19:38:42] (03PS14) 10Bking: elasticsearch rolling-operation: add arguments for rename & reimage cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/1131446 (https://phabricator.wikimedia.org/T383811) [19:42:12] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:42:41] (03PS1) 10Majavah: realm: Drop unused $network_zone global [puppet] - 10https://gerrit.wikimedia.org/r/1134287 
[19:48:51] FIRING: [2x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:49:39] FIRING: CoreBGPDown: Core BGP session down between cr3-ulsfo and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://grafana.wikimedia.org/d/ed8da087-4bcb-407d-9596-d158b8145d45/bgp-neighbors-detail?orgId=1&var-site=ulsfo&var-device=cr3-ulsfo:9804&var-bgp_group=Confed_eqord&var-bgp_neighbor=cr2-eqord - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:53:51] FIRING: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [19:54:39] FIRING: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [19:57:56] (03CR) 10Bking: "I think this needs an updated BUILD_VERSION in debian/rules and debian/changelog?" 
[software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1134285 (owner: 10Ebernhardson) [20:03:51] RESOLVED: [4x] CoreRouterInterfaceDown: Core router interface down - cr2-codfw:xe-1/0/1:1 (Transport: cr2-eqord:xe-0/1/0 (Arelion, IC-314534 29ms 10Gbps wave) {#10694_12249-2}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [20:04:39] RESOLVED: [4x] CoreBGPDown: Core BGP session down between cr2-codfw and cr2-eqord (208.80.154.198) - group Confed_eqord - https://wikitech.wikimedia.org/wiki/Network_monitoring#BGP_status - https://alerts.wikimedia.org/?q=alertname%3DCoreBGPDown [20:07:30] (03CR) 10Btullis: Ceph: add types for S3 credential and account (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1133916 (https://phabricator.wikimedia.org/T378922) (owner: 10Jelto) [20:22:37] !log starting `nodetool garbage collect -j 2`, sessionstore Cassandra [20:22:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:18] (03PS2) 10Ebernhardson: Bump ltr plugin to 1.5.4-wmf1-os1.3.20 [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1134285 [20:24:27] (03PS1) 10JHathaway: puppetmaster tests: remove resolving www.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1134289 [20:24:41] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134289 (owner: 10JHathaway) [20:25:58] (03PS2) 10JHathaway: puppetmaster tests: remove resolving www.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1134289 [20:26:05] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134289 (owner: 10JHathaway) [20:29:20] (03PS3) 10Ebernhardson: Bump ltr plugin to 1.5.4-wmf1-os1.3.20 [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1134285 [20:30:42] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - 
https://phabricator.wikimedia.org/T390778#10714406 (10phaultfinder) [20:31:40] (03CR) 10Ebernhardson: "I eternally forget to update those...even with the extra rule in prepare_commit that checks if it was already released. Thanks for the rme" [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1134285 (owner: 10Ebernhardson) [20:33:06] (03PS3) 10JHathaway: puppetmaster tests: remove resolving www.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1134289 [20:33:12] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134289 (owner: 10JHathaway) [20:35:51] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Allow users in spiderpig-access LDAP group [puppet] - 10https://gerrit.wikimedia.org/r/1134291 (https://phabricator.wikimedia.org/T383947) [20:36:57] (03PS4) 10JHathaway: puppetmaster tests: remove resolving www.wikipedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1134289 [20:37:04] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1134289 (owner: 10JHathaway) [20:37:12] FIRING: [3x] CirrusSearchSaneitizerFixRateTooHigh: MediaWiki CirrusSearch Saneitizer is fixing an abnormally high number of documents in cloudelastic - https://wikitech.wikimedia.org/wiki/Search/CirrusStreamingUpdater#San(e)itizing - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchSaneitizerFixRateTooHigh [20:37:25] FIRING: [2x] SystemdUnitFailed: httpbb_kubernetes_mw-jobrunner_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [21:03:11] (03CR) 10Bking: [C:03+2] Bump ltr plugin to 1.5.4-wmf1-os1.3.20 [software/opensearch/plugins] - 10https://gerrit.wikimedia.org/r/1134285 (owner: 10Ebernhardson) [21:07:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:10:06] (03CR) 10Thcipriani: [C:03+1] scap.cfg.erb: Allow users in spiderpig-access LDAP group [puppet] - 10https://gerrit.wikimedia.org/r/1134291 (https://phabricator.wikimedia.org/T383947) (owner: 10Ahmon Dancy) [21:12:03] FIRING: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [21:12:44] FIRING: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:13:57] (03PS1) 10Ahmon Dancy: idp: spiderpig: Add spiderpig-access to required_groups [puppet] - 10https://gerrit.wikimedia.org/r/1134292 (https://phabricator.wikimedia.org/T383947) [21:17:19] (03PS2) 10Ahmon Dancy: idp: spiderpig: Add spiderpig-access to required_groups [puppet] - 10https://gerrit.wikimedia.org/r/1134292 (https://phabricator.wikimedia.org/T383947) [21:17:19] (03PS2) 10Ahmon Dancy: scap.cfg.erb: Allow users in spiderpig-access LDAP group [puppet] - 10https://gerrit.wikimedia.org/r/1134291 (https://phabricator.wikimedia.org/T383947) [21:17:44] RESOLVED: [2x] RipeAtlasAnchorUnreachable: ipv6 ping to magru RIPE Atlas anchor: failures over threshold for measurement 95140317 - https://wikitech.wikimedia.org/wiki/Network_monitoring#Atlas_alerts - https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas?orgId=1 - https://alerts.wikimedia.org/?q=alertname%3DRipeAtlasAnchorUnreachable [21:18:28] (03CR) 10Thcipriani: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [puppet] - 10https://gerrit.wikimedia.org/r/1134292 (https://phabricator.wikimedia.org/T383947) (owner: 10Ahmon Dancy) 
[21:18:31] !log bking@apt1002 publish-wmf-opensearch-search-plugins_1.3.20-4 to component/opensearch13 bullseye-wikimedia 1134285 [21:18:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:15] (03CR) 10Thcipriani: idp: spiderpig: Add spiderpig-access to required_groups [puppet] - 10https://gerrit.wikimedia.org/r/1134292 (https://phabricator.wikimedia.org/T383947) (owner: 10Ahmon Dancy) [21:29:36] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10714500 (10phaultfinder) [21:31:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:32:03] RESOLVED: [2x] DatasourceNoData: - https://alerts.wikimedia.org/?q=alertname%3DDatasourceNoData [21:59:38] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b4-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390922#10714580 (10phaultfinder) [22:25:43] 10ops-eqiad, 06SRE, 06DC-Ops: Alert for device ps1-b7-eqiad.mgmt.eqiad.wmnet - PDU sensor over limit - https://phabricator.wikimedia.org/T390778#10714646 (10phaultfinder) [22:36:39] RESOLVED: CirrusSearchHighOldGCFrequency: Elasticsearch instance elastic1070-production-search-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search/Elasticsearch_Administration#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:42:51] FIRING: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:et-0/0/0 (Core: cr4-ulsfo:et-0/0/0 {#1073}) - 
https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:47:51] RESOLVED: CoreRouterInterfaceDown: Core router interface down - cr3-ulsfo:et-0/0/0 (Core: cr4-ulsfo:et-0/0/0 {#1073}) - https://wikitech.wikimedia.org/wiki/Network_monitoring#Router_interface_down - https://grafana.wikimedia.org/d/fb403d62-5f03-434a-9dff-bd02b9fff504/network-device-overview?var-instance=cr3-ulsfo:9804 - https://alerts.wikimedia.org/?q=alertname%3DCoreRouterInterfaceDown [22:55:42] FIRING: JobUnavailable: Reduced availability for job mjolnir in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:12:12] FIRING: SystemdUnitFailed: opensearch_1@relforge-eqiad-small-alpha.service\x2copensearch_1@relforge-eqiad.service on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:29:34] (03PS1) 10C. Scott Ananian: Improve GeoCrumbs fallback when page property is not (yet) set [extensions/GeoCrumbs] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134309 (https://phabricator.wikimedia.org/T391128) [23:29:50] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, April 07 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [extensions/GeoCrumbs] (wmf/1.44.0-wmf.23) - 10https://gerrit.wikimedia.org/r/1134309 (https://phabricator.wikimedia.org/T391128) (owner: 10C. 
Scott Ananian) [23:41:18] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1134310 [23:41:18] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1134310 (owner: 10TrainBranchBot) [23:42:12] FIRING: SystemdUnitFailed: waterlines.service on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:53:31] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1134310 (owner: 10TrainBranchBot)