[00:03:57] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1041790 (owner: 10TrainBranchBot) [00:07:48] (03PS1) 10BryanDavis: [DNM] Testing things in Gerrit UI [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041168 [00:08:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T364069)', diff saved to https://phabricator.wikimedia.org/P64648 and previous config saved to /var/cache/conftool/dbconfig/20240612-000825-marostegui.json [00:08:31] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [00:09:25] FIRING: [10x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:10:57] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [00:23:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P64649 and previous config saved to /var/cache/conftool/dbconfig/20240612-002332-marostegui.json [00:38:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P64650 and previous config saved to /var/cache/conftool/dbconfig/20240612-003840-marostegui.json [00:42:02] (03PS3) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [00:53:05] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_codfw 
[00:53:40] FIRING: [2x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2088:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:53:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T364069)', diff saved to https://phabricator.wikimedia.org/P64651 and previous config saved to /var/cache/conftool/dbconfig/20240612-005347-marostegui.json [00:53:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [00:53:52] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [00:54:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2168.codfw.wmnet with reason: Maintenance [00:54:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2168 (T364069)', diff saved to https://phabricator.wikimedia.org/P64652 and previous config saved to /var/cache/conftool/dbconfig/20240612-005420-marostegui.json [01:14:25] FIRING: [10x] SystemdUnitFailed: httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:16:45] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [01:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [01:48:46] 10ops-eqiad, 06DC-Ops: PowerSupplyFailure 
- https://phabricator.wikimedia.org/T367253 (10phaultfinder) 03NEW [02:06:45] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [02:09:25] FIRING: [10x] SystemdUnitFailed: httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:11:01] (03PS2) 10Jdlrobson: Don't squish images in non-responsive skins e.g. Vector 2010 [core] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1041674 (https://phabricator.wikimedia.org/T113101) [02:14:22] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9882865 (10KFrancis) @Dzahn Thanks! As soon as I have @AudreyPenven_WMDE 's email address, I'll get this processed. [02:29:24] (03PS1) 10David Martin: Add wikilambda_zobject_join to puppet script for sqooping Wikifunctions tables [puppet] - 10https://gerrit.wikimedia.org/r/1041817 (https://phabricator.wikimedia.org/T363435) [02:31:50] (03CR) 10David Martin: "Marking as WIP because the table-creation patch (in WikiLambda) has not merged yet." 
[puppet] - 10https://gerrit.wikimedia.org/r/1041817 (https://phabricator.wikimedia.org/T363435) (owner: 10David Martin) [02:35:46] FIRING: ProbeDown: Service ganeti1019:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:38:45] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:49:30] FIRING: [2x] ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:54:30] RESOLVED: [2x] ProbeDown: Service wdqs1021:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1021:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:58:45] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:04:25] FIRING: [9x] SystemdUnitFailed: rsyslog-imfile-remedy.service on kubernetes1048:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:37:30] FIRING: 
AppserversUnreachable: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [03:49:46] FIRING: Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [04:02:43] PROBLEM - Disk space on thanos-be1004 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sde1 186987 MB (4% inode=92%): /srv/swift-storage/sdc1 186524 MB (4% inode=91%): /srv/swift-storage/sdh1 199628 MB (5% inode=91%): /srv/swift-storage/sdd1 177181 MB (4% inode=91%): /srv/swift-storage/sdf1 165649 MB (4% inode=91%): /srv/swift-storage/sdg1 227268 MB (5% inode=92%): /srv/swift-storage/sdi1 154752 MB (4% inode=91%): /srv/swift-s [04:02:43] j1 175672 MB (4% inode=91%): /srv/swift-storage/sdl1 211901 MB (5% inode=91%): /srv/swift-storage/sdk1 186368 MB (4% inode=92%): /srv/swift-storage/sdm1 184814 MB (4% inode=92%): /srv/swift-storage/sdn1 150814 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1004&var-datasource=eqiad+prometheus/ops [04:03:25] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2088:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:12:21] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 213, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:16:25] PROBLEM - Disk space on thanos-be2002 is CRITICAL: DISK 
CRITICAL - free space: /srv/swift-storage/sdf1 176337 MB (4% inode=91%): /srv/swift-storage/sdc1 165684 MB (4% inode=92%): /srv/swift-storage/sdd1 179372 MB (4% inode=92%): /srv/swift-storage/sdg1 148007 MB (3% inode=90%): /srv/swift-storage/sdi1 186074 MB (4% inode=92%): /srv/swift-storage/sde1 173826 MB (4% inode=92%): /srv/swift-storage/sdj1 184538 MB (4% inode=91%): /srv/swift-s [04:16:25] h1 189267 MB (4% inode=91%): /srv/swift-storage/sdk1 183630 MB (4% inode=92%): /srv/swift-storage/sdl1 158093 MB (4% inode=91%): /srv/swift-storage/sdn1 201255 MB (5% inode=92%): /srv/swift-storage/sdm1 186929 MB (4% inode=92%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be2002&var-datasource=codfw+prometheus/ops [04:22:17] PROBLEM - Disk space on thanos-be1003 is CRITICAL: DISK CRITICAL - free space: /srv/swift-storage/sdh1 155505 MB (4% inode=90%): /srv/swift-storage/sdc1 194675 MB (5% inode=92%): /srv/swift-storage/sdf1 208330 MB (5% inode=91%): /srv/swift-storage/sdg1 194736 MB (5% inode=92%): /srv/swift-storage/sdd1 174381 MB (4% inode=92%): /srv/swift-storage/sde1 184297 MB (4% inode=91%): /srv/swift-storage/sdi1 163160 MB (4% inode=91%): /srv/swift-s [04:22:17] k1 160791 MB (4% inode=91%): /srv/swift-storage/sdj1 165704 MB (4% inode=91%): /srv/swift-storage/sdl1 156907 MB (4% inode=91%): /srv/swift-storage/sdm1 178787 MB (4% inode=92%): /srv/swift-storage/sdn1 149278 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=thanos-be1003&var-datasource=eqiad+prometheus/ops [05:03:57] (03PS1) 10Gerrit maintenance bot: mariadb: Promote db2129 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1041875 (https://phabricator.wikimedia.org/T367262) [05:09:25] FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:11:57] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:12:25] (03PS1) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041879 (https://phabricator.wikimedia.org/T349774) [05:14:52] (03CR) 10DDesouza: [C:03+2] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041879 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [05:15:55] (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041879 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [05:16:17] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [05:16:38] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [05:16:39] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [05:17:13] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [05:17:14] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [05:17:45] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [05:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [05:48:17] (03CR) 10Giuseppe Lavagetto: [C:03+2] mw-debug: protect debug endpoints with a password [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041593 (owner: 
10Giuseppe Lavagetto) [05:49:07] (03Merged) 10jenkins-bot: mw-debug: protect debug endpoints with a password [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041593 (owner: 10Giuseppe Lavagetto) [05:51:48] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [05:51:58] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [05:56:23] (03PS1) 10Giuseppe Lavagetto: mw-debug: use absolute path in php include [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041898 [05:56:54] (03PS1) 10KartikMistry: Content Translation: Set MT threshold 85% in the Portuguese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041899 (https://phabricator.wikimedia.org/T356356) [05:57:16] (03CR) 10Giuseppe Lavagetto: [C:03+2] mw-debug: use absolute path in php include [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041898 (owner: 10Giuseppe Lavagetto) [05:58:24] (03Merged) 10jenkins-bot: mw-debug: use absolute path in php include [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041898 (owner: 10Giuseppe Lavagetto) [05:58:42] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [05:58:45] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [05:59:18] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [05:59:26] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T0600) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] 
RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:25] FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-wikifunctions_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:09:52] (03PS1) 10Giuseppe Lavagetto: mediawiki: restart deployment on env changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041911 [06:11:33] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:11:57] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [06:16:22] (03PS2) 10Giuseppe Lavagetto: mediawiki: restart deployment on env changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041911 [06:17:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T364069)', diff saved to https://phabricator.wikimedia.org/P64653 and previous config saved to /var/cache/conftool/dbconfig/20240612-061718-marostegui.json [06:17:23] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [06:24:36] (03CR) 10Giuseppe Lavagetto: [C:03+2] mediawiki: restart deployment on env changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041911 (owner: 10Giuseppe Lavagetto) [06:25:40] <_joe_> jouncebot: next [06:25:41] 
In 0 hour(s) and 34 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T0700) [06:26:13] (03Merged) 10jenkins-bot: mediawiki: restart deployment on env changes [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041911 (owner: 10Giuseppe Lavagetto) [06:32:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P64654 and previous config saved to /var/cache/conftool/dbconfig/20240612-063225-marostegui.json [06:33:21] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:33:21] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:34:27] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:35:46] FIRING: ProbeDown: Service ganeti1019:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [06:37:11] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52066 bytes in 0.202 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:37:11] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.315 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:37:17] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:37:22] (03CR) 10Hashar: [C:03+2] "Lets give it a try given that seems to work from the browser console and that is how I wrote and tested the code previously." [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1041243 (https://phabricator.wikimedia.org/T360550) (owner: 10Hashar) [06:38:03] (03Merged) 10jenkins-bot: wm-zuul-status: fix reload button [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1041243 (https://phabricator.wikimedia.org/T360550) (owner: 10Hashar) [06:38:44] !log hashar@deploy1002 Started deploy [gerrit/gerrit@69984f7]: wm-zuul-status: fix reload button - T360550 [06:38:48] T360550: Gerrit 3.7.8: CI has completed checks. Reload the change view? RELOAD button doesn't work - https://phabricator.wikimedia.org/T360550 [06:38:51] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@69984f7]: wm-zuul-status: fix reload button - T360550 (duration: 00m 07s) [06:40:13] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [06:40:34] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [06:41:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s6 T367262 [06:41:43] T367262: Switchover s6 master (db2214 -> db2129) - https://phabricator.wikimedia.org/T367262 [06:42:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2129 with weight 0 T367262', diff saved to https://phabricator.wikimedia.org/P64655 and previous config saved to /var/cache/conftool/dbconfig/20240612-064200-root.json [06:42:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s6 T367262 [06:42:26] (03CR) 10Marostegui: [C:03+2] mariadb: Promote db2129 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/1041875 (https://phabricator.wikimedia.org/T367262) 
(owner: 10Gerrit maintenance bot) [06:43:58] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [06:44:19] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [06:47:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168', diff saved to https://phabricator.wikimedia.org/P64656 and previous config saved to /var/cache/conftool/dbconfig/20240612-064733-marostegui.json [06:53:17] (03PS1) 10Marostegui: db2214: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1042046 [06:53:53] (03CR) 10Marostegui: [C:03+2] db2214: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1042046 (owner: 10Marostegui) [06:54:07] 10ops-eqiad, 06SRE, 06DC-Ops, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9883186 (10MoritzMuehlenhoff) >>! In T367071#9882394, @Jclark-ctr wrote: > @MoritzMuehlenhoff after replacing failed drive looked like it might boot but still fails. M... 
[06:54:18] !log rebalance ganeti clusters in codfw following reboots [06:54:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:54:49] (03PS2) 10Giuseppe Lavagetto: mw-debug: add general values to the statsd releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041673 [06:55:29] !log remove ganeti1019 from eqiad cluster T367071 [06:55:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:33] T367071: ganeti1019 is down - https://phabricator.wikimedia.org/T367071 [06:58:21] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1019.eqiad.wmnet with OS bullseye [06:58:29] 10ops-eqiad, 06SRE, 06DC-Ops, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9883190 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1019.eqiad.wmnet with OS bullseye [06:58:45] RESOLVED: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:59:51] (03CR) 10Giuseppe Lavagetto: [C:03+2] mw-debug: add general values to the statsd releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041673 (owner: 10Giuseppe Lavagetto) [07:00:05] Amir1 and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T0700). [07:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:00:22] \o/ [07:00:32] I'll deploy my patch. 
[07:00:41] (03Merged) 10jenkins-bot: mw-debug: add general values to the statsd releases [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041673 (owner: 10Giuseppe Lavagetto) [07:01:02] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041899 (https://phabricator.wikimedia.org/T356356) (owner: 10KartikMistry) [07:01:41] (03Merged) 10jenkins-bot: Content Translation: Set MT threshold 85% in the Portuguese Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041899 (https://phabricator.wikimedia.org/T356356) (owner: 10KartikMistry) [07:02:18] (03CR) 10Brouberol: deployment_server: alert on admin-ng pending changes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [07:02:37] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [07:02:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168 (T364069)', diff saved to https://phabricator.wikimedia.org/P64657 and previous config saved to /var/cache/conftool/dbconfig/20240612-070240-marostegui.json [07:02:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [07:02:44] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [07:02:48] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1041899|Content Translation: Set MT threshold 85% in the Portuguese Wikipedia (T356356)]] [07:02:49] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [07:02:52] T356356: Set the threshold of translation to 85% in the Portuguese Wikipedia. 
- https://phabricator.wikimedia.org/T356356 [07:02:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2182.codfw.wmnet with reason: Maintenance [07:02:59] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [07:03:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T364069)', diff saved to https://phabricator.wikimedia.org/P64658 and previous config saved to /var/cache/conftool/dbconfig/20240612-070302-marostegui.json [07:03:09] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [07:03:31] RECOVERY - SSH on ganeti1019 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [07:03:58] (03PS4) 10Giuseppe Lavagetto: mw-debug: remove vintage setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039780 [07:04:40] !log Starting s6 codfw failover from db2214 to db2129 - T367262 [07:04:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:04:44] T367262: Switchover s6 master (db2214 -> db2129) - https://phabricator.wikimedia.org/T367262 [07:05:24] !log kartik@deploy1002 kartik: Backport for [[gerrit:1041899|Content Translation: Set MT threshold 85% in the Portuguese Wikipedia (T356356)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:06:27] !log kartik@deploy1002 kartik: Continuing with sync [07:07:36] (03CR) 10Giuseppe Lavagetto: [C:03+2] mw-debug: remove vintage setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039780 (owner: 10Giuseppe Lavagetto) [07:08:24] (03Merged) 10jenkins-bot: mw-debug: remove vintage setting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039780 (owner: 10Giuseppe Lavagetto) [07:11:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2129 to s6 primary T367262', diff saved to https://phabricator.wikimedia.org/P64659 and previous config saved to 
/var/cache/conftool/dbconfig/20240612-071158-root.json [07:12:03] T367262: Switchover s6 master (db2214 -> db2129) - https://phabricator.wikimedia.org/T367262 [07:13:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2214 T367262', diff saved to https://phabricator.wikimedia.org/P64660 and previous config saved to /var/cache/conftool/dbconfig/20240612-071340-root.json [07:14:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5004.eqsin.wmnet [07:14:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on db2214.codfw.wmnet with reason: Long schema change [07:14:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db2214.codfw.wmnet with reason: Long schema change [07:14:25] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [07:14:29] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [07:14:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2214.codfw.wmnet with reason: Long schema change [07:14:39] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5004.eqsin.wmnet [07:14:40] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2214.codfw.wmnet with reason: Long schema change [07:15:13] (03PS2) 10Giuseppe Lavagetto: Allow running CI in a container when using rootless podman [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040218 [07:15:59] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1041899|Content Translation: Set MT threshold 85% in the Portuguese Wikipedia (T356356)]] (duration: 13m 11s) [07:16:03] T356356: Set the threshold of translation to 85% in the Portuguese Wikipedia. 
- https://phabricator.wikimedia.org/T356356 [07:16:19] Done with my config patch ^^ [07:18:43] (03PS2) 10Giuseppe Lavagetto: Use the statsd-exporter service where available [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041656 (https://phabricator.wikimedia.org/T365265) [07:19:16] (03PS10) 10Slyngshede: Fix bug where SSH keys are imported incorrectly. [software/bitu] - 10https://gerrit.wikimedia.org/r/1038778 (https://phabricator.wikimedia.org/T366525) [07:19:34] (03CR) 10Giuseppe Lavagetto: Use the statsd-exporter service where available (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041656 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [07:20:11] !log dbmaint optimize pagelinks on old s6 codfw master db2214 T364069 [07:20:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:20:15] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [07:21:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5004.eqsin.wmnet [07:21:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5004.eqsin.wmnet [07:23:45] RESOLVED: ProbeDown: Service ganeti5004:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [07:23:49] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5005.eqsin.wmnet [07:26:07] (03CR) 10Giuseppe Lavagetto: [C:03+1] Move etcd.php from wmf-config/ to src/ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/891733 (https://phabricator.wikimedia.org/T308932) (owner: 10Krinkle) [07:31:42] (03PS3) 10Hashar: gerrit: set changes_by_project in cache [puppet] - 10https://gerrit.wikimedia.org/r/1040567 (owner: 10Paladox) [07:34:47] 
(03CR) 10Hashar: [C:03+1] "I have slightly adjusted the commit message." [puppet] - 10https://gerrit.wikimedia.org/r/1040567 (owner: 10Paladox) [07:36:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5005.eqsin.wmnet [07:36:18] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host ganeti1019.eqiad.wmnet with OS bullseye [07:36:36] (03PS1) 10David Caro: wmf_sink.base: ignore also any saved host key [puppet] - 10https://gerrit.wikimedia.org/r/1042128 (https://phabricator.wikimedia.org/T367235) [07:36:52] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ganeti1019.eqiad.wmnet with OS bullseye [07:36:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9883255 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ganeti1019.eqiad.wmnet with OS bullseye [07:37:30] FIRING: AppserversUnreachable: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [07:39:55] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f6-eqiad - https://phabricator.wikimedia.org/T365983#9883256 (10ABran-WMF) [07:40:38] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9883270 (10ABran-WMF) [07:40:59] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e5-eqiad - https://phabricator.wikimedia.org/T365986#9883271 (10ABran-WMF) [07:41:36] 06SRE, 10SRE-swift-storage, 
Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e6-eqiad - https://phabricator.wikimedia.org/T365987#9883272 (ABran-WMF)
[07:41:48] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e7-eqiad - https://phabricator.wikimedia.org/T365988#9883273 (ABran-WMF)
[07:42:39] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5005.eqsin.wmnet
[07:42:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5005.eqsin.wmnet
[07:49:46] FIRING: Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25
[07:50:06] (PS1) Slyngshede: Offboarding mttp [puppet] - https://gerrit.wikimedia.org/r/1042136
[08:03:40] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2088:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:05:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[08:05:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[08:09:55] !log jmm@cumin2002 START - Cookbook sre.ganeti.resource-report
[08:09:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.resource-report (exit_code=0)
[08:11:32] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[08:11:45] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[08:11:47] !log
ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:11:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[08:11:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T352010)', diff saved to https://phabricator.wikimedia.org/P64661 and previous config saved to /var/cache/conftool/dbconfig/20240612-081158-ladsgroup.json
[08:12:03] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[08:12:17] (CR) Muehlenhoff: Offboarding mttp (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1042136 (owner: Slyngshede)
[08:12:30] !log brouberol@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)
[08:12:35] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069
[08:12:42] !log brouberol@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)
[08:13:18] (PS25) DCausse: wdqs.data-reload: various fixes [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: Ryan Kemper)
[08:14:28] (PS2) Slyngshede: Offboarding mttp [puppet] - https://gerrit.wikimedia.org/r/1042136
[08:14:58] (CR) Slyngshede: Offboarding mttp (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1042136 (owner: Slyngshede)
[08:14:58] !log brouberol@cumin2002 START - Cookbook sre.wdqs.data-reload reloading
wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)
[08:15:18] !log brouberol@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)
[08:15:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[08:15:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[08:15:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T367261)', diff saved to https://phabricator.wikimedia.org/P64662 and previous config saved to /var/cache/conftool/dbconfig/20240612-081551-marostegui.json
[08:15:55] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[08:16:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2123 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P64663 and previous config saved to /var/cache/conftool/dbconfig/20240612-081643-root.json
[08:17:54] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host ganeti1019.eqiad.wmnet with OS bullseye
[08:17:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5006.eqsin.wmnet
[08:19:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P64664 and previous config saved to /var/cache/conftool/dbconfig/20240612-081918-ladsgroup.json
[08:19:45] (CR) Filippo Giunchedi: [C:+2] logstash: align benthos mw-accesslog-sampler consumer group [puppet] - https://gerrit.wikimedia.org/r/1041155 (https://phabricator.wikimedia.org/T366308) (owner: Filippo Giunchedi)
[08:21:47]
(CR) Filippo Giunchedi: "Tested locally with podman and ROOTLESS_PODMAN=1 rake run_locally and I can confirm that it works!" [deployment-charts] - https://gerrit.wikimedia.org/r/1040218 (owner: Giuseppe Lavagetto)
[08:22:13] (CR) Clément Goubert: [C:+1] mediawiki: Move outbound email to mx-out{1001,2001}.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1041758 (https://phabricator.wikimedia.org/T365395) (owner: JHathaway)
[08:22:31] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2214.codfw.wmnet with reason: Maintenance
[08:22:33] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2214.codfw.wmnet with reason: Maintenance
[08:23:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P64665 and previous config saved to /var/cache/conftool/dbconfig/20240612-082318-ladsgroup.json
[08:23:40] (PS26) DCausse: wdqs.data-reload: various fixes [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: Ryan Kemper)
[08:24:03] !log brouberol@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)
[08:24:07] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069
[08:24:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T367261)', diff saved to https://phabricator.wikimedia.org/P64666 and previous config saved to /var/cache/conftool/dbconfig/20240612-082415-marostegui.json
[08:24:19] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[08:24:59] jouncebot: now
[08:24:59] No deployments scheduled
for the next 1 hour(s) and 35 minute(s)
[08:25:02] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1002
[08:25:06] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl1002
[08:25:16] I’ll run two short maintenance scripts in a moment if nobody objects :)
[08:25:33] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1002.eqiad.wmnet with OS bullseye
[08:25:40] (CR) Volans: "sorry for the intrusion, did a quick pass as I saw it passing by" [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: Ryan Kemper)
[08:25:45] ops-eqiad, SRE-OnFire, DC-Ops, serviceops, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9883340 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1002.eq...
[08:26:23] !log start rebooting all cp-upload_codfw hosts for T366555 (spaced 1.5 hrs)
[08:26:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:26:26] !log fabfur@cumin1002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_codfw
[08:27:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2123', diff saved to https://phabricator.wikimedia.org/P64667 and previous config saved to /var/cache/conftool/dbconfig/20240612-082702-marostegui.json
[08:27:34] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance
[08:27:47] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2123.codfw.wmnet with reason: Maintenance
[08:27:54] (PS1) Majavah: openstack: designate: Fix floating IP updater for UUID project IDs [puppet] - https://gerrit.wikimedia.org/r/1042148 (https://phabricator.wikimedia.org/T367268)
[08:28:07] !log brouberol@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)
[08:29:05] (CR) Muehlenhoff: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1042136 (owner: Slyngshede)
[08:29:34] (CR) Slyngshede: [C:+2] Offboarding mttp [puppet] - https://gerrit.wikimedia.org/r/1042136 (owner: Slyngshede)
[08:29:57] (CR) David Caro: [C:+1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1042148 (https://phabricator.wikimedia.org/T367268) (owner: Majavah)
[08:30:16] (CR) Majavah: [C:+2] openstack: designate: Fix floating IP updater for UUID project IDs [puppet] - https://gerrit.wikimedia.org/r/1042148 (https://phabricator.wikimedia.org/T367268) (owner: Majavah)
[08:30:34] (PS1) Hashar: wm-patch-demo: silently ignore errors [software/gerrit] (deploy/wmf/stable-3.9) -
https://gerrit.wikimedia.org/r/1042153 (https://phabricator.wikimedia.org/T367155)
[08:34:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 50%: Maint over', diff saved to https://phabricator.wikimedia.org/P64668 and previous config saved to /var/cache/conftool/dbconfig/20240612-083424-ladsgroup.json
[08:34:28] (CR) Hashar: "Whenever the patchdemo service fails, that causes a red chipset to be displayed on any change which is a bit confusing. There are few rea" [software/gerrit] (deploy/wmf/stable-3.9) - https://gerrit.wikimedia.org/r/1042153 (https://phabricator.wikimedia.org/T367155) (owner: Hashar)
[08:35:34] !log lucaswerkmeister-wmde@deploy1002 ~ $ mwscript-k8s --comment 'T367174, P12583' extensions/Wikibase/repo/maintenance/changePropertyDataType.php wikidatawiki -- --property-id P12583 --new-data-type external-id --summary '[[phabricator:T367174|T367174]]' # succeeded
[08:35:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:38] T367174: Change Property datatypes from String to External Identifier for P12583 and P12703 - https://phabricator.wikimedia.org/T367174
[08:35:46] FIRING: [2x] JobUnavailable: Reduced availability for job benthos in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:36:05] SRE: Download of Azure cloud ranges for requestctl is broken - https://phabricator.wikimedia.org/T367269 (Joe) NEW
[08:36:19] !log lucaswerkmeister-wmde@deploy1002 ~ $ mwscript-k8s --comment 'T367174, P12703' extensions/Wikibase/repo/maintenance/changePropertyDataType.php wikidatawiki -- --property-id P12703 --new-data-type external-id --summary '[[phabricator:T367174|T367174]]' # succeeded
[08:36:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:36:33] SRE: Download of Azure cloud ranges for
requestctl is broken - https://phabricator.wikimedia.org/T367269#9883366 (Joe) p:Triage→Unbreak!
[08:36:55] SRE: Download of Azure cloud ranges for requestctl is broken - https://phabricator.wikimedia.org/T367269#9883369 (Joe)
[08:38:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 50%: Maint over', diff saved to https://phabricator.wikimedia.org/P64669 and previous config saved to /var/cache/conftool/dbconfig/20240612-083824-ladsgroup.json
[08:38:52] !log slyngshede@cumin1002 START - Cookbook sre.idm.logout Logging Mike Pham out of all services on: 2200 hosts
[08:39:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P64670 and previous config saved to /var/cache/conftool/dbconfig/20240612-083923-marostegui.json
[08:39:33] !log kamila@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on wikikube-ctrl1002.eqiad.wmnet with reason: host reimage
[08:39:44] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Mike Pham out of all services on: 2200 hosts
[08:40:46] RESOLVED: [2x] JobUnavailable: Reduced availability for job benthos in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:41:50] that was me btw ^
[08:42:14] !log zabe@mwmaint1002:~$ mwscript refreshImageMetadata.php commonswiki --mime image/webp # T364680
[08:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:42:18] T364680: Run refreshImageMetadata.php --mime image/webp on wikimedia - https://phabricator.wikimedia.org/T364680
[08:42:30] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on wikikube-ctrl1002.eqiad.wmnet with reason: host reimage
[08:43:36] (PS1) Filippo Giunchedi: logstash: move benthos
mw-accesslog-sampler to port 4155 [puppet] - https://gerrit.wikimedia.org/r/1042159 (https://phabricator.wikimedia.org/T366308)
[08:44:47] (CR) Filippo Giunchedi: [C:+2] logstash: move benthos mw-accesslog-sampler to port 4155 [puppet] - https://gerrit.wikimedia.org/r/1042159 (https://phabricator.wikimedia.org/T366308) (owner: Filippo Giunchedi)
[08:45:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5006.eqsin.wmnet
[08:45:30] (PS1) Stevemunene: wdqs graph-split: add final svcs [dns] - https://gerrit.wikimedia.org/r/1042160 (https://phabricator.wikimedia.org/T364364)
[08:47:09] (PS1) Clément Goubert: dump_cloud_ip_ranges: This is owned by all of SRE [puppet] - https://gerrit.wikimedia.org/r/1042161
[08:47:41] (PS1) Clément Goubert: mediawiki: Remove legacy parsoid deployment [alerts] - https://gerrit.wikimedia.org/r/1042162 (https://phabricator.wikimedia.org/T357392)
[08:49:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P64671 and previous config saved to /var/cache/conftool/dbconfig/20240612-084929-ladsgroup.json
[08:49:52] (PS2) Jcrespo: dbbackups: Remove all production references to db2102 [puppet] - https://gerrit.wikimedia.org/r/1040117 (https://phabricator.wikimedia.org/T366892)
[08:49:52] (PS1) Jcrespo: dbbackups: Stop backing up es4 and es5 hosts [puppet] - https://gerrit.wikimedia.org/r/1042163 (https://phabricator.wikimedia.org/T363812)
[08:51:02] (CR) Jcrespo: "FYI" [puppet] - https://gerrit.wikimedia.org/r/1042163 (https://phabricator.wikimedia.org/T363812) (owner: Jcrespo)
[08:51:28] (CR) Marostegui: [C:+1] "thanks" [puppet] - https://gerrit.wikimedia.org/r/1042163 (https://phabricator.wikimedia.org/T363812) (owner: Jcrespo)
[08:52:03] (CR) Clément Goubert: [C:+1] mw: change mail_host [deployment-charts] - https://gerrit.wikimedia.org/r/1041763
(https://phabricator.wikimedia.org/T365395) (owner: JHathaway)
[08:52:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5006.eqsin.wmnet
[08:52:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5006.eqsin.wmnet
[08:53:29] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P64672 and previous config saved to /var/cache/conftool/dbconfig/20240612-085329-ladsgroup.json
[08:54:28] (PS2) Jcrespo: dbbackups: Stop backing up es4 and es5 hosts [puppet] - https://gerrit.wikimedia.org/r/1042163 (https://phabricator.wikimedia.org/T363812)
[08:54:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P64673 and previous config saved to /var/cache/conftool/dbconfig/20240612-085430-marostegui.json
[08:55:06] (CR) Arnaudb: [C:+1] dbbackups: Stop backing up es4 and es5 hosts [puppet] - https://gerrit.wikimedia.org/r/1042163 (https://phabricator.wikimedia.org/T363812) (owner: Jcrespo)
[08:57:48] (PS1) Muehlenhoff: Stop sending cross check mails to sre-foundations [puppet] - https://gerrit.wikimedia.org/r/1042164
[08:59:49] (PS1) Marostegui: s5-pager.sql: Remove from repo [software] - https://gerrit.wikimedia.org/r/1042166
[09:00:11] (CR) JMeybohm: [C:+2] Remove deprecated uses_ingress option from service-proxy [puppet] - https://gerrit.wikimedia.org/r/1041644 (https://phabricator.wikimedia.org/T346638) (owner: JMeybohm)
[09:00:21] (CR) Alexandros Kosiaris: [C:+2] mediawiki: Remove legacy parsoid deployment [alerts] - https://gerrit.wikimedia.org/r/1042162 (https://phabricator.wikimedia.org/T357392) (owner: Clément Goubert)
[09:00:32] (CR) Jcrespo: [C:+2] dbbackups: Stop backing up es4 and es5 hosts [puppet] - https://gerrit.wikimedia.org/r/1042163
(https://phabricator.wikimedia.org/T363812) (owner: Jcrespo)
[09:01:32] (Merged) jenkins-bot: mediawiki: Remove legacy parsoid deployment [alerts] - https://gerrit.wikimedia.org/r/1042162 (https://phabricator.wikimedia.org/T357392) (owner: Clément Goubert)
[09:01:42] (CR) Marostegui: [C:+2] s5-pager.sql: Remove from repo [software] - https://gerrit.wikimedia.org/r/1042166 (owner: Marostegui)
[09:01:58] (CR) JMeybohm: [C:+2] "I think they're not counted as error because the deployments fail to render both in HEAD and this change. But tbh. I did not really check." [deployment-charts] - https://gerrit.wikimedia.org/r/1041646 (https://phabricator.wikimedia.org/T346638) (owner: JMeybohm)
[09:02:07] (Merged) jenkins-bot: s5-pager.sql: Remove from repo [software] - https://gerrit.wikimedia.org/r/1042166 (owner: Marostegui)
[09:02:10] (PS1) Filippo Giunchedi: benthos: fix mw_accesslog_sampler metrics [puppet] - https://gerrit.wikimedia.org/r/1042167 (https://phabricator.wikimedia.org/T366308)
[09:03:24] (CR) JMeybohm: [C:+2] "Done" [deployment-charts] - https://gerrit.wikimedia.org/r/1041646 (https://phabricator.wikimedia.org/T346638) (owner: JMeybohm)
[09:04:14] !log STOPPED lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki enwiki --current --all --touched-after=20240524120000 --start '["55019880"]' 2>&1 | tee -a ~/T315510-enwiki-8; date # Ctrl+C, had become very slow, trying restart
[09:04:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:04:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1223 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P64674 and previous config saved to /var/cache/conftool/dbconfig/20240612-090435-ladsgroup.json
[09:04:51] !log START lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript
extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki enwiki --current --all --touched-after=20240524120000 --start '["55386869"]' 2>&1 | tee -a ~/T315510-enwiki-9; date
[09:04:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:05:34] (CR) Filippo Giunchedi: [C:+2] benthos: fix mw_accesslog_sampler metrics [puppet] - https://gerrit.wikimedia.org/r/1042167 (https://phabricator.wikimedia.org/T366308) (owner: Filippo Giunchedi)
[09:06:56] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host wikikube-ctrl1002.eqiad.wmnet with OS bullseye
[09:07:07] ops-eqiad, SRE-OnFire, DC-Ops, serviceops, Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9883434 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1002.eqiad....
[09:08:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2214 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P64675 and previous config saved to /var/cache/conftool/dbconfig/20240612-090834-ladsgroup.json
[09:09:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T367261)', diff saved to https://phabricator.wikimedia.org/P64676 and previous config saved to /var/cache/conftool/dbconfig/20240612-090937-marostegui.json
[09:09:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[09:09:42] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[09:09:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[09:10:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T367261)', diff saved to https://phabricator.wikimedia.org/P64677 and previous
config saved to /var/cache/conftool/dbconfig/20240612-090959-marostegui.json
[09:10:15] FIRING: [5x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[09:10:26] (PS3) Jcrespo: dbbackups: Remove all production references to db2102 [puppet] - https://gerrit.wikimedia.org/r/1040117 (https://phabricator.wikimedia.org/T366892)
[09:11:15] !log failover ganeti cluster for eqsin to ganeti5004
[09:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:28] (PS1) Giuseppe Lavagetto: fetch_external_clouds_vendors_nets: fix azure [puppet] - https://gerrit.wikimedia.org/r/1042173 (https://phabricator.wikimedia.org/T367269)
[09:14:44] (CR) Clément Goubert: [C:+1] fetch_external_clouds_vendors_nets: fix azure [puppet] - https://gerrit.wikimedia.org/r/1042173 (https://phabricator.wikimedia.org/T367269) (owner: Giuseppe Lavagetto)
[09:14:49] PROBLEM - ganeti-wconfd running on ganeti5007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 115 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti
[09:15:15] FIRING: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[09:17:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T367261)', diff saved to https://phabricator.wikimedia.org/P64678 and previous config saved to /var/cache/conftool/dbconfig/20240612-091724-marostegui.json
[09:17:29] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261
[09:19:39] (CR) Jelto:
[V:+1 C:+2] gitlab: notify ssh-gitlab service when interface aliases are added [puppet] - https://gerrit.wikimedia.org/r/1041636 (https://phabricator.wikimedia.org/T367021) (owner: Jelto)
[09:20:15] RESOLVED: [10x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[09:21:31] (CR) Hnowlan: [C:+1] fetch_external_clouds_vendors_nets: fix azure [puppet] - https://gerrit.wikimedia.org/r/1042173 (https://phabricator.wikimedia.org/T367269) (owner: Giuseppe Lavagetto)
[09:22:23] (CR) JMeybohm: [C:+1] "Hosts: auto PCC run: https://puppet-compiler.wmflabs.org/output/1040992/2894/" [puppet] - https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[09:27:21] (CR) JMeybohm: [C:+2] flink-operator: add securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1041161 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[09:28:35] (PS1) AOkoth: vtrs: upgrade cookbook [cookbooks] - https://gerrit.wikimedia.org/r/1042179 (https://phabricator.wikimedia.org/T366078)
[09:29:33] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e1-eqiad - https://phabricator.wikimedia.org/T365993#9883482 (jcrespo) backup1010 is in intermittent usage to support mediabackups disk space, but mostly idle at the time, so unle...
[09:30:26] (Merged) jenkins-bot: flink-operator: add securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1041161 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[09:30:37] (PS1) Muehlenhoff: Extend access for arora [puppet] - https://gerrit.wikimedia.org/r/1042181
[09:32:05] !log jayme@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[09:32:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P64679 and previous config saved to /var/cache/conftool/dbconfig/20240612-093231-marostegui.json
[09:33:10] !log jayme@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[09:33:43] (CR) Muehlenhoff: [C:+2] Extend access for arora [puppet] - https://gerrit.wikimedia.org/r/1042181 (owner: Muehlenhoff)
[09:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[09:35:09] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-e3-eqiad - https://phabricator.wikimedia.org/T365995#9883497 (jcrespo) backup1009 is the main backup node for bacula on eqiad. Most backups happen during the night- so just monito...
[09:36:12] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f1-eqiad - https://phabricator.wikimedia.org/T365996#9883498 (jcrespo) backup1011 is a mediabackups storage server. Ideally, mediabackups are paused during the maintenance to avoi...
[09:39:41] (PS1) Muehlenhoff: Remove access for springle [puppet] - https://gerrit.wikimedia.org/r/1042184
[09:40:29] (CR) Giuseppe Lavagetto: [C:+2] fetch_external_clouds_vendors_nets: fix azure [puppet] - https://gerrit.wikimedia.org/r/1042173 (https://phabricator.wikimedia.org/T367269) (owner: Giuseppe Lavagetto)
[09:40:59] SRE, SRE-swift-storage, Data-Persistence, DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 -lsw1-f3-eqiad - https://phabricator.wikimedia.org/T365998#9883516 (jcrespo) db1205 is the secondary media backups metadata db server, usually just a standby to db1204. Unless it is the...
[09:43:21] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s2
[09:43:27] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1018.eqiad.wmnet,service=s7
[09:46:00] (CR) Muehlenhoff: [C:+2] Remove access for springle [puppet] - https://gerrit.wikimedia.org/r/1042184 (owner: Muehlenhoff)
[09:47:00] <_joe_> !log running dump_cloud_ip_ranges on puppetmaster1001 to test fixed script
[09:47:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:47:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P64680 and previous config saved to /var/cache/conftool/dbconfig/20240612-094738-marostegui.json
[09:47:58] !log disabling puppet on cp4037 to test benthos configuration
[09:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:27] !log disabling puppet on cp4037 to test benthos configuration (T360454)
[09:48:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:31] T360454: Better Benthos performances - https://phabricator.wikimedia.org/T360454
[09:48:43] FIRING: ProbeDown: Service gitlab2002:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip6) -
https://wikitech.wikimedia.org/wiki/TLS/Runbook#gitlab2002:22 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:49:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:50:19] (PS1) Clément Goubert: image-suggestion: Update image [deployment-charts] - https://gerrit.wikimedia.org/r/1042188 (https://phabricator.wikimedia.org/T362518)
[09:54:56] (CR) Kosta Harlan: "I don't think this is super important, but might be nice to have for consistency." [puppet] - https://gerrit.wikimedia.org/r/1037531 (owner: Kosta Harlan)
[09:55:31] (PS1) Jelto: gitlab: set IPs for SSH blackbox check [puppet] - https://gerrit.wikimedia.org/r/1042191 (https://phabricator.wikimedia.org/T367021)
[09:55:38] (CR) Majavah: [C:+1] Deprecate system::role for Cloud VPS-specific roles [puppet] - https://gerrit.wikimedia.org/r/1040123 (owner: Muehlenhoff)
[09:56:22] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s2
[09:56:26] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1018.eqiad.wmnet,service=s7
[09:58:57] !log zabe@mwmaint1002:~$ foreachwikiindblist 'all - s4' refreshImageMetadata.php --mime image/webp # T364680
[09:59:31] ops-codfw, SRE, DC-Ops, serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9883573 (Clement_Goubert) So sorry I didn't answer earlier. Apart from `mw2282` which has been migrated to k8s, we will decom these hosts so you don't have...
[09:59:57] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2895/" [puppet] - 10https://gerrit.wikimedia.org/r/1042191 (https://phabricator.wikimedia.org/T367021) (owner: 10Jelto) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T1000) [10:00:41] !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [10:01:34] !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [10:01:43] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [10:02:02] (03CR) 10Jelto: [V:03+1 C:03+2] gitlab: set IPs for SSH blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1042191 (https://phabricator.wikimedia.org/T367021) (owner: 10Jelto) [10:02:42] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [10:02:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T367261)', diff saved to https://phabricator.wikimedia.org/P64681 and previous config saved to /var/cache/conftool/dbconfig/20240612-100245-marostegui.json [10:02:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance [10:02:59] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [10:03:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1189.eqiad.wmnet with reason: Maintenance [10:03:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T367261)', diff saved to https://phabricator.wikimedia.org/P64682 and previous config saved to /var/cache/conftool/dbconfig/20240612-100307-marostegui.json [10:03:36] (03CR) 10Filippo Giunchedi: "Good point, indeed that's what happens by default e.g. with Benthos (and can be tweaked when the consumer group is first created). 
I'll ma" [puppet] - 10https://gerrit.wikimedia.org/r/1040170 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi) [10:04:03] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [10:04:10] (03CR) 10JMeybohm: [C:03+2] shellbox: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037615 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [10:04:10] (03CR) 10Filippo Giunchedi: [C:03+1] Use the statsd-exporter service where available [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041656 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [10:05:15] (03PS1) 10Muehlenhoff: Remove access for natalia-rodriguez [puppet] - 10https://gerrit.wikimedia.org/r/1042195 [10:05:22] (03Merged) 10jenkins-bot: shellbox: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037615 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [10:05:26] (03CR) 10CI reject: [V:04-1] Remove access for natalia-rodriguez [puppet] - 10https://gerrit.wikimedia.org/r/1042195 (owner: 10Muehlenhoff) [10:05:33] (03PS1) 10DCausse: wdqs: add wdqs2023 as a scap target [puppet] - 10https://gerrit.wikimedia.org/r/1042196 [10:06:24] !log kamila@cumin1002 conftool action : set/pooled=yes; selector: name=wikikube-ctrl1002.eqiad.wmnet [10:07:44] !log zabe@mwmaint1002:~$ foreachwikiindblist 'all - s4' refreshImageMetadata.php --mime image/webp # T364680 [10:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:47] T364680: Run refreshImageMetadata.php --mime image/webp on wikimedia - https://phabricator.wikimedia.org/T364680 [10:07:50] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply [10:07:53] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply [10:08:00] !log jayme@deploy1002 helmfile [eqiad] START 
helmfile.d/services/machinetranslation: apply [10:08:36] (03PS2) 10Muehlenhoff: Remove access for natalia-rodriguez [puppet] - 10https://gerrit.wikimedia.org/r/1042195 [10:10:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T367261)', diff saved to https://phabricator.wikimedia.org/P64683 and previous config saved to /var/cache/conftool/dbconfig/20240612-101032-marostegui.json [10:10:38] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [10:10:55] !log Depooling mw2281.codfw.wmnet,mw22[83-90].codfw.wmnet for decommission - T367275 [10:10:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:10:58] T367275: decommission mw2281.codfw.wmnet mw22[83-90].codfw.wmnet - https://phabricator.wikimedia.org/T367275 [10:11:25] (03PS1) 10Ladsgroup: override circuit breaking threshold for ES hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042197 [10:12:50] (03CR) 10Muehlenhoff: [C:03+2] Remove access for natalia-rodriguez [puppet] - 10https://gerrit.wikimedia.org/r/1042195 (owner: 10Muehlenhoff) [10:14:06] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply [10:14:12] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9883618 (10AudreyPenven_WMDE) @KFrancis my email is audrey.penven@wikimedia.de [10:14:33] (03CR) 10Alexandros Kosiaris: [C:03+1] "Copied votes on follow-up patch sets have been updated:" [puppet] - 10https://gerrit.wikimedia.org/r/1042173 (https://phabricator.wikimedia.org/T367269) (owner: 10Giuseppe Lavagetto) [10:14:49] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [10:15:25] !log cgoubert@cumin1002 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on 9 hosts with reason: decommissioning [10:15:38] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply [10:15:40] !log 
jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [10:15:41] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [10:15:45] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on 9 hosts with reason: decommissioning [10:16:00] (03CR) 10Hnowlan: [C:03+1] image-suggestion: Update image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042188 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [10:16:02] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9883627 (10Clement_Goubert) [10:16:02] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [10:16:03] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [10:16:17] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [10:16:18] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [10:16:31] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [10:16:32] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [10:16:51] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [10:16:57] (03PS1) 10Clément Goubert: decommission mw2281.codfw mw22[83-90].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1042199 (https://phabricator.wikimedia.org/T367275) [10:17:33] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply [10:18:43] RESOLVED: ProbeDown: Service gitlab2002:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gitlab2002:22 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [10:19:02] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Grants:Community Resources" "Wikimedia Foundation/Advancement/Community Growth/Community Resources" "Zabe" --reason "per request [[:phab:T365837|T365837]]" [10:19:06] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s4 [10:19:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:19:08] T365837: Request to move translatable page: Grants:Community Resources - https://phabricator.wikimedia.org/T365837 [10:19:12] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1019.eqiad.wmnet,service=s6 [10:21:56] PROBLEM - MariaDB Replica IO: s4 on clouddb1019 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:22:04] PROBLEM - MariaDB Replica SQL: s4 on clouddb1019 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:22:04] PROBLEM - MariaDB Replica SQL: s6 on clouddb1019 is CRITICAL: CRITICAL slave_sql_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:22:33] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply [10:22:46] PROBLEM - MariaDB read only s4 on clouddb1019 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:22:47] dhinus: ^ [10:22:56] PROBLEM - MariaDB Replica IO: s6 on clouddb1019 is CRITICAL: CRITICAL slave_io_state could not connect https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:23:14] PROBLEM - 
MariaDB read only wikireplica-s4 on clouddb1019 is CRITICAL: Could not connect to localhost:3314 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:23:19] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [10:23:26] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [10:23:42] !log remove MediaWiki.jawiki.GrowthExperiments.NewcomerTask.update_.* from graphite hosts - T362633 [10:23:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:23:45] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [10:23:45] T362633: Growth team product KPI Grafana dashboard has `update_` task type, which does not exist - https://phabricator.wikimedia.org/T362633 [10:23:51] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [10:23:54] PROBLEM - mysqld processes on clouddb1019 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [10:24:31] jouncebot: now [10:24:31] For the next 0 hour(s) and 35 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T1000) [10:24:34] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [10:24:40] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [10:24:46] PROBLEM - MariaDB read only wikireplica-s6 on clouddb1019 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:24:46] PROBLEM - MariaDB read only s6 on clouddb1019 is CRITICAL: Could not connect to localhost:3316 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:24:49] does anyone mind if I get a head start on the backports I want 
to do in the window later today? the first one should be a no-op [10:24:53] (03PS1) 10Clément Goubert: Remove mw2289.codfw.wmnet from scap::proxies for decom [puppet] - 10https://gerrit.wikimedia.org/r/1042200 (https://phabricator.wikimedia.org/T367275) [10:24:54] (03PS1) 10Clément Goubert: decommission mw2281.codfw mw22[83-90].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1042201 (https://phabricator.wikimedia.org/T367275) [10:25:00] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [10:25:00] (i.e., I’d like to deploy it now already, to save time later) [10:25:06] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [10:25:36] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [10:25:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P64684 and previous config saved to /var/cache/conftool/dbconfig/20240612-102540-marostegui.json [10:25:42] 06SRE, 06Growth-Team, 10GrowthExperiments-Homepage, 07Grafana: Growth team product KPI Grafana dashboard has `update_` task type, which does not exist - https://phabricator.wikimedia.org/T362633#9883647 (10fgiunchedi) 05Open→03Resolved a:03fgiunchedi {{done}}; resolving though reopen if sth is amiss [10:26:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti5007.eqsin.wmnet [10:27:03] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1019.eqiad.wmnet [10:27:12] (03PS1) 10Majavah: hieradata: Move cloudvirt1031 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1042202 (https://phabricator.wikimedia.org/T364457) [10:27:35] Lucas_WMDE: as the oncall person I'm fine with that FWIW [10:27:50] alright, then I’ll start :) [10:27:56] if anyone objects there’s plenty of time before CI will finish anyway ^^ [10:27:57] thanks! 
[10:28:04] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/EntitySchema] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1041697 (owner: 10Lucas Werkmeister (WMDE)) [10:28:28] sure np, cc Amir1 ^ as the other oncall person [10:28:48] (^ that change removes a Phan comment and nothing else, I would be hard pressed to think of a less risky change ;)) [10:29:36] (03PS2) 10Majavah: hieradata: Move cloudvirt1031 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1042202 (https://phabricator.wikimedia.org/T364457) [10:29:55] (03Abandoned) 10Clément Goubert: decommission mw2281.codfw mw22[83-90].codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1042199 (https://phabricator.wikimedia.org/T367275) (owner: 10Clément Goubert) [10:29:57] sure [10:30:48] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1042202 (https://phabricator.wikimedia.org/T364457) (owner: 10Majavah) [10:31:00] Lucas_WMDE: Good for me, cc effie so we don't start rebooting stuff :) [10:32:10] hehe [10:32:20] if you do want to reboot stuff, let me know and I can pause deploying :) [10:33:09] nah nah, it's a go for me and it doesn't look like e.ffie has started [10:33:52] !log taavi@cumin1002 START - Cookbook sre.hosts.reimage for host cloudvirt1031.eqiad.wmnet with OS bookworm [10:34:13] (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: Move cloudvirt1031 to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1042202 (https://phabricator.wikimedia.org/T364457) (owner: 10Majavah) [10:35:09] (03PS1) 10Hnowlan: svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T337139) [10:35:18] RECOVERY - MariaDB read only wikireplica-s4 on clouddb1019 is OK: Version 10.6.17-MariaDB, Uptime 21s, 
read_only: True, event_scheduler: False, 117.78 QPS, connection latency: 0.020492s, query latency: 0.002276s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:35:48] RECOVERY - MariaDB read only s4 on clouddb1019 is OK: Version 10.6.17-MariaDB, Uptime 52s, read_only: True, event_scheduler: False, 1894.27 QPS, connection latency: 0.015286s, query latency: 0.000421s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:35:54] RECOVERY - mysqld processes on clouddb1019 is OK: PROCS OK: 2 processes with command name mysqld https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting [10:35:56] RECOVERY - MariaDB Replica IO: s4 on clouddb1019 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:36:06] RECOVERY - MariaDB Replica SQL: s4 on clouddb1019 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:36:08] RECOVERY - MariaDB Replica SQL: s6 on clouddb1019 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:36:41] (03PS2) 10Hnowlan: svg: use rsvg-convert's language parameter [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1042203 (https://phabricator.wikimedia.org/T261192) [10:36:48] RECOVERY - MariaDB read only s6 on clouddb1019 is OK: Version 10.6.17-MariaDB, Uptime 57s, read_only: True, event_scheduler: False, 208.20 QPS, connection latency: 0.022161s, query latency: 0.000568s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:36:48] RECOVERY - MariaDB read only wikireplica-s6 on clouddb1019 is OK: Version 10.6.17-MariaDB, Uptime 57s, read_only: True, event_scheduler: False, 208.11 QPS, connection latency: 0.023441s, query latency: 0.000531s 
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only [10:36:56] RECOVERY - MariaDB Replica IO: s6 on clouddb1019 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [10:40:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P64685 and previous config saved to /var/cache/conftool/dbconfig/20240612-104047-marostegui.json [10:41:07] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1019.eqiad.wmnet [10:41:20] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Global Advocacy" "Wikimedia Foundation/Legal/Global Advocacy" "Zabe" --reason "per request [[:phab:T367219|T367219]]" [10:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:41:23] T367219: Request to move translatable page: Global Advocacy - https://phabricator.wikimedia.org/T367219 [10:42:38] (03PS1) 10Muehlenhoff: Extend access for wangombe [puppet] - 10https://gerrit.wikimedia.org/r/1042204 [10:45:12] jouncebot: nowandnext [10:45:12] For the next 0 hour(s) and 14 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T1000) [10:45:12] In 0 hour(s) and 14 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T1100) [10:45:45] (03CR) 10Muehlenhoff: [C:03+2] Extend access for wangombe [puppet] - 10https://gerrit.wikimedia.org/r/1042204 (owner: 10Muehlenhoff) [10:46:39] !log kamila@cumin1002 conftool action : set/pooled=inactive; selector: name=wikikube-ctrl1003.eqiad.wmnet [10:48:22] !log kamila@cumin1002 START - Cookbook sre.hosts.decommission for hosts wikikube-ctrl1003.eqiad.wmnet [10:51:44] 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 
07Kubernetes: Relabel eqiad wikikube worker nodes - https://phabricator.wikimedia.org/T367285 (10Clement_Goubert) 03NEW [10:52:08] !log taavi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudvirt1031.eqiad.wmnet with reason: host reimage [10:53:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5007.eqsin.wmnet [10:53:38] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Global Advocacy/About" "Wikimedia Foundation/Legal/Global Advocacy/About" "Zabe" --reason "per request [[:phab:T367219|T367219]]" [10:53:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:41] T367219: Request to move translatable page: Global Advocacy - https://phabricator.wikimedia.org/T367219 [10:54:23] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [10:54:45] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudvirt1031.eqiad.wmnet with reason: host reimage [10:55:31] (03Merged) 10jenkins-bot: EntitySchemaSlotViewRenderer: Fix Phan failure [extensions/EntitySchema] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1041697 (owner: 10Lucas Werkmeister (WMDE)) [10:55:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T367261)', diff saved to https://phabricator.wikimedia.org/P64686 and previous config saved to /var/cache/conftool/dbconfig/20240612-105554-marostegui.json [10:55:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1198.eqiad.wmnet with reason: Maintenance [10:55:58] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [10:56:06] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1041697|EntitySchemaSlotViewRenderer: Fix Phan failure]] [10:56:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1198.eqiad.wmnet with 
reason: Maintenance [10:56:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T367261)', diff saved to https://phabricator.wikimedia.org/P64687 and previous config saved to /var/cache/conftool/dbconfig/20240612-105615-marostegui.json [10:57:06] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Global Advocacy/Conversation hours and Events" "Wikimedia Foundation/Legal/Global Advocacy/Conversation hours and Events" "Zabe" --reason "per request [[:phab:T367219|T367219]]" [10:57:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:42] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:1041697|EntitySchemaSlotViewRenderer: Fix Phan failure]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [10:58:44] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-ctrl1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002" [10:58:46] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync [10:59:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5007.eqsin.wmnet [11:00:05] mvolz: #bothumor I � Unicode. All rise for Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T1100). 
[11:00:17] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti5007.eqsin.wmnet [11:01:04] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Wikimedia Foundation Legal department" "Wikimedia Foundation/Legal" "Zabe" --reason "per request [[:phab:T367216|T367216]]" [11:01:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:08] T367216: Request to move translatable page: Wikimedia Foundation Legal department - https://phabricator.wikimedia.org/T367216 [11:02:42] mvolz: I’m still deploying a harmless backport (CI took a while), I hope that doesn’t interfere with your deployments [11:02:51] (currently 53% through the k8s deployment, so maybe 5 more minutes or so) [11:03:08] ok. lmk when you're done! [11:03:18] will do [11:03:31] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: wikikube-ctrl1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - kamila@cumin1002" [11:03:31] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:03:32] !log kamila@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts wikikube-ctrl1003.eqiad.wmnet [11:03:37] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9883835 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by kamila@cumin1002 for hosts: `wikikube-ctrl1003.eqiad.... 
[11:04:29] !log zabe@mwmaint1002:~$ mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "Trust and Safety" "Wikimedia Foundation/Legal/Community Resilience and Sustainability/Trust and Safety" "Zabe" --reason "per request [[:phab:T367217|T367217]]" [11:04:31] (03CR) 10Hnowlan: [C:03+2] mw-web, mw-api-ext: Raise replicas for 95% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039196 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [11:04:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:32] T367217: Request to move translatable page: Trust and Safety - https://phabricator.wikimedia.org/T367217 [11:05:23] (03Merged) 10jenkins-bot: mw-web, mw-api-ext: Raise replicas for 95% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039196 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [11:05:41] hm, should I be concerned that the last visible k8s output line is “K8s deployment progress: 95% (ok: 1981; fail: 0; left: 85)”? [11:05:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T367261)', diff saved to https://phabricator.wikimedia.org/P64688 and previous config saved to /var/cache/conftool/dbconfig/20240612-110541-marostegui.json [11:05:46] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [11:05:49] (and then it proceeded with the rest of the deployment) [11:05:59] or is that just an artifact of how the output is printed? 
^^ [11:06:07] cc claime in case you know [11:06:29] I don't know, but I imagine it's an artifact yeah [11:06:41] (03CR) 10Muehlenhoff: [C:03+2] Stop sending cross check mails to sre-foundations [puppet] - 10https://gerrit.wikimedia.org/r/1042164 (owner: 10Muehlenhoff) [11:06:43] yeah, ok [11:06:51] I’m happy to leave it at that for now [11:07:04] (I guess I can see whether it recurs with the other backports later anyway) [11:07:11] (03CR) 10Giuseppe Lavagetto: [C:03+1] override circuit breaking threshold for ES hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042197 (owner: 10Ladsgroup) [11:08:17] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/mw-web: apply [11:08:17] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1041697|EntitySchemaSlotViewRenderer: Fix Phan failure]] (duration: 12m 10s) [11:08:31] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply [11:08:44] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:09:15] (03PS1) 10Clément Goubert: trafficserver: move 95% of traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1042205 (https://phabricator.wikimedia.org/T362323) [11:09:25] FIRING: [10x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:09:52] hnowlan: maybe don't deploy releases during the backports? 
x) [11:10:13] yeep, just realised after it finished :[ [11:10:40] !log rebalance ganeti cluster in eqsin following reboots [11:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:20] mvolz: I’m all done [11:11:32] (for now ^^) [11:11:45] 👍️ [11:11:48] (will resume during the backport window later, maybe starting during the break before already. but your window’s all yours ^^) [11:12:01] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [11:12:23] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:12:35] heh, according to php-fpm-restart we’re down to 69 bare-metal hosts? [11:12:42] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:12:56] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [11:13:06] Lucas_WMDE: yeah, gonna keep going down fast now, we're pushing to 95% for mw-on-k8s probably today [11:13:30] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:14:04] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:14:25] FIRING: [11x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:14:38] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:15:47] claime: \o/ [11:15:55] (but, no more funny number? 
😔) [11:16:20] (no sorry) [11:19:25] FIRING: [11x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:20:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P64689 and previous config saved to /var/cache/conftool/dbconfig/20240612-112048-marostegui.json [11:21:00] (03PS1) 10Lucas Werkmeister (WMDE): Load EntitySchema on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042208 (https://phabricator.wikimedia.org/T363153) [11:22:02] (03CR) 10Lucas Werkmeister (WMDE): Load EntitySchema on Test Wikidata clients (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042208 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE)) [11:22:15] RESOLVED: AppserversUnreachable: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-site=eqiad&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [11:22:25] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudvirt1031.eqiad.wmnet with OS bookworm [11:23:30] 10ops-eqiad, 06DC-Ops, 10Prod-Kubernetes, 06serviceops, 07Kubernetes: Relabel codfw wikikube worker nodes - https://phabricator.wikimedia.org/T367286 (10Clement_Goubert) 03NEW [11:26:18] (03PS1) 10Lucas Werkmeister (WMDE): Enable EntitySchema data type on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042209 (https://phabricator.wikimedia.org/T332157) [11:26:26] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1041643 (owner: 10PipelineBot) [11:27:19] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041643 (owner: 10PipelineBot) [11:28:52] jouncebot: nowandnext [11:28:52] For the next 0 hour(s) and 31 minute(s): Services – Citoid / Zotero (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T1100) [11:28:52] In 1 hour(s) and 31 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T1300) [11:28:59] mvolz: are you done with your deployments? [11:29:00] (03PS2) 10Lucas Werkmeister (WMDE): Load EntitySchema on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042208 (https://phabricator.wikimedia.org/T363153) [11:29:00] (03PS2) 10Lucas Werkmeister (WMDE): Enable EntitySchema data type on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042209 (https://phabricator.wikimedia.org/T332157) [11:29:08] no [11:30:52] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [11:31:07] hnowlan: I'll let you know when I'm done [11:31:11] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:31:51] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [11:32:17] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1042128 (https://phabricator.wikimedia.org/T367235) (owner: 10David Caro) [11:34:33] (03CR) 10David Caro: [C:03+2] wmf_sink.base: ignore also any saved host key [puppet] - 10https://gerrit.wikimedia.org/r/1042128 (https://phabricator.wikimedia.org/T367235) (owner: 10David Caro) [11:35:36] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [11:35:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P64690 and previous config saved to /var/cache/conftool/dbconfig/20240612-113556-marostegui.json [11:35:59] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [11:36:01] (03CR) 10Clément Goubert: [C:03+2] image-suggestion: Update image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042188 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [11:36:21] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [11:36:27] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [11:36:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1034.eqiad.wmnet [11:36:42] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply [11:36:48] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [11:36:49] (03Merged) 10jenkins-bot: image-suggestion: Update image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042188 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [11:37:10] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [11:37:16] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [11:37:18] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:37:42] !log 
cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [11:37:46] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [11:37:52] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [11:38:07] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [11:39:39] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [11:40:38] (03CR) 10JMeybohm: [C:03+2] mcrouter: Bump chart modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038859 (https://phabricator.wikimedia.org/T362978) (owner: 10Alexandros Kosiaris) [11:41:31] (03Merged) 10jenkins-bot: mcrouter: Bump chart modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038859 (https://phabricator.wikimedia.org/T362978) (owner: 10Alexandros Kosiaris) [11:41:45] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026954 (https://phabricator.wikimedia.org/T362978) (owner: 10Elukey) [11:42:00] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:42:29] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:43:47] (03CR) 10JMeybohm: [C:03+2] eventstreams: add securityContext to all production containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037861 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [11:44:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1191', diff saved to https://phabricator.wikimedia.org/P64691 and previous config saved to /var/cache/conftool/dbconfig/20240612-114410-root.json [11:44:18] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/image-suggestion: apply [11:44:25] FIRING: [8x] SystemdUnitFailed: dump_cloud_ip_ranges.service on puppetmaster2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - 
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:44:48] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/image-suggestion: apply [11:44:52] (03Merged) 10jenkins-bot: eventstreams: add securityContext to all production containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037861 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [11:45:18] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/mw-mcrouter: apply [11:45:23] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-mcrouter: apply [11:45:39] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/image-suggestion: apply [11:45:50] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mw-mcrouter: apply [11:45:59] (03PS4) 10Clément Goubert: trafficserver: move k8s traffic shift to 90% [puppet] - 10https://gerrit.wikimedia.org/r/1028844 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [11:46:06] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/image-suggestion: apply [11:47:00] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1034.eqiad.wmnet [11:47:01] (03PS5) 10Hnowlan: trafficserver: move k8s traffic shift to 95% [puppet] - 10https://gerrit.wikimedia.org/r/1028844 (https://phabricator.wikimedia.org/T362323) [11:47:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P64692 and previous config saved to /var/cache/conftool/dbconfig/20240612-114705-root.json [11:47:19] hnowlan: wait I did the 90% bump yesterday [11:47:47] hnowlan: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1042205 for 95 [11:47:54] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Also remove dummy keytabs for decommed stat servers [labs/private] - 10https://gerrit.wikimedia.org/r/1041686 (owner: 10Muehlenhoff) [11:48:58] hnowlan: 
I am done deploying [11:49:02] (03CR) 10Brouberol: [C:03+2] deployment_server: alert on admin-ng pending changes [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [11:49:02] https://gerrit.wikimedia.org/r/c/operations/puppet/+/1041589 was the one that ended up merged, I guess I duplicated? [11:49:12] Related: Does anyone know how to add fields to the logstash index? https://logstash.wikimedia.org/app/management/opensearch-dashboards/indexPatterns/patterns/logstash-*#/?_a=h@9293420 [11:49:22] (03CR) 10Muehlenhoff: [C:03+2] Change ping host in codfw to ping1004 [homer/public] - 10https://gerrit.wikimedia.org/r/1041687 (https://phabricator.wikimedia.org/T366695) (owner: 10Muehlenhoff) [11:49:25] FIRING: [9x] SystemdUnitFailed: ferm.service on kubernetes1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:49:36] I have added some new fields I'd like to index and it let's me edit existing ones but I can't see where to add new ones [11:49:41] lets* [11:49:46] FIRING: Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [11:50:02] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply [11:50:22] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [11:50:23] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [11:51:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T367261)', diff saved to https://phabricator.wikimedia.org/P64693 and previous config saved to /var/cache/conftool/dbconfig/20240612-115103-marostegui.json [11:51:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 
12:00:00 on db1212.eqiad.wmnet with reason: Maintenance [11:51:07] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [11:51:08] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [11:51:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1212.eqiad.wmnet with reason: Maintenance [11:51:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:51:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [11:51:39] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [11:51:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T367261)', diff saved to https://phabricator.wikimedia.org/P64695 and previous config saved to /var/cache/conftool/dbconfig/20240612-115143-marostegui.json [11:52:28] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [11:52:29] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [11:52:44] mvolz: thanks! 
[11:52:59] claime: I was just repurposing the redundant 90% CR [11:53:06] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [11:53:09] ah [11:53:15] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [11:53:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1034.eqiad.wmnet [11:53:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1034.eqiad.wmnet [11:53:40] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [11:54:25] FIRING: [10x] SystemdUnitFailed: ferm.service on kubernetes1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:54:36] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1033.eqiad.wmnet [11:54:48] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [11:54:58] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [11:55:27] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [11:55:42] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [11:56:01] (03CR) 10Hnowlan: [C:03+1] trafficserver: move 95% of traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1042205 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [11:56:23] (03Abandoned) 10Hnowlan: trafficserver: move k8s traffic shift to 95% [puppet] - 10https://gerrit.wikimedia.org/r/1028844 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [11:57:09] !log Manual restart of dump_cloud_ip_ranges.service on A:puppetserver and A:puppetmaster [11:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:57:38] 
claime: bumps are done for 95% (but ofc whenever suits you) [11:57:55] hnowlan: awesome, will merge after lunch <3 [11:58:17] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [11:58:35] (03PS1) 10Brouberol: helmfile: fix typo in python script puppet path [puppet] - 10https://gerrit.wikimedia.org/r/1042215 (https://phabricator.wikimedia.org/T331894) [11:59:01] (03PS2) 10Brouberol: helmfile: fix typo in python script puppet path [puppet] - 10https://gerrit.wikimedia.org/r/1042215 (https://phabricator.wikimedia.org/T331894) [11:59:05] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [11:59:06] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [11:59:25] FIRING: [10x] SystemdUnitFailed: ferm.service on kubernetes1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:59:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T367261)', diff saved to https://phabricator.wikimedia.org/P64696 and previous config saved to /var/cache/conftool/dbconfig/20240612-115934-marostegui.json [11:59:38] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [12:00:00] PROBLEM - Check whether ferm is active by checking the default input chain on wikikube-ctrl1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:00:12] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [12:00:51] (03CR) 10Btullis: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1042215 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [12:02:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P64697 and previous config saved to /var/cache/conftool/dbconfig/20240612-120211-root.json [12:02:18] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2898/co" [puppet] - 10https://gerrit.wikimedia.org/r/1042215 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [12:02:39] (03CR) 10Brouberol: [V:03+1 C:03+2] helmfile: fix typo in python script puppet path [puppet] - 10https://gerrit.wikimedia.org/r/1042215 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [12:03:40] FIRING: [3x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2088:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:17] (03CR) 10JMeybohm: [C:03+2] kask: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037195 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [12:04:25] FIRING: [13x] SystemdUnitFailed: ferm.service on kubernetes1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:04:58] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1033.eqiad.wmnet [12:05:17] (03Merged) 10jenkins-bot: kask: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037195 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [12:05:18] RECOVERY - Check 
whether ferm is active by checking the default input chain on wikikube-ctrl1001 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:05:18] PROBLEM - Check whether ferm is active by checking the default input chain on mw2387 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:05:18] PROBLEM - Check whether ferm is active by checking the default input chain on parse2008 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:05:38] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-mcrouter: apply [12:06:52] PROBLEM - Host kubestagemaster1003 is DOWN: PING CRITICAL - Packet loss = 100% [12:08:45] FIRING: [3x] ProbeDown: Service ganeti1034:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:09:19] (03PS1) 10JMeybohm: mw-mcrouter: Deployments of the deamonset take longer than 5min [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042218 [12:09:25] FIRING: [14x] SystemdUnitFailed: ferm.service on kubernetes1018:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:09:36] (03CR) 10JMeybohm: "Just FYI" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042218 (owner: 10JMeybohm) [12:29:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P64700 and previous config saved to /var/cache/conftool/dbconfig/20240612-122948-marostegui.json [12:31:15] 10ops-codfw, 06SRE, 06DC-Ops, 
06Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9884096 (10klausman) [12:31:47] RESOLVED: HelmReleaseBadStatus: Helm release mw-mcrouter/main on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=mw-mcrouter - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [12:32:02] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: Q3:rack/setup/install ml-staging2003 - https://phabricator.wikimedia.org/T357415#9884100 (10klausman) One note: since the default OS has changed (Bullseye->Bookworm), I updated the ticket desc accordingly --- we definitely want Bookworm. [12:32:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P64701 and previous config saved to /var/cache/conftool/dbconfig/20240612-123222-root.json [12:34:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ping1003.eqiad.wmnet [12:35:18] RECOVERY - Check whether ferm is active by checking the default input chain on mw2387 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:35:19] RECOVERY - Check whether ferm is active by checking the default input chain on parse2008 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [12:36:22] FIRING: [4x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [12:38:25] (03PS2) 10JMeybohm: [WIP] Global update of test-service-checker template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032714 (https://phabricator.wikimedia.org/T362978) [12:38:42] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.hosts.reboot-single (exit_code=0) for host ping1003.eqiad.wmnet [12:38:44] (03PS3) 10JMeybohm: Global update of test-service-checker template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032714 (https://phabricator.wikimedia.org/T362978) [12:39:37] (03CR) 10CI reject: [V:04-1] Global update of test-service-checker template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032714 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [12:40:01] (03CR) 10Santiago Faci: [C:03+1] "Just wondering something about configuring the localQuorum for edit and editor. In case you consider that is not needed, let's go with thi" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [12:41:04] (03CR) 10Ladsgroup: [C:03+2] override circuit breaking threshold for ES hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042197 (owner: 10Ladsgroup) [12:41:42] (03Merged) 10jenkins-bot: override circuit breaking threshold for ES hosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042197 (owner: 10Ladsgroup) [12:42:17] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1042197|override circuit breaking threshold for ES hosts]] [12:42:31] (03PS4) 10JMeybohm: Global update of test-service-checker template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032714 (https://phabricator.wikimedia.org/T362978) [12:43:09] (03PS2) 10Brouberol: helmfile: set HELM environment variables for the admin-ng systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/1042224 (https://phabricator.wikimedia.org/T331894) [12:44:49] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1042197|override circuit breaking threshold for ES hosts]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:44:52] jouncebot: nowandnext [12:44:52] No deployments scheduled for the next 0 hour(s) and 15 minute(s) [12:44:52] In 0 hour(s) and 15 minute(s): UTC 
afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T1300) [12:44:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T367261)', diff saved to https://phabricator.wikimedia.org/P64702 and previous config saved to /var/cache/conftool/dbconfig/20240612-124456-marostegui.json [12:44:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1223.eqiad.wmnet with reason: Maintenance [12:45:00] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [12:45:11] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1223.eqiad.wmnet with reason: Maintenance [12:45:18] I’ll start some gate-and-submit for my backports already if that’s okay [12:45:26] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] Only register EntitySchema namespace when feature is enabled [extensions/EntitySchema] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1041678 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE)) [12:46:26] oh, looks like Amir1 is still deploying [12:46:45] yeah, I hope I'll be done soon [12:47:13] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on logstash1030.eqiad.wmnet with reason: reboot/ganeti [12:47:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P64703 and previous config saved to /var/cache/conftool/dbconfig/20240612-124727-root.json [12:47:28] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on logstash1030.eqiad.wmnet with reason: reboot/ganeti [12:47:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1020.eqiad.wmnet [12:50:31] (03CR) 10Hoo man: [C:03+1] Load EntitySchema on Test Wikidata clients [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042208 
(https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE)) [12:50:33] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [12:50:50] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance [12:50:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1240.eqiad.wmnet with reason: Maintenance [12:51:50] (03CR) 10Hoo man: [C:03+1] Enable EntitySchema data type on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042209 (https://phabricator.wikimedia.org/T332157) (owner: 10Lucas Werkmeister (WMDE)) [12:52:27] (03CR) 10Brouberol: [C:03+2] wdqs: add wdqs2023 as a scap target [puppet] - 10https://gerrit.wikimedia.org/r/1042196 (owner: 10DCausse) [12:53:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1020.eqiad.wmnet [12:53:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1020.eqiad.wmnet [12:54:11] (03CR) 10Btullis: "This looks generally good, but just because prometheus::node_textfile is deployed to every single server, let's requests a review from out" [puppet] - 10https://gerrit.wikimedia.org/r/1042224 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [12:54:13] (03PS1) 10Vgutierrez: hiera: Enable IPIP on high-traffic1@esams for text services [puppet] - 10https://gerrit.wikimedia.org/r/1042229 (https://phabricator.wikimedia.org/T366466) [12:54:15] (03PS1) 10Vgutierrez: hiera: Enable IPIP on text@esams [puppet] - 10https://gerrit.wikimedia.org/r/1042230 (https://phabricator.wikimedia.org/T366466) [12:54:55] (03CR) 10Brouberol: [V:03+1] helmfile: set HELM environment variables for the admin-ng systemd jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1042224 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [12:55:13] (03PS1) 10Vgutierrez: depool text@esams before 
enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1042231 (https://phabricator.wikimedia.org/T366466) [12:56:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1021.eqiad.wmnet [12:56:12] (03PS3) 10Brouberol: helmfile: set HELM environment variables for the admin-ng systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/1042224 (https://phabricator.wikimedia.org/T331894) [12:56:14] (03CR) 10CI reject: [V:04-1] depool text@esams before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1042231 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:57:15] (03CR) 10Brouberol: helmfile: set HELM environment variables for the admin-ng systemd jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1042224 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [12:57:18] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1042230 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [12:57:56] info: admin_state: checking state file '/tmp/dns-check.wfzjokyj/state/admin_state'... [12:57:56] error: In rrset '_etcd-server-ssl._tcp.k8s3.eqiad.wmnet. SRV', same-zone target 'wikikube-ctrl1003.eqiad.wmnet.' 
has no addresses [12:58:12] dns repo CI is complaining about wikikube-ctrl1003 [12:58:48] kamila_: ^ [12:58:51] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1042197|override circuit breaking threshold for ES hosts]] (duration: 16m 34s) [12:58:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T364069)', diff saved to https://phabricator.wikimedia.org/P64704 and previous config saved to /var/cache/conftool/dbconfig/20240612-125853-marostegui.json [12:58:58] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [12:59:39] vgutierrez: fixing shortly [12:59:53] it'll be up later today [12:59:59] kamila_: #define later [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T1300). [13:00:05] Lucas_WMDE: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:07] that's blocking me at the moment :) [13:00:14] oh, sorry, where is it? 
[13:00:17] basically is blocking the whole DNS repo [13:00:20] o/ [13:00:24] crap, sorry, I had no idea [13:00:37] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/EntitySchema] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1041678 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE)) [13:00:44] later == 14:something UTC [13:01:17] kamila_: please ping me as soon as it's fixed [13:01:38] vgutierrez: I'll see if I can fix it immediately [13:01:41] (03PS4) 10Brouberol: helmfile: set HELM environment variables for the admin-ng systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/1042224 (https://phabricator.wikimedia.org/T331894) [13:01:55] kamila_: so if the instance is not up the easiest fix would be to drop the SRV record [13:02:04] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:02:13] aka reverting 15f5b5f6 [13:02:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [13:02:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1191 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P64705 and previous config saved to /var/cache/conftool/dbconfig/20240612-130232-root.json [13:02:48] vgutierrez: if it unblocks you, go ahead I guess, I'll deal with it somehow [13:03:23] but wouldn't changing the state in netbox to planned also fix it? [13:03:43] I don't think so [13:03:57] the error is the following one [13:03:59] error: In rrset '_etcd-server-ssl._tcp.k8s3.eqiad.wmnet. SRV', same-zone target 'wikikube-ctrl1003.eqiad.wmnet.' 
has no addresses [13:04:19] I can make addresses appear in netbox [13:04:31] that would solve it [13:08:43] vgutierrez: actually I can't since I don't know the future location yet, sorry [13:08:59] kamila_: so is it safe to remove the SRV record? [13:09:03] (03Merged) 10jenkins-bot: Only register EntitySchema namespace when feature is enabled [extensions/EntitySchema] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1041678 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE)) [13:09:31] vgutierrez: yeah [13:09:35] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1041678|Only register EntitySchema namespace when feature is enabled (T363153)]] [13:09:36] (03PS1) 10Eevans: aqs: Upgrade cluster to Java 11 [puppet] - 10https://gerrit.wikimedia.org/r/1042234 (https://phabricator.wikimedia.org/T350567) [13:09:38] Lucas_WMDE: I've been done [13:09:39] I'll put it back when the host is ready [13:09:39] T363153: [ES-M2]: Load EntitySchema data type registration for WikibaseClient on client wikis - https://phabricator.wikimedia.org/T363153 [13:09:41] (03PS2) 10JMeybohm: kserve-inference: add securityContext explicit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026954 (https://phabricator.wikimedia.org/T362978) (owner: 10Elukey) [13:09:51] kamila_: do you have a phabricator task for wikikube-ctrl1003 work? 
[13:09:58] (just to append it to the revert commit) [13:10:13] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [13:10:20] vgutierrez: https://phabricator.wikimedia.org/T366204 [13:10:22] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:10:22] PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:10:25] Amir1: I noticed, thanks :) [13:10:26] (03CR) 10CI reject: [V:04-1] kserve-inference: add securityContext explicit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026954 (https://phabricator.wikimedia.org/T362978) (owner: 10Elukey) [13:10:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2139.codfw.wmnet with reason: Maintenance [13:10:39] * Lucas_WMDE deploying now [13:10:40] PROBLEM - OSPF status on cr1-magru is CRITICAL: OSPFv2: 1/2 UP : OSPFv3: 1/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:10:46] kamila_: thx [13:10:50] PROBLEM - BGP status on cr1-magru is CRITICAL: BGP CRITICAL - AS12956/IPv4: Connect - Telxius, AS12956/IPv6: Connect - Telxius https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [13:11:19] (03PS1) 10Vgutierrez: Revert "Add wikikube-ctrl1003 to server SRV record for etcd 4" [dns] - 10https://gerrit.wikimedia.org/r/1042237 (https://phabricator.wikimedia.org/T366204) [13:11:26] (03PS2) 10Vgutierrez: Revert "Add wikikube-ctrl1003 to server SRV record for etcd 4" [dns] - 10https://gerrit.wikimedia.org/r/1042237 (https://phabricator.wikimedia.org/T366204) [13:11:44] kamila_: https://gerrit.wikimedia.org/r/c/operations/dns/+/1042237 [13:12:05] vgutierrez: thank you (and sorry) [13:12:09] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:1041678|Only register 
EntitySchema namespace when feature is enabled (T363153)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:12:11] testing… [13:12:22] no problem :) just review the change [13:12:57] (03CR) 10Kamila Součková: [C:03+1] Revert "Add wikikube-ctrl1003 to server SRV record for etcd 4" [dns] - 10https://gerrit.wikimedia.org/r/1042237 (https://phabricator.wikimedia.org/T366204) (owner: 10Vgutierrez) [13:13:06] thx :D [13:13:15] (03CR) 10Vgutierrez: [C:03+2] Revert "Add wikikube-ctrl1003 to server SRV record for etcd 4" [dns] - 10https://gerrit.wikimedia.org/r/1042237 (https://phabricator.wikimedia.org/T366204) (owner: 10Vgutierrez) [13:13:15] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 1:00:00 on logstash1031.eqiad.wmnet with reason: reboot/ganeti [13:13:20] working as far as I can tell [13:13:22] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync [13:13:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on logstash1031.eqiad.wmnet with reason: reboot/ganeti [13:13:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1021.eqiad.wmnet [13:13:56] (03PS2) 10Vgutierrez: depool text@esams before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1042231 (https://phabricator.wikimedia.org/T366466) [13:14:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P64706 and previous config saved to /var/cache/conftool/dbconfig/20240612-131400-marostegui.json [13:16:08] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1042234 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans) [13:17:24] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/EntitySchema] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1041677 (https://phabricator.wikimedia.org/T363153) 
(owner: 10Lucas Werkmeister (WMDE)) [13:17:29] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [extensions/EntitySchema] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1041679 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE)) [13:17:39] 06SRE, 10SRE-Access-Requests: Requesting access to private data-based dashboards for Jsn.sherman - https://phabricator.wikimedia.org/T367295 (10jsn.sherman) 03NEW [13:18:13] 06SRE, 10SRE-Access-Requests: Requesting access to private data-based dashboards for Jsn.sherman - https://phabricator.wikimedia.org/T367295#9884252 (10jsn.sherman) [13:18:15] 06SRE, 10Cassandra, 06Data-Persistence, 13Patch-For-Review: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567#9884253 (10MoritzMuehlenhoff) Can you please uninstall openjdk-8-* on the migrated clusters? (simply run dpkg --remove openjdk-8-jdk openjre-8-jre openjdk-8-jdk-headless open... [13:18:21] Lucas_WMDE: o/ I have a mw config I'd like to deploy whenever you are done. [13:18:24] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on aqs1010.eqiad.wmnet with reason: Troubleshooting remote logging — T350567 [13:18:28] T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567 [13:18:38] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on aqs1010.eqiad.wmnet with reason: Troubleshooting remote logging — T350567 [13:18:51] 06SRE, 10Cassandra, 06Data-Persistence, 13Patch-For-Review: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567#9884255 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=36667e14-e2ab-458d-af49-c424df72a544) set by eevans@cumin1002 for 1 day, 0:00:00 on 1 host(s) and... 
[13:18:51] (03PS3) 10JMeybohm: kserve-inference: add securityContext explicit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026954 (https://phabricator.wikimedia.org/T362978) (owner: 10Elukey) [13:19:14] ottomata: alright! [13:19:19] (03CR) 10Btullis: dse-k8s-services: Add net-new chart for Airflow (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [13:19:24] can you add it to the calendar page already so I don’t forget? [13:19:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1021.eqiad.wmnet [13:20:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1021.eqiad.wmnet [13:20:39] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 6 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1042230 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [13:21:50] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1041678|Only register EntitySchema namespace when feature is enabled (T363153)]] (duration: 12m 15s) [13:21:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1022.eqiad.wmnet [13:21:55] T363153: [ES-M2]: Load EntitySchema data type registration for WikibaseClient on client wikis - https://phabricator.wikimedia.org/T363153 [13:22:33] yes oo i can try the new thing bd808 built... 
[13:22:46] https://schedule-deployment.toolforge.org/ [13:23:09] yesss [13:23:31] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [13:23:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2149.codfw.wmnet with reason: Maintenance [13:23:45] FIRING: [2x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:23:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T367261)', diff saved to https://phabricator.wikimedia.org/P64707 and previous config saved to /var/cache/conftool/dbconfig/20240612-132351-marostegui.json [13:24:29] aww Internal Server Error [13:24:59] oop refresh might be working [13:25:26] ah this window is no longer available for scheduling [13:25:29] !log dcausse@deploy1002 Started deploy [wdqs/wdqs@43b966f]: deploy to test server wdqs2023 [13:25:36] ah, because it’s ongoing? [13:25:44] !log dcausse@deploy1002 Finished deploy [wdqs/wdqs@43b966f]: deploy to test server wdqs2023 (duration: 00m 14s) [13:25:46] FIRING: [2x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:26:04] !log depool text@esams before enabling IPIP encapsulation - T366466 [13:26:42] hmm all good logmsgbot? 
:) [13:26:46] yeah i think so [13:26:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:23] that's some lag :) [13:27:39] !log sukhe@cumin1002 START - Cookbook sre.dns.netbox [13:28:36] manually edited [13:28:37] !log add ntp-[abc].anycast.wmnet: 10.3.0.[5-7]/32: T366360 [13:28:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:42] T366360: Anycast NTP and update the list of timeservers for P:systemd::timesyncd - https://phabricator.wikimedia.org/T366360 [13:28:45] RESOLVED: [2x] ProbeDown: Service aqs1010-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:29:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P64708 and previous config saved to /var/cache/conftool/dbconfig/20240612-132907-marostegui.json [13:30:02] !log brouberol@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [13:30:07] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 [13:30:20] !log sukhe@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add ntp-[abc].anycast.wmnet addresses - sukhe@cumin1002" [13:31:14] !log sukhe@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: add ntp-[abc].anycast.wmnet addresses - sukhe@cumin1002" [13:31:15] !log sukhe@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:31:22] FIRING: [4x] 
SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [13:31:49] ^fixed, will resolve soon [13:33:55] !log brouberol@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [13:33:57] !log elukey@cumin1002 START - Cookbook sre.puppet.renew-cert for poolcounter1005.eqiad.wmnet: Renew puppet certificate - elukey@cumin1002 [13:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:34:55] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for aqs1010.eqiad.wmnet [13:34:55] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1010.eqiad.wmnet [13:35:08] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for poolcounter1005.eqiad.wmnet: Renew puppet certificate - elukey@cumin1002 [13:35:19] !log elukey@cumin1002 START - Cookbook sre.puppet.renew-cert for poolcounter1004.eqiad.wmnet: Renew puppet certificate - elukey@cumin1002 [13:36:22] RESOLVED: [4x] SessionStoreOnNonDedicatedHost: Sessionstore k8s pods are running on non-dedicated hosts - TODO - TODO - https://alerts.wikimedia.org/?q=alertname%3DSessionStoreOnNonDedicatedHost [13:36:33] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for poolcounter1004.eqiad.wmnet: Renew puppet certificate - elukey@cumin1002 [13:36:51] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1022.eqiad.wmnet [13:36:55] !log elukey@cumin1002 START - Cookbook sre.puppet.renew-cert for poolcounter2003.codfw.wmnet: 
Renew puppet certificate - elukey@cumin1002 [13:37:57] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for poolcounter2003.codfw.wmnet: Renew puppet certificate - elukey@cumin1002 [13:38:07] !log elukey@cumin1002 START - Cookbook sre.puppet.renew-cert for poolcounter2004.codfw.wmnet: Renew puppet certificate - elukey@cumin1002 [13:38:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T367261)', diff saved to https://phabricator.wikimedia.org/P64709 and previous config saved to /var/cache/conftool/dbconfig/20240612-133812-marostegui.json [13:38:44] PROBLEM - Host ml-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [13:39:07] noo my gate and submit failed /o\ [13:39:07] !log elukey@cumin1002 END (PASS) - Cookbook sre.puppet.renew-cert (exit_code=0) for poolcounter2004.codfw.wmnet: Renew puppet certificate - elukey@cumin1002 [13:39:17] random ECONNRESET from npm :cat_wat: [13:40:28] RECOVERY - Host ml-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.53 ms [13:41:34] so, what would happen if we were to deploy ottomata’s change before trying another gate-and-submit of my two changes? 
[13:41:46] given that one of my backports was merged, I assume that would be deployed as well then, at least to kubernetes [13:42:01] (and probably to bare-metal too – as discussed yesterday, scap backport does a full sync-world) [13:42:12] and I don’t think I want that without testing :| [13:42:35] so I think I’ll just have to try my backports again, and ottomata will have to wait a bit longer still, sorry :/ [13:42:36] if it was merged, I assume it would be deployed yes [13:42:43] (and this will all probably run into the Wikifunctions window, unfortunately) [13:42:52] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1022.eqiad.wmnet [13:42:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1022.eqiad.wmnet [13:44:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T364069)', diff saved to https://phabricator.wikimedia.org/P64710 and previous config saved to /var/cache/conftool/dbconfig/20240612-134414-marostegui.json [13:44:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance [13:44:18] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [13:44:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance [13:45:24] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on text@esams [puppet] - 10https://gerrit.wikimedia.org/r/1042230 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [13:45:35] !log Starting kafka-main reboots in eqiad [13:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:45:52] 10ops-codfw, 06SRE, 06DC-Ops, 10Observability-Metrics: Memory upgrade request for prometheus200[56] - https://phabricator.wikimedia.org/T360895#9884373 (10herron) 05Open→03Resolved This 
was completed yesterday (during stashbot outage, this task unfortunately missed the !log) [13:45:52] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s6 [13:45:57] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1019.eqiad.wmnet,service=s4 [13:46:03] !log cgoubert@cumin1002 START - Cookbook sre.kafka.roll-restart-reboot-brokers rolling reboot on A:kafka-main-eqiad [13:46:25] (03CR) 10CI reject: [V:04-1] Only register EntitySchema namespace when feature is enabled [extensions/EntitySchema] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1041679 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE)) [13:46:26] (03PS1) 10JMeybohm: toolhub: Add missing securityContext to CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042256 (https://phabricator.wikimedia.org/T362978) [13:46:27] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet,service=s5 [13:46:30] !log fnegri@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1020.eqiad.wmnet,service=s8 [13:46:33] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/EntitySchema] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1041679 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE)) [13:46:43] pfffft [13:46:45] scap backport refuses [13:46:46] > The change '1041679' failed build tests and could not be merged [13:46:53] I know it failed tests! 
that’s why I want to try again 😠 [13:47:14] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/EntitySchema] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1041679 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE)) [13:47:21] oh, but it still gave the +2 o_O [13:47:28] so running the command a third time, now it works again… [13:47:29] wat [13:48:32] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [13:48:33] !log fnegri@cumin1002 START - Cookbook sre.hosts.downtime for 4:00:00 on clouddb1020.eqiad.wmnet with reason: T366555 [13:48:46] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on clouddb1020.eqiad.wmnet with reason: T366555 [13:48:49] (03PS1) 10Slyngshede: Replace development server with uWSGI. [software/bitu] - 10https://gerrit.wikimedia.org/r/1042261 [13:48:55] * Lucas_WMDE looks for existing scap backport task about this behavior, if any [13:48:57] (03PS3) 10Dreamrimmer: maiwiki: Remove 'CA' namespace alias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031533 (https://phabricator.wikimedia.org/T363667) [13:49:07] !log depooled cp4037 to test benthos/haproxy configuration (T365718) [13:49:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:11] T365718: Switch HAProxy/Benthos to rfc5424 - https://phabricator.wikimedia.org/T365718 [13:50:01] (03PS2) 10Slyngshede: Replace development server with uWSGI. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1042261 [13:50:05] (03PS14) 10Gergő Tisza: [POC] Handle sso.wikimedia.org domain [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) [13:50:14] * Lucas_WMDE creates task [13:50:17] (03CR) 10Btullis: dse-k8s-services: Add net-new chart for Airflow (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [13:51:09] (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2903/co" [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [13:53:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P64712 and previous config saved to /var/cache/conftool/dbconfig/20240612-135319-marostegui.json [13:53:44] !log rolling restart of pybal on lvs3010 and lvs3008 - T366466 [13:53:45] (03PS27) 10DCausse: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [13:53:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:53:48] T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466 [13:54:25] filed T367302 for the scap behavior [13:54:25] T367302: Confusing scap backport behavior on gate-and-submit failure - https://phabricator.wikimedia.org/T367302 [13:54:26] (03CR) 10Jelto: [V:03+1 C:03+1] "lgtm now, thanks you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [13:54:53] !log dcausse@deploy1002 Started deploy [wdqs/wdqs@1cf4017]: deploy to test server wdqs2023 (fix loadData.sh) [13:55:06] !log dcausse@deploy1002 Finished deploy [wdqs/wdqs@1cf4017]: deploy to test server wdqs2023 (fix loadData.sh) (duration: 00m 13s) [13:55:38] (03CR) 10Jelto: [C:03+2] "lgtm, let me know when this should be merged" [puppet] - 10https://gerrit.wikimedia.org/r/1040567 (owner: 10Paladox) [13:56:44] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1038.eqiad.wmnet [13:58:07] (03PS1) 10Vgutierrez: Revert "depool text@esams before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1042265 (https://phabricator.wikimedia.org/T366466) [13:58:21] (03CR) 10Scott French: [C:03+1] Global update of test-service-checker template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032714 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [13:59:02] (03CR) 10JMeybohm: [C:03+2] Global update of test-service-checker template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032714 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [13:59:10] (03PS1) 10Gergő Tisza: tests: Replace yaml_parse_file with symfony/yaml [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042266 [13:59:13] 06SRE, 10SRE-Access-Requests: Requesting access to private data-based dashboards for Jsn.sherman - https://phabricator.wikimedia.org/T367295#9884454 (10DMburugu) Approved [14:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T1400) [14:00:24] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [14:00:39] * James_F waves. 
[14:00:46] I’m still deploying, sorry :( [14:01:19] (03PS28) 10DCausse: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [14:01:19] if you’re deploying with helmfile then I’m not sure if that blocks you or not [14:01:32] (03CR) 10DCausse: wdqs.data-reload: various fixes (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [14:01:44] (03CR) 10Btullis: dse-k8s-services: Add net-new chart for Airflow (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [14:01:49] depends on the namespace [14:01:57] Lucas_WMDE: It probably doesn't, but I've got to create the commits first, so no rush! [14:02:12] ok [14:02:15] !log mvernon@cumin1002 START - Cookbook sre.hosts.reimage for host moss-be1003.eqiad.wmnet with OS bookworm [14:02:15] (03CR) 10Vgutierrez: [C:03+2] Revert "depool text@esams before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1042265 (https://phabricator.wikimedia.org/T366466) (owner: 10Vgutierrez) [14:02:22] !log repool text@esams with IPIP encapsulation enabled - T366466 [14:02:25] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9884486 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin1002 for host moss-be1003.eqiad.wmnet with OS bookworm [14:02:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:26] T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466 [14:02:50] Amir1, godog, herron, jhathaway ^^ [14:02:57] (03CR) 10Eevans: aqs-http-gateway: allow cross-DC Cassandra client connection (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041235 
(https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [14:02:58] ack [14:03:00] kk thanks [14:03:06] 👍 [14:03:58] 10ops-eqiad, 06DC-Ops, 10decommission-hardware: decommission stat100[4-7].eqiad.wmnet - https://phabricator.wikimedia.org/T367147#9884493 (10BTullis) a:05BTullis→03Jclark-ctr [14:04:23] (03Merged) 10jenkins-bot: Global update of test-service-checker template [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032714 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [14:04:36] <_joe_> jouncebot: now [14:04:36] For the next 0 hour(s) and 55 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T1400) [14:04:42] <_joe_> jouncebot: nowandnext [14:04:42] For the next 0 hour(s) and 55 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T1400) [14:04:43] In 2 hour(s) and 55 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T1700) [14:04:58] <_joe_> uhm [14:06:01] _joe_: Do you need to deploy? [14:06:15] <_joe_> James_F: I would like not to wait for my window :D [14:06:23] <_joe_> it's not that I /need/ [14:06:45] _joe_: Can you be done in 10? :-) [14:07:00] !log fnegri@cumin1002 START - Cookbook sre.hosts.reboot-single for host clouddb1020.eqiad.wmnet [14:07:04] <_joe_> yeah the change is just this https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1041656 [14:07:12] Go for it. [14:07:20] If Lucas_WMDE is done. 
[14:07:30] <_joe_> ack, thanks for lending me your window :) [14:08:00] RECOVERY - OSPF status on cr1-magru is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:08:11] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [software/bitu] - 10https://gerrit.wikimedia.org/r/1038778 (https://phabricator.wikimedia.org/T366525) (owner: 10Slyngshede) [14:08:11] <_joe_> Lucas_WMDE: lmk when you're done :) [14:08:26] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [14:08:26] RECOVERY - BFD status on cr2-eqiad is OK: UP: 25 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [14:08:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P64713 and previous config saved to /var/cache/conftool/dbconfig/20240612-140827-marostegui.json [14:08:56] RECOVERY - BGP status on cr1-magru is OK: BGP OK - up: 13, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [14:08:59] I'm also in the queue, _joe_ you go first [14:09:21] <_joe_> ottomata: are you sure? I am not in a rush [14:09:36] hm, okay I'll go first. I have meetings starting in 50 would love to get it done before then ;) [14:09:38] ty! [14:09:48] ottomata: Hang on, I have the window for a reason. 
:-) [14:09:51] so many deployments, so little time :( [14:09:58] James_F: for sure for sure [14:09:59] (sorry, was afk for a few minutes) [14:10:20] if it doesn't happen today for me it will be okay too [14:10:27] not a real rush either [14:10:41] I also still have a config change that I didn’t even schedule yet, maybe I can squeeze that in after the wikifunctions window [14:10:42] or tomorrow… [14:10:42] !log installing libarchive security updates [14:10:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:46] (03PS1) 10Majavah: openstack: wikitech: Stop setting writable LDAP credentials [puppet] - 10https://gerrit.wikimedia.org/r/1042267 (https://phabricator.wikimedia.org/T367287) [14:12:03] (03CR) 10JMeybohm: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042256 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [14:12:08] (phpunit at 90%…) [14:12:12] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2904/co" [puppet] - 10https://gerrit.wikimedia.org/r/1042267 (https://phabricator.wikimedia.org/T367287) (owner: 10Majavah) [14:12:21] Do you mean php-fpm restart? [14:12:25] <_joe_> James_F: go on with your deployment [14:12:28] Ack. [14:12:29] no, it’s still in the gate-and-submit [14:12:41] so I *think* scap hasn’t even taken the lock yet? [14:12:43] <_joe_> I mean, I will not make you wait longer [14:12:53] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-05-31-163732 to 2024-06-11-161031 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042268 (https://phabricator.wikimedia.org/T359233) [14:12:54] <_joe_> after Lucas is done [14:12:58] Ack. 
[14:13:07] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-06-05-003919 to 2024-06-11-223956 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042269 (https://phabricator.wikimedia.org/T359233) [14:13:13] <_joe_> but also I think you can deploy wikifunctions while mw gets deployed [14:13:15] <_joe_> IMHO [14:13:18] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade evaluators from 2024-05-31-163732 to 2024-06-11-161031 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042268 (https://phabricator.wikimedia.org/T359233) (owner: 10Jforrester) [14:13:23] Ack. [14:13:29] fine by me [14:13:51] (03CR) 10Scott French: aqs-http-gateway: allow cross-DC Cassandra client connection (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [14:14:05] (03CR) 10CDanis: "I think this is probably fine, although I notice in the pcc results that it never causes any diffs -- all the diffs there are due to PCC p" [puppet] - 10https://gerrit.wikimedia.org/r/1037531 (owner: 10Kosta Harlan) [14:14:08] (03Merged) 10jenkins-bot: Only register EntitySchema namespace when feature is enabled [extensions/EntitySchema] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1041679 (https://phabricator.wikimedia.org/T363153) (owner: 10Lucas Werkmeister (WMDE)) [14:14:29] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-05-31-163732 to 2024-06-11-161031 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042268 (https://phabricator.wikimedia.org/T359233) (owner: 10Jforrester) [14:14:31] now scap is starting to sync [14:14:42] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1041677|Allow loading EntitySchema on client (only) wikis (T363153)]], [[gerrit:1041679|Only register EntitySchema namespace when feature is enabled (T363153)]] [14:14:46] T363153: [ES-M2]: Load EntitySchema data type registration for WikibaseClient on client 
wikis - https://phabricator.wikimedia.org/T363153 [14:14:47] (03CR) 10CDanis: [C:03+1] geoip: Download GeoLite2 ASN file [puppet] - 10https://gerrit.wikimedia.org/r/1037531 (owner: 10Kosta Harlan) [14:15:12] !log fnegri@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host clouddb1020.eqiad.wmnet [14:15:21] (03CR) 10Scott French: [C:03+1] toolhub: Add missing securityContext to CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042256 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [14:15:42] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:15:53] (03PS7) 10CDanis: geoip: Download GeoLite2 ASN file [puppet] - 10https://gerrit.wikimedia.org/r/1037531 (owner: 10Kosta Harlan) [14:16:25] (03CR) 10JMeybohm: [C:03+2] toolhub: Add missing securityContext to CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042256 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [14:16:37] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9884583 (10Jhancock.wm) a:03Jhancock.wm [14:16:50] (03CR) 10CDanis: [C:03+2] geoip: Download GeoLite2 ASN file [puppet] - 10https://gerrit.wikimedia.org/r/1037531 (owner: 10Kosta Harlan) [14:17:14] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:1041677|Allow loading EntitySchema on client (only) wikis (T363153)]], [[gerrit:1041679|Only register EntitySchema namespace when feature is enabled (T363153)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:17:19] testing [14:17:20] 10ops-codfw, 06SRE, 06Data-Persistence, 06DC-Ops: Q#:rack/setup/install dbproxy200[5-8] - https://phabricator.wikimedia.org/T362824#9884618 (10Jhancock.wm) a:03Jhancock.wm [14:17:24] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:17:25] (03CR) 10CDanis: 
[C:03+2] maxmind: Fix parameter order and document user_id/license_key defaults [puppet] - 10https://gerrit.wikimedia.org/r/1037767 (owner: 10Kosta Harlan) [14:17:26] (03Merged) 10jenkins-bot: toolhub: Add missing securityContext to CronJob [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042256 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [14:17:54] (03CR) 10JMeybohm: [C:03+1] kserve-inference: add securityContext explicit config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026954 (https://phabricator.wikimedia.org/T362978) (owner: 10Elukey) [14:18:52] looks okay as far as I can tell, syncing [14:18:54] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync [14:19:10] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/toolhub: apply [14:19:24] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/toolhub: apply [14:19:30] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/toolhub: apply [14:19:54] (03CR) 10Btullis: [C:03+1] helmfile: set HELM environment variables for the admin-ng systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/1042224 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:19:59] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:20:20] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/toolhub: apply [14:20:38] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/toolhub: apply [14:20:50] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=s8 [14:20:53] !log fnegri@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1020.eqiad.wmnet,service=s5 [14:20:55] (03Abandoned) 10Hashar: Ignore mediawiki/tools/cli for gerrit replication [puppet] - 10https://gerrit.wikimedia.org/r/1029212 (https://phabricator.wikimedia.org/T333029) (owner: 10Addshore) [14:21:40] !log jmm@cumin2002 START - Cookbook 
sre.hosts.reboot-single for host ganeti1038.eqiad.wmnet [14:21:56] !log mvernon@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on moss-be1003.eqiad.wmnet with reason: host reimage [14:22:02] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/toolhub: apply [14:22:08] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:22:13] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:23:03] FIRING: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:23:11] <_joe_> uh [14:23:15] <_joe_> that isn't good [14:23:15] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1042267 (https://phabricator.wikimedia.org/T367287) (owner: 10Majavah) [14:23:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T367261)', diff saved to https://phabricator.wikimedia.org/P64714 and previous config saved to /var/cache/conftool/dbconfig/20240612-142335-marostegui.json [14:23:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [14:23:39] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [14:23:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2156.codfw.wmnet with reason: Maintenance [14:23:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [14:24:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2186.codfw.wmnet 
with reason: Maintenance [14:24:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T367261)', diff saved to https://phabricator.wikimedia.org/P64715 and previous config saved to /var/cache/conftool/dbconfig/20240612-142412-marostegui.json [14:24:16] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:24:24] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on moss-be1003.eqiad.wmnet with reason: host reimage [14:24:27] _joe_: I'm roll rebooting the cluster [14:24:48] <_joe_> claime: yeah I was looking at SAL after seeing the patterns in grafana :) [14:24:57] (03CR) 10Brouberol: [C:03+2] helmfile: set HELM environment variables for the admin-ng systemd jobs [puppet] - 10https://gerrit.wikimedia.org/r/1042224 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:24:58] (k8s deployment progress last printed “97% (ok: 2062; fail: 0; left: 46)” this time btw, about that artifact earlier ^^) [14:25:14] (03CR) 10Jforrester: [C:03+2] wikifunctions: Upgrade orchestrator from 2024-06-05-003919 to 2024-06-11-223956 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042269 (https://phabricator.wikimedia.org/T359233) (owner: 10Jforrester) [14:26:08] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-06-05-003919 to 2024-06-11-223956 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042269 (https://phabricator.wikimedia.org/T359233) (owner: 10Jforrester) [14:27:03] (03CR) 10Clément Goubert: [C:03+2] trafficserver: move 95% of traffic to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1042205 (https://phabricator.wikimedia.org/T362323) (owner: 10Clément Goubert) [14:27:13] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [14:27:15] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1041677|Allow loading EntitySchema on client (only) wikis (T363153)]], 
[[gerrit:1041679|Only register EntitySchema namespace when feature is enabled (T363153)]] (duration: 12m 32s) [14:27:19] T363153: [ES-M2]: Load EntitySchema data type registration for WikibaseClient on client wikis - https://phabricator.wikimedia.org/T363153 [14:27:20] !log trafficserver: move 95% of traffic to mw-on-k8s [14:27:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:53] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [14:28:03] RESOLVED: KafkaUnderReplicatedPartitions: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [14:28:10] (03PS1) 10Eevans: cassandra: alternate logging hostname definition [puppet] - 10https://gerrit.wikimedia.org/r/1042273 (https://phabricator.wikimedia.org/T350567) [14:28:15] FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:28:18] * Lucas_WMDE done [14:28:23] fyi James_F, ottomata, _joe_ [14:28:25] !log jforrester@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [14:28:27] Thanks. [14:28:35] (and about to jump in a meeting so I won’t be the judge for whose turn it is now) [14:28:36] <_joe_> Thanks [14:28:43] sorry for trampling over half your window :( [14:28:50] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1042273 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans) [14:29:02] Such is scap, destroyer of worlds. 
[14:29:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1038.eqiad.wmnet [14:29:26] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [software/bitu] - 10https://gerrit.wikimedia.org/r/1042261 (owner: 10Slyngshede) [14:29:38] <_joe_> James_F: s/scap/ci/ [14:29:41] !log jforrester@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [14:29:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1038.eqiad.wmnet [14:29:53] !log jforrester@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [14:30:04] <_joe_> James_F: can I go on with my deployment? [14:30:13] _joe_: Sure, we won't clash. [14:30:20] <_joe_> <3 [14:31:01] <_joe_> ottomata: still around? you still have way if you want :) [14:31:09] !log jforrester@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [14:31:22] !log kamila@cumin1002 START - Cookbook sre.dns.netbox [14:31:35] !log installing gst-plugins-base1.0 security updates [14:31:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:48] (03CR) 10Btullis: [C:03+1] helmfile: set HELM environment variables for the admin-ng systemd jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1042224 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:31:50] (Done with our window.) 
[14:32:33] <_joe_> ok, I guess I'll proceed :) [14:32:42] (03PS1) 10Kamila Součková: Revert^2 "Add wikikube-ctrl1003 to server SRV record for etcd 4" [dns] - 10https://gerrit.wikimedia.org/r/1042276 [14:32:59] (03PS2) 10Kamila Součková: Revert^2 "Add wikikube-ctrl1003 to server SRV record for etcd 4" [dns] - 10https://gerrit.wikimedia.org/r/1042276 [14:33:15] RESOLVED: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [14:33:32] (03CR) 10TrainBranchBot: [C:03+2] "Approved by oblivian@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041656 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [14:33:51] (03CR) 10CI reject: [V:04-1] Revert^2 "Add wikikube-ctrl1003 to server SRV record for etcd 4" [dns] - 10https://gerrit.wikimedia.org/r/1042276 (owner: 10Kamila Součková) [14:33:53] (03PS3) 10CDanis: otelcol: filter out sessionstore user IDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039292 (https://phabricator.wikimedia.org/T366750) [14:33:53] (03PS3) 10CDanis: otelcol: filter common healthcheck spans [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039297 (https://phabricator.wikimedia.org/T366750) [14:34:09] (03Merged) 10jenkins-bot: Use the statsd-exporter service where available [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041656 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [14:34:13] !log kamila@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved wikikube-ctrl1003 to a new rack - kamila@cumin1002" [14:34:40] !log oblivian@deploy1002 Started scap: Backport for [[gerrit:1041656|Use the statsd-exporter service where available (T365265)]] [14:34:44] T365265: Create a per-release deployment of statsd-exporter for 
mw-on-k8s - https://phabricator.wikimedia.org/T365265 [14:34:47] !log failover ganeti master in eqiad to ganeti1028 [14:34:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Moved wikikube-ctrl1003 to a new rack - kamila@cumin1002" [14:35:40] !log kamila@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:36:36] god dammit [14:36:40] I think I might have to revert my backports [14:36:41] !log kamila@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host wikikube-ctrl1003 [14:36:45] lots of errors in logstash [14:37:12] !log oblivian@deploy1002 oblivian: Backport for [[gerrit:1041656|Use the statsd-exporter service where available (T365265)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:37:23] (03PS1) 10Ottomata: profile::cache::kafka::eventlogging - add ensure param [puppet] - 10https://gerrit.wikimedia.org/r/1042278 (https://phabricator.wikimedia.org/T238230) [14:37:28] PROBLEM - ganeti-wconfd running on ganeti1027 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 113 (gnt-masterd), command name ganeti-wconfd https://wikitech.wikimedia.org/wiki/Ganeti [14:37:34] _joe_, ottomata: can I revert or are you deploying? [14:37:34] Lucas_WMDE: :( [14:37:38] i am not deploying [14:37:40] _joe_ is deploying [14:37:42] i will wait til things are clear [14:37:44] (03CR) 10CI reject: [V:04-1] profile::cache::kafka::eventlogging - add ensure param [puppet] - 10https://gerrit.wikimedia.org/r/1042278 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [14:37:54] (03PS2) 10Ottomata: profile::cache::kafka::eventlogging - add ensure param [puppet] - 10https://gerrit.wikimedia.org/r/1042278 (https://phabricator.wikimedia.org/T238230) [14:37:57] <_joe_> Lucas_WMDE: should I abort my deployment? 
[14:37:57] !log kamila@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host wikikube-ctrl1003 [14:38:07] <_joe_> I'm past the testservers, it will take 3 minutes tops [14:38:11] I’m not sure how bad the errors are so far [14:38:12] !log oblivian@deploy1002 oblivian: Continuing with sync [14:38:15] (03CR) 10CI reject: [V:04-1] profile::cache::kafka::eventlogging - add ensure param [puppet] - 10https://gerrit.wikimedia.org/r/1042278 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [14:38:18] let it run, I think [14:38:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T367261)', diff saved to https://phabricator.wikimedia.org/P64716 and previous config saved to /var/cache/conftool/dbconfig/20240612-143830-marostegui.json [14:38:34] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [14:38:45] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:55] (03CR) 10Kamila Součková: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1042276 (owner: 10Kamila Součková) [14:39:04] <_joe_> Lucas_WMDE: you can start merging the revert I guess given it will take forever in CI [14:39:13] I was thinking about force-merging that tbh [14:39:19] but mainly I was going to try scap backport --revert [14:39:20] <_joe_> the downside is not using scap backport [14:39:23] and I don’t know if that waits for CI or not [14:39:26] <_joe_> ah ok [14:39:28] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T367253#9884875 (10VRiley-WMF) a:03VRiley-WMF [14:39:29] I don’t think I’ve used --revert before [14:39:35] <_joe_> I never had [14:39:36] (03PS3) 10Ottomata: profile::cache::kafka::eventlogging - add 
ensure param [puppet] - 10https://gerrit.wikimedia.org/r/1042278 (https://phabricator.wikimedia.org/T238230) [14:39:45] ideally, it’ll deploy the revert first? but idk [14:39:46] <_joe_> jnuche: maybe you know the answer? ^^ [14:39:50] let me know when you’re done please [14:40:03] so far I don’t see a dip in edits so I don’t think it’s the end of the world [14:40:06] <_joe_> I'm past the canaries [14:40:21] but at least one user saw the API exception already so it’s not no-impact either [14:40:43] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T367253#9884889 (10VRiley-WMF) Reseated cable. Nominal operation restored. [14:40:48] 10ops-eqiad, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T367253#9884890 (10VRiley-WMF) 05Open→03Resolved [14:41:25] FIRING: SystemdUnitFailed: ferm.service on mw1453:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:41:27] (03CR) 10Kamila Součková: [C:03+2] Revert^2 "Add wikikube-ctrl1003 to server SRV record for etcd 4" [dns] - 10https://gerrit.wikimedia.org/r/1042276 (owner: 10Kamila Součková) [14:41:46] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9884904 (10BCornwall) [14:42:33] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission stat100[4-7].eqiad.wmnet - https://phabricator.wikimedia.org/T367147#9884905 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [14:42:52] I should’ve looked at logstash when I was testing the backport :/ [14:43:10] but it’s also a bit surprising that the canaries didn’t catch this [14:43:28] (logspam watch says 4740+779+779 errors) [14:43:45] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:43:51] Lucas_WMDE, _joe_: --revert will create its own patch, so if you want to speed things up you're probably better off creating your revert patch manually, forcing it to merge and then backporting [14:44:03] alright, thanks [14:44:06] let me start that then [14:44:25] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Only register EntitySchema namespace when feature is enabled" [extensions/EntitySchema] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1042283 [14:44:25] <_joe_> Lucas_WMDE: we're a few php restarts from me being done anyways [14:44:26] !log mvernon@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host moss-be1003.eqiad.wmnet with OS bookworm [14:44:29] <_joe_> jnuche: thanks <3 [14:44:36] (03PS1) 10Lucas Werkmeister (WMDE): Revert "Allow loading EntitySchema on client (only) wikis" [extensions/EntitySchema] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1042284 [14:44:39] 10SRE-swift-storage, 13Patch-For-Review: Set up Misc Object Storage Service (moss) - https://phabricator.wikimedia.org/T279621#9884942 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin1002 for host moss-be1003.eqiad.wmnet with OS bookworm completed: - moss-be1003 (**PASS**)... 
[14:44:43] (03CR) 10CI reject: [V:04-1] Revert "Allow loading EntitySchema on client (only) wikis" [extensions/EntitySchema] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1042284 (owner: 10Lucas Werkmeister (WMDE)) [14:44:49] (03PS2) 10Lucas Werkmeister (WMDE): Revert "Allow loading EntitySchema on client (only) wikis" [extensions/EntitySchema] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1042284 [14:46:25] FIRING: [6x] SystemdUnitFailed: ferm.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:46:32] (03CR) 10Ottomata: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1042278 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [14:46:45] !log oblivian@deploy1002 Finished scap: Backport for [[gerrit:1041656|Use the statsd-exporter service where available (T365265)]] (duration: 12m 05s) [14:46:49] T365265: Create a per-release deployment of statsd-exporter for mw-on-k8s - https://phabricator.wikimedia.org/T365265 [14:46:50] (03CR) 10Lucas Werkmeister (WMDE): [V:03+2 C:03+2] "Force-merging to expedite deployment." [extensions/EntitySchema] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1042283 (owner: 10Lucas Werkmeister (WMDE)) [14:46:54] (03CR) 10Lucas Werkmeister (WMDE): [V:03+2 C:03+2] "Force-merging to expedite deployment. 
(I checked locally that the rebased revert is clean – there’s no diff between this and `@~4`.)" [extensions/EntitySchema] (wmf/1.43.0-wmf.8) - 10https://gerrit.wikimedia.org/r/1042284 (owner: 10Lucas Werkmeister (WMDE)) [14:47:12] scap is running [14:47:32] <_joe_> Lucas_WMDE: heh we don't give our deployment pipelines one second to catch a breath :D [14:47:36] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1042283|Revert "Only register EntitySchema namespace when feature is enabled"]], [[gerrit:1042284|Revert "Allow loading EntitySchema on client (only) wikis"]] [14:47:37] :D [14:47:49] isn’t that what continuous deployment means? [14:49:25] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:49:38] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:50:04] !log cgoubert@cumin1002 END (PASS) - Cookbook sre.kafka.roll-restart-reboot-brokers (exit_code=0) rolling reboot on A:kafka-main-eqiad [14:50:06] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:1042283|Revert "Only register EntitySchema namespace when feature is enabled"]], [[gerrit:1042284|Revert "Allow loading EntitySchema on client (only) wikis"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:51:12] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync [14:51:18] don’t think I can test this easily, so just syncing [14:51:29] the errors I see in logstash all come from Special:OAuth/initiate [14:51:40] but I don’t have a tool handy that happens to send X-Wikimedia-Debug with its requests to that [14:51:41] (03PS1) 10Brouberol: helmfile: don't schedule admin-ng diff check jobs for the staging k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1042285 (https://phabricator.wikimedia.org/T331894) [14:51:43] (03PS1) 10Brouberol: helmfile: remove temporary else block once resources were absented [puppet] - 10https://gerrit.wikimedia.org/r/1042286 
(https://phabricator.wikimedia.org/T331894) [14:51:55] and just loading it without any of the magic OAuth URL params isn’t enough to cause the error AFAICT [14:52:28] syncing canaries now… [14:52:45] (03PS2) 10Brouberol: helmfile: don't schedule admin-ng diff check jobs for the staging k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1042285 (https://phabricator.wikimedia.org/T331894) [14:52:45] (03PS2) 10Brouberol: helmfile: remove temporary else block once resources were absented [puppet] - 10https://gerrit.wikimedia.org/r/1042286 (https://phabricator.wikimedia.org/T331894) [14:52:52] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1042285 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:53:11] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:53:25] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:53:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P64717 and previous config saved to /var/cache/conftool/dbconfig/20240612-145337-marostegui.json [14:54:00] (03CR) 10Giuseppe Lavagetto: [C:03+1] dump_cloud_ip_ranges: This is owned by all of SRE [puppet] - 10https://gerrit.wikimedia.org/r/1042161 (owner: 10Clément Goubert) [14:54:19] it made it past the canaries [14:55:26] (03CR) 10Clément Goubert: [C:03+2] dump_cloud_ip_ranges: This is owned by all of SRE [puppet] - 10https://gerrit.wikimedia.org/r/1042161 (owner: 10Clément Goubert) [14:55:46] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:57:04] (03PS2) 10BCornwall: depool ulsfo for text cluster drive upgrade [dns] - 10https://gerrit.wikimedia.org/r/1041689 
(https://phabricator.wikimedia.org/T364891) [14:57:18] (03CR) 10BCornwall: [C:03+2] depool ulsfo for text cluster drive upgrade [dns] - 10https://gerrit.wikimedia.org/r/1041689 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall) [14:57:50] (03CR) 10Ottomata: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1042278 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [14:57:53] (03PS1) 10Volans: CHANGELOG: add changelogs for release v8.6.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1042288 [14:57:54] (03PS1) 10Muehlenhoff: Extend access for edtadros [puppet] - 10https://gerrit.wikimedia.org/r/1042287 [14:58:16] (03CR) 10Volans: [C:03+2] CHANGELOG: add changelogs for release v8.6.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1042288 (owner: 10Volans) [15:00:07] (03CR) 10Muehlenhoff: [C:03+2] Extend access for edtadros [puppet] - 10https://gerrit.wikimedia.org/r/1042287 (owner: 10Muehlenhoff) [15:00:12] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1042283|Revert "Only register EntitySchema namespace when feature is enabled"]], [[gerrit:1042284|Revert "Allow loading EntitySchema on client (only) wikis"]] (duration: 12m 36s) [15:00:16] * Lucas_WMDE looks at logstash [15:00:35] looks… better, I think [15:00:43] !log Depooling ulsfo in preparation for A:cp-text downtime/poweroff for nvme upgrades (T364891) [15:00:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:49] T364891: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891 [15:00:54] lots of “could not enqueue jobs”, no idea if that’s related or not [15:01:02] ugh eventgate again [15:01:08] but the errors that were definitely my fault seem to have stopped [15:01:15] FIRING: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:01:15] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye [15:01:17] (03CR) 10Bartosz Dziewoński: [C:03+1] tests: Replace yaml_parse_file with symfony/yaml [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042266 (owner: 10Gergő Tisza) [15:01:23] I think it's not liking the deploys actually [15:01:25] FIRING: [10x] SystemdUnitFailed: ferm.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:01:39] re eventgate, we are hoping to add some instrumentation to eventbus soon: https://phabricator.wikimedia.org/T363587 [15:01:42] Maybe we need to raise the number of containers [15:01:55] just the fact that deploys were happening, you mean? [15:01:57] might help us with some understanding around 5xx error counts [15:02:15] Lucas_WMDE: possibly all the reconnections in a short amount of time [15:02:31] !log authdns-update run on dns1004 [15:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:34] !log T364907 💙cdanis@apt1002.wikimedia.org ~ 🕚☕ sudo -i reprepro --keepunreferencedfiles includedeb bullseye-wikimedia ~/otelcol-contrib_0.102.0_linux_amd64.deb [15:02:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:38] T364907: upgrade to latest stable version of otelcol-contrib - https://phabricator.wikimedia.org/T364907 [15:03:17] (03CR) 10Andrea Denisse: [C:03+2] conftool: Integrate logstash with active-passive configuration [puppet] - 10https://gerrit.wikimedia.org/r/1039406 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [15:03:21] so, in this case, eventgate was not restarted here, right? this is just correlated with a MW deploy? 
[15:03:35] (03PS1) 10CDanis: otelcol: bump to v0.102.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1042291 (https://phabricator.wikimedia.org/T364907) [15:03:41] ottomata: yes [15:03:58] (03PS1) 10Muehlenhoff: Extend access for jhancock [puppet] - 10https://gerrit.wikimedia.org/r/1042292 [15:04:00] is it possible some envoy in the chain is just backed up? [15:04:06] (03CR) 10CDanis: [V:03+2 C:03+2] otelcol: bump to v0.102.0 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1042291 (https://phabricator.wikimedia.org/T364907) (owner: 10CDanis) [15:04:12] eventgate itself doesn't seem to be affected or notice these errors? [15:04:16] * Lucas_WMDE done deploying btw (hopefully) [15:05:16] (03Merged) 10jenkins-bot: CHANGELOG: add changelogs for release v8.6.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/1042288 (owner: 10Volans) [15:05:16] ottomata: we notice it from the upstream servers [15:05:58] (03CR) 10Muehlenhoff: [C:03+2] Extend access for jhancock [puppet] - 10https://gerrit.wikimedia.org/r/1042292 (owner: 10Muehlenhoff) [15:06:07] upstream servers meaning MW request? or envoy proxy handling request for MW? 
[15:06:15] RESOLVED: [3x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [15:06:24] (03PS4) 10Brouberol: helmfile: don't schedule admin-ng diff check jobs for the staging k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1042285 (https://phabricator.wikimedia.org/T331894) [15:06:25] FIRING: [11x] SystemdUnitFailed: ferm.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:06:44] ottomata: envoy proxy handling the request for MW [15:06:45] (03PS4) 10Brouberol: helmfile: remove temporary else block once resources were absented [puppet] - 10https://gerrit.wikimedia.org/r/1042286 (https://phabricator.wikimedia.org/T331894) [15:08:05] interpreting https://grafana-rw.wikimedia.org/d/VTCkm29Wz/envoy-telemetry?orgId=1&var-datasource=eqiad+prometheus%2Fops&var-origin=All&var-destination=eventgate-main&viewPanel=26&from=1718201277993&to=1718204877994 [15:08:15] those are both reported by MW's envoy proxy? [15:08:41] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1042286 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [15:08:44] 06SRE, 06Infrastructure-Foundations, 13Patch-For-Review: Move the ping* servers to Bookworm - https://phabricator.wikimedia.org/T366695#9885044 (10MoritzMuehlenhoff) The routers in eqiad have been reconfigured to use ping1004 (confirmed with tcpdump) instead of ping1003. I'll decom the old nodes on Friday. 
[15:08:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P64718 and previous config saved to /var/cache/conftool/dbconfig/20240612-150844-marostegui.json [15:08:47] (03PS4) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [15:09:02] 06SRE, 10Cassandra, 06Data-Persistence, 13Patch-For-Review: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567#9885045 (10Eevans) >>! In T350567#9884253, @MoritzMuehlenhoff wrote: > Can you please uninstall openjdk-8-* on the migrated clusters? (simply run dpkg --remove openjdk-8-jdk... [15:09:24] (03CR) 10Bking: dse-k8s-services: Add net-new chart for Airflow (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [15:09:31] ottomata: yes [15:09:47] specifically those are bare metal [15:09:55] we see the same for k8s [15:11:15] are the destroyed connections just because MW is being restarted and there are in flight requests? 
[15:11:25] FIRING: [11x] SystemdUnitFailed: ferm.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:11:40] (thank you for patience with naive questions, am just curious to know more) [15:11:45] ottomata: That shouldn't result in failed to enqueue jobs errors [15:12:06] envoy gets redeployed with mw [15:12:08] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye [15:12:16] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, and 2 others: eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9885068 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye exe... [15:12:27] yes. what if envoy is shut down before MW PHP proc, and MW is able to catch the 500 and log it before it shuts down? [15:12:29] (03PS1) 10Volans: Upstream release v8.6.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1042294 [15:12:30] !log kamila@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1003'] [15:12:46] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1027.eqiad.wmnet [15:12:50] (03CR) 10Volans: [C:03+2] Upstream release v8.6.0 [software/spicerack] (debian) - 10https://gerrit.wikimedia.org/r/1042294 (owner: 10Volans) [15:14:36] 06SRE, 10Cassandra, 06Data-Persistence, 13Patch-For-Review: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567#9885080 (10Eevans) [15:14:49] ottomata: it is shut down before the php process, but it's also shutting down the entry point, requests to mediawiki go through the same envoy. It wouldn't explain the long tail of connect fail either. 
[15:14:50] claime: I ask because I think this was(?) a problem on the eventgate side https://phabricator.wikimedia.org/T249745#9313482 [15:15:15] (03CR) 10JMeybohm: helmfile: don't schedule admin-ng diff check jobs for the staging k8s cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1042285 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [15:15:15] (03PS1) 10Brouberol: hemlfile: export admin-ng pending diff metrics hourly [puppet] - 10https://gerrit.wikimedia.org/r/1042296 (https://phabricator.wikimedia.org/T331894) [15:15:29] I think it's the ferm issue again actually [15:15:37] hm, long tail right. in the long tail case, are we sure the failures come from the newly restarted MW? [15:15:51] if so then yeah we can throw my hypothesis out :) [15:16:10] it again coincides with a bunch of ferm reload problems [15:16:25] RESOLVED: [11x] SystemdUnitFailed: ferm.service on kubernetes1031:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:16:41] which are not caused by mediawiki deployments, but by changes to the k8s infrastructure [15:16:42] oh, interesting. how would ferm reload be related? Is it just a coincidence? [15:16:45] That's a problem [15:16:53] hm [15:16:57] ottomata: iptables does a big part of the routing in k8s [15:17:19] and reloads are happening to support routing after deployments/services/pods change? 
[15:18:00] No, but when we do changes to the hosts list yeah [15:18:09] and it correlates with some work on the control plane [15:18:22] (03PS1) 10Giuseppe Lavagetto: statsd-exporter: allow changing service port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042297 [15:18:22] (03PS1) 10Giuseppe Lavagetto: mw-debug: fix exposed port for statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042298 [15:18:23] yesterday it coincided with me reimaging some servers there [15:18:34] (03CR) 10Santiago Faci: [C:03+1] aqs-http-gateway: allow cross-DC Cassandra client connection (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [15:18:40] And in the task you link j.ayme saw the same correlation [15:18:52] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9885106 (10VRiley-WMF) @Eevans No problem at all. Let us know when there is a good time to try and swap this. Since it's out of warranty, we will have to pull one from a decommissioned one. Tha... 
[15:19:12] (03CR) 10Santiago Faci: [C:03+1] aqs-http-gateway: allow cross-DC Cassandra client connection (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [15:19:55] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9885115 (10VRiley-WMF) [15:20:21] I don't know why eventgate would be more sensitive to this than other applications though [15:20:44] I've restarted the failed ferms, we'll see how the errors go [15:21:22] (03PS4) 10Scott French: aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851) [15:23:31] (03CR) 10Filippo Giunchedi: [C:03+1] otelcol: filter out sessionstore user IDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039292 (https://phabricator.wikimedia.org/T366750) (owner: 10CDanis) [15:23:49] !log kamila@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-ctrl1003'] [15:23:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T367261)', diff saved to https://phabricator.wikimedia.org/P64719 and previous config saved to /var/cache/conftool/dbconfig/20240612-152351-marostegui.json [15:23:54] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [15:23:56] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [15:23:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2177.codfw.wmnet with reason: Maintenance [15:24:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2177 (T367261)', diff saved to https://phabricator.wikimedia.org/P64720 and previous config saved to 
/var/cache/conftool/dbconfig/20240612-152403-marostegui.json [15:24:09] claime: interesting. yeah why would eventgate be special. hm. Are the requests taking longer than most? I doubt it? https://github.com/wikimedia/operations-puppet/blob/production/modules/profile/manifests/wmcs/striker/docker.pp#L28 [15:24:13] about 1ms average? [15:24:27] no less [15:24:36] the POST is 111microseconds [15:24:50] !log kamila@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['wikikube-ctrl1003'] [15:25:08] !log kamila@cumin1002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['wikikube-ctrl1003'] [15:25:23] !log uploaded spicerack_8.6.0 to apt.wikimedia.org bullseye-wikimedia [15:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:32] (03CR) 10Scott French: aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [15:25:41] I'll roll restart eventgate, the errors are not going down [15:26:59] (03PS5) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [15:27:19] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: sync [15:27:29] (03CR) 10Eevans: [C:03+1] aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [15:27:39] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: sync [15:28:17] this is very strange. 
[15:28:22] !log denisse@cumin2002 conftool action : set/pooled=true; selector: dnsdisc=logstash,name=eqiad [15:28:28] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1027.eqiad.wmnet [15:28:31] extremely [15:28:45] the spike in error is the roll restart [15:29:11] (03CR) 10Andrea Denisse: [C:03+2] discovery: Add metafo entry for logstash [dns] - 10https://gerrit.wikimedia.org/r/1039882 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [15:29:13] ideally roll restart would not result in errors, right? [15:29:20] (03PS4) 10Cwhite: discovery: Add metafo entry for logstash [dns] - 10https://gerrit.wikimedia.org/r/1039882 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [15:29:21] ideally yes [15:29:23] inflight requests would be given time to finish [15:29:23] yeah [15:29:24] okay [15:29:24] (03CR) 10Andrea Denisse: [V:03+2 C:03+2] discovery: Add metafo entry for logstash [dns] - 10https://gerrit.wikimedia.org/r/1039882 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [15:29:55] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9885191 (10Eevans) >>! In T362033#9885106, @VRiley-WMF wrote: > @Eevans > > No problem at all. Let us know when there is a good time to try and swap this. Since it's out of warranty, we will ha... [15:30:27] so um, I want to do a simple mw-config deployment. [15:30:33] https://deploy-commands.toolforge.org/bacc/1041115 [15:30:51] should I be scared i'll cause this again @claime ? 
[15:31:14] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [15:31:36] (03CR) 10Andrea Denisse: [C:03+2] traffic: Route logstash.w.o to logstash.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1039887 (https://phabricator.wikimedia.org/T356386) (owner: 10Andrea Denisse) [15:31:55] ottomata: it's still happening anyways, so do your deployment :) [15:33:40] FIRING: SystemdUnitFailed: ferm.service on parse1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:33:48] (03PS1) 10Filippo Giunchedi: logging: add logstash.discovery.wmnet to alt names [puppet] - 10https://gerrit.wikimedia.org/r/1042299 (https://phabricator.wikimedia.org/T356386) [15:33:51] claime: if it is related to ferm reloads and it does not fix itself after restarts are done...would there have to be some deeper thing going on here? unrefreshed routing stuff in k8s somewhere? But indeed, why eventgate only? [15:33:54] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding sretest2001 to codfw - jhancock@cumin2002" [15:33:55] FIRING: [6x] SystemdUnitFailed: ferm.service on mw1356:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:34:29] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1027.eqiad.wmnet [15:34:43] (03CR) 10Andrea Denisse: [C:03+1] "LGTM,thank you!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1042299 (https://phabricator.wikimedia.org/T356386) (owner: 10Filippo Giunchedi) [15:34:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding sretest2001 to codfw - jhancock@cumin2002" [15:34:51] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:34:55] (03CR) 10Andrea Denisse: [C:03+2] logging: add logstash.discovery.wmnet to alt names [puppet] - 10https://gerrit.wikimedia.org/r/1042299 (https://phabricator.wikimedia.org/T356386) (owner: 10Filippo Giunchedi) [15:35:02] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1027.eqiad.wmnet [15:35:35] (03PS1) 10Muehlenhoff: Extend access for dani [puppet] - 10https://gerrit.wikimedia.org/r/1042301 [15:35:40] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9885220 (10Jhancock.wm) @elukey this one [15:35:47] claime: this does also happen for eventgate-analytics, do you restart that too? [15:35:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T352010)', diff saved to https://phabricator.wikimedia.org/P64721 and previous config saved to /var/cache/conftool/dbconfig/20240612-153549-ladsgroup.json [15:35:51] mw does produce to that [15:35:53] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:35:55] ottomata: I did not. [15:36:04] This is doing my head in [15:36:17] let's wait and see if eventgate-analytics fixes itself... 
[15:36:42] (03CR) 10Krinkle: [C:03+2] tests: Replace yaml_parse_file with symfony/yaml (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042266 (owner: 10Gergő Tisza) [15:36:47] (03CR) 10Muehlenhoff: [C:03+2] Extend access for dani [puppet] - 10https://gerrit.wikimedia.org/r/1042301 (owner: 10Muehlenhoff) [15:37:16] (03CR) 10Krinkle: [C:03+2] Delete docroot/noc/createTxtFileSymlinks.sh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034183 (https://phabricator.wikimedia.org/T365514) (owner: 10Reedy) [15:37:21] (03Merged) 10jenkins-bot: tests: Replace yaml_parse_file with symfony/yaml [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042266 (owner: 10Gergő Tisza) [15:37:58] (03Merged) 10jenkins-bot: Delete docroot/noc/createTxtFileSymlinks.sh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034183 (https://phabricator.wikimedia.org/T365514) (owner: 10Reedy) [15:38:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by otto@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041115 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [15:38:40] RESOLVED: SystemdUnitFailed: ferm.service on parse1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:38:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T367261)', diff saved to https://phabricator.wikimedia.org/P64722 and previous config saved to /var/cache/conftool/dbconfig/20240612-153842-marostegui.json [15:38:46] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [15:39:45] (03PS6) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [15:39:45] (03Merged) 10jenkins-bot: Remove 
EventLoggingLegacyConverter code - it has been moved to EventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041115 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [15:40:05] (03CR) 10Bking: dse-k8s-services: Add net-new chart for Airflow (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [15:40:38] (03CR) 10Krinkle: Use the statsd-exporter service where available (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041656 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [15:40:44] (03CR) 10Bking: dse-k8s-services: Add net-new chart for Airflow (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [15:41:26] ottomata: we clashed, sorry. [15:41:33] I looked at the schedule and saw nothing MW related before/after for 2h [15:41:41] sorry Krinkle yeah was much delayed [15:41:49] you are tgr and reedy's stuff? [15:41:58] deploy is asking me to see the diff [15:42:03] (03PS7) 10Bking: dse-k8s-services: Add net-new chart for Airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) [15:42:11] go ahead with yours then. mine don't really need deployment, I just wanted to make sure it was git-pull'ed to avoid surprising the next person [15:42:13] I can abort (how?) [15:42:16] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye [15:42:16] okay [15:42:22] I pulled down your change in my git-pull. [15:42:23] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9885253 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1003.eq... 
[15:42:25] okay [15:42:42] yeah, looks like tests and removal of a docroot/noc file ya? [15:42:47] yep [15:42:51] k, proceeding with all [15:43:11] (03CR) 10JHathaway: [C:03+1] cephadm: template out cephadm spec files [puppet] - 10https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:43:19] !log otto@deploy1002 Started scap: Backport for [[gerrit:1041115|Remove EventLoggingLegacyConverter code - it has been moved to EventLogging (T353817)]] [15:43:23] T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 [15:43:25] (03PS29) 10DCausse: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [15:43:41] (03CR) 10Eevans: "I'm not sure how useful I can be reviewing the specific changes, but it (generally) looks good to me!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039247 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [15:43:58] (03CR) 10DCausse: "Thanks for the suggestions volans!" 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [15:46:09] !log otto@deploy1002 otto: Backport for [[gerrit:1041115|Remove EventLoggingLegacyConverter code - it has been moved to EventLogging (T353817)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:46:40] !log otto@deploy1002 otto: Continuing with sync [15:49:46] FIRING: Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [15:50:57] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P64723 and previous config saved to /var/cache/conftool/dbconfig/20240612-155056-ladsgroup.json [15:52:49] (03CR) 10Krinkle: [C:04-1] "(setting -1 to clear from CR dashboard)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024932 (https://phabricator.wikimedia.org/T350094) (owner: 10Gergő Tisza) [15:53:23] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye [15:53:29] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9885312 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1003.eqiad.... 
[15:53:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P64724 and previous config saved to /var/cache/conftool/dbconfig/20240612-155349-marostegui.json [15:55:38] !log otto@deploy1002 Finished scap: Backport for [[gerrit:1041115|Remove EventLoggingLegacyConverter code - it has been moved to EventLogging (T353817)]] (duration: 12m 19s) [15:55:42] T353817: Create legacy EventLogging proxy HTTP intake (for MediaWikiPingback) endpoint to EventGate - https://phabricator.wikimedia.org/T353817 [15:56:19] (03CR) 10Gergő Tisza: [POC] Handle sso.wikimedia.org domain (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [15:57:40] (03CR) 10Giuseppe Lavagetto: [C:03+2] statsd-exporter: allow changing service port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042297 (owner: 10Giuseppe Lavagetto) [15:58:37] (03Merged) 10jenkins-bot: statsd-exporter: allow changing service port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042297 (owner: 10Giuseppe Lavagetto) [16:00:20] !log kamila@cumin1002 START - Cookbook sre.hosts.reimage for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye [16:00:28] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9885334 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by kamila@cumin1002 for host wikikube-ctrl1003.eq... 
[16:00:31] deploy done [16:01:19] (03CR) 10JHathaway: [C:03+2] mediawiki: Move outbound email to mx-out{1001,2001}.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1041758 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [16:02:01] (03CR) 10JHathaway: [C:03+2] mw: change mail_host [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041763 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [16:02:35] (03PS2) 10Giuseppe Lavagetto: mw-debug: fix exposed port for statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042298 [16:02:35] (03PS1) 10Giuseppe Lavagetto: statsd-exporter: also change the port in the networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042303 [16:03:30] (03Merged) 10jenkins-bot: mw: change mail_host [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041763 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [16:05:50] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:sessionstore [16:06:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P64725 and previous config saved to /var/cache/conftool/dbconfig/20240612-160603-ladsgroup.json [16:06:17] 06SRE, 06Growth-Team, 10GrowthExperiments-Homepage, 07Grafana: Growth team product KPI Grafana dashboard has `update_` task type, which does not exist - https://phabricator.wikimedia.org/T362633#9885350 (10Michael) >>! In T362633#9883647, @fgiunchedi wrote: > {{done}}; resolving though reopen if sth is... 
[16:08:01] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9885353 (10Papaul) @Clement_Goubert next week monday 17th at 10:00am CT [16:08:30] (03CR) 10Giuseppe Lavagetto: [C:03+2] statsd-exporter: also change the port in the networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042303 (owner: 10Giuseppe Lavagetto) [16:08:45] FIRING: ProbeDown: Service sessionstore2004-a:9042 has failed probes (tcp_cassandra_a_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore2004-a:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:08:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P64726 and previous config saved to /var/cache/conftool/dbconfig/20240612-160856-marostegui.json [16:08:58] (03PS4) 10CDanis: otelcol: filter out sessionstore user IDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039292 (https://phabricator.wikimedia.org/T366750) [16:08:58] (03PS4) 10CDanis: otelcol: filter common healthcheck spans [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039297 (https://phabricator.wikimedia.org/T366750) [16:08:58] (03PS1) 10CDanis: otelcol: bump to v0.102.0 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042305 (https://phabricator.wikimedia.org/T364907) [16:09:22] (03Merged) 10jenkins-bot: statsd-exporter: also change the port in the networkpolicy [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042303 (owner: 10Giuseppe Lavagetto) [16:10:05] !log jhathaway@deploy1002 Started scap: (no justification provided) [16:10:46] FIRING: [2x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:11:07] (03CR) 10Elukey: [C:03+1] cassandra: alternate logging hostname definition [puppet] - 10https://gerrit.wikimedia.org/r/1042273 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans) [16:11:30] !log jhathaway@deploy1002 Finished scap: (no justification provided) (duration: 03m 19s) [16:12:50] 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: memory errors for ml-serve2007.codfw.wmnet - https://phabricator.wikimedia.org/T366688#9885382 (10Papaul) a:05Papaul→03None [16:13:45] RESOLVED: [2x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:13:46] 10ops-codfw, 06DC-Ops, 06Machine-Learning-Team: hw troubleshooting: memory errors during boot for ml-staging2001.codfw.wmnet - https://phabricator.wikimedia.org/T366670#9885387 (10Papaul) a:05Papaul→03None [16:13:57] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [16:15:10] (03CR) 10Giuseppe Lavagetto: [C:03+2] mw-debug: fix exposed port for statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042298 (owner: 10Giuseppe Lavagetto) [16:15:52] (03Merged) 10jenkins-bot: mw-debug: fix exposed port for statsd [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042298 (owner: 10Giuseppe Lavagetto) [16:17:12] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [16:17:35] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [16:17:51] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [16:18:00] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [16:18:12] 
!log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [16:18:50] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on 8 hosts with reason: T364891 [16:18:53] T364891: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891 [16:19:15] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 8 hosts with reason: T364891 [16:19:15] !log kamila@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host wikikube-ctrl1003.eqiad.wmnet with OS bullseye [16:19:23] 10ops-eqiad, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): eqiad:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366204#9885446 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by kamila@cumin1002 for host wikikube-ctrl1003.eqiad.... [16:20:34] !log cumin 'A:cp-text and A:ulsfo' 'systemctl poweroff' - T364891 [16:20:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:46] FIRING: [3x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:21:12] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T352010)', diff saved to https://phabricator.wikimedia.org/P64727 and previous config saved to /var/cache/conftool/dbconfig/20240612-162110-ladsgroup.json [16:21:14] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [16:21:16] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:21:27] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: 
Maintenance [16:21:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T352010)', diff saved to https://phabricator.wikimedia.org/P64728 and previous config saved to /var/cache/conftool/dbconfig/20240612-162134-ladsgroup.json [16:23:33] PROBLEM - PyBal IPVS diff check on lvs4008 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:23:33] PROBLEM - PyBal IPVS diff check on lvs4010 is CRITICAL: (CRITICAL: Mismatch between IPVS and PyBal https://wikitech.wikimedia.org/wiki/PyBal [16:23:45] FIRING: [4x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:23:46] 10ops-codfw, 06SRE, 06DC-Ops, 06serviceops: Moving 1G servers out of rack D4 in prep of switch migration - https://phabricator.wikimedia.org/T361856#9885472 (10Clement_Goubert) All right, I'll do the draining monday beginning of the UTC afternoon so it's all set for you. [16:23:51] (03CR) 10JHathaway: [C:03+1] Replace development server with uWSGI. 
(031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1042261 (owner: 10Slyngshede) [16:24:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T367261)', diff saved to https://phabricator.wikimedia.org/P64729 and previous config saved to /var/cache/conftool/dbconfig/20240612-162403-marostegui.json [16:24:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2190.codfw.wmnet with reason: Maintenance [16:24:08] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [16:24:19] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2190.codfw.wmnet with reason: Maintenance [16:24:26] FIRING: [6x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:24:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T367261)', diff saved to https://phabricator.wikimedia.org/P64730 and previous config saved to /var/cache/conftool/dbconfig/20240612-162426-marostegui.json [16:24:34] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail, 13Patch-For-Review: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9885481 (10jhathaway) [16:24:41] FIRING: [8x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:24:55] ^These are expected. Silencing while we're doing maintenance [16:25:02] ok, thanks! 
[16:25:05] thanks [16:25:10] ack thanks brett [16:25:16] !incidents [16:25:16] 4740 (UNACKED) [6x] ProbeDown sre (probes/service ulsfo) [16:25:27] !ack 4740 [16:25:28] 4740 (ACKED) [6x] ProbeDown sre (probes/service ulsfo) [16:25:32] (03CR) 10Ryan Kemper: [C:03+2] ryankemper: add some bash config [puppet] - 10https://gerrit.wikimedia.org/r/1035052 (owner: 10Ryan Kemper) [16:25:46] RESOLVED: [4x] ProbeDown: Service sessionstore2004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:27:49] !incidents [16:27:49] 4740 (ACKED) [6x] ProbeDown sre (probes/service ulsfo) [16:27:59] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9885505 (10VRiley-WMF) It certainly does! I will plan for this tomorrow and start prepping a motherboard for this unit. Thanks! [16:28:08] resolving that. I'm sorry for the page, [16:28:32] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[16:28:40] FIRING: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2088:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:28:45] FIRING: [7x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:30:46] FIRING: [8x] JobUnavailable: Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:31:21] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:33:25] RESOLVED: [4x] SystemdUnitFailed: prometheus-wmf-elasticsearch-exporter-9200.service on elastic2088:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:33:45] FIRING: [4x] ProbeDown: Service sessionstore2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:35:10] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [16:35:46] FIRING: [4x] ProbeDown: Service sessionstore2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:38:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T367261)', diff saved to https://phabricator.wikimedia.org/P64731 and previous config saved to /var/cache/conftool/dbconfig/20240612-163822-marostegui.json [16:38:45] RESOLVED: [4x] ProbeDown: Service sessionstore2005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:40:47] FIRING: HelmReleaseBadStatus: Helm release opentelemetry-collector/main-opentelemetry-collector on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:40:51] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[16:41:27] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [16:42:29] (03Merged) 10jenkins-bot: otelcol: increase max_unavailable to ~5% [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042311 (owner: 10CDanis) [16:43:45] FIRING: [4x] ProbeDown: Service sessionstore1004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:43:45] FIRING: [8x] JobUnavailable: Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:44:11] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:44:22] (03CR) 10Bartosz Dziewoński: [C:03+1] wm-patch-demo: silently ignore errors [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1042153 (https://phabricator.wikimedia.org/T367155) (owner: 10Hashar) [16:44:22] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[16:45:18] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9885645 (10BCornwall) [16:45:46] FIRING: [8x] JobUnavailable: Reduced availability for job cache_haproxy_tls in ops@ulsfo - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:45:47] RESOLVED: HelmReleaseBadStatus: Helm release opentelemetry-collector/main-opentelemetry-collector on k8s@codfw in state pending-upgrade - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=codfw&var-cluster=k8s&var-namespace=opentelemetry-collector - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [16:46:03] (03PS31) 10Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [16:46:14] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/User:BryanDavis/Sandbox/D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041168 (owner: 10BryanDavis) [16:46:47] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9885674 (10BCornwall) [16:46:54] (03CR) 10CDanis: [C:03+2] "works: http.url ==> http://localhost:6006/sessions/v1/enwiki%3AMWSession%3A{token}" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039292 (https://phabricator.wikimedia.org/T366750) (owner: 10CDanis) [16:47:56] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [16:48:18] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[16:48:45] RESOLVED: [4x] ProbeDown: Service sessionstore1004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:49:01] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission stat100[4-7].eqiad.wmnet - https://phabricator.wikimedia.org/T367147#9885690 (10VRiley-WMF) [16:49:55] (03Merged) 10jenkins-bot: otelcol: filter out sessionstore user IDs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039292 (https://phabricator.wikimedia.org/T366750) (owner: 10CDanis) [16:50:31] 10ops-eqiad, 06SRE, 06DC-Ops, 10decommission-hardware: decommission stat100[4-7].eqiad.wmnet - https://phabricator.wikimedia.org/T367147#9885695 (10VRiley-WMF) 05Open→03Resolved Ran the decommission script and removed them from the racks. This is now resolved [16:51:51] (03CR) 10CDanis: [C:03+2] otelcol: filter common healthcheck spans [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039297 (https://phabricator.wikimedia.org/T366750) (owner: 10CDanis) [16:52:15] claime: I can't quite tell. I think eventgate-analytics looks fine? But maybe it always has? 
[16:52:15] https://grafana-rw.wikimedia.org/explore?left=%7B%22datasource%22:%22000000006%22,%22queries%22:%5B%7B%22datasource%22:%7B%22type%22:%22prometheus%22,%22uid%22:%22000000006%22%7D,%22editorMode%22:%22code%22,%22expr%22:%22sum%28%5Cn%20%20rate%28envoy_cluster_upstream_cx_connect_fail%7Benvoy_cluster_name%3D~%5C%22eventgate-%28analytics%7Cmain%29%5C%22,%20cluster%3D~%5C%22.%2A%5C%22,%20instance%3D~%5C%22.%2A%5C%22%7D%5B2m%5D%29%29%5Cnby% [16:52:15] 20%28envoy_cluster_name%29%20%2F%20sum%28%5Cn%20%20rate%28envoy_cluster_upstream_cx_total%7Benvoy_cluster_name%3D~%5C%22eventgate-%28analytics%7Cmain%29%5C%22,%20cluster%3D~%5C%22.%2A%5C%22,%20instance%3D~%5C%22.%2A%5C%22%7D%5B2m%5D%29%29%5Cnby%20%28envoy_cluster_name%29%22,%22interval%22:%22%22,%22range%22:true,%22refId%22:%22A%22%7D%5D,%22range%22:%7B%22from%22:%22now-3h%22,%22to%22:%22now%22%7D%7D&orgId=1 [16:52:30] Hm, bad link sorry [16:52:50] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9885722 (10BCornwall) [16:52:57] (erg, the grafana UI is being weird in my browser, can't click on some things, like share button...) 
[16:53:21] (03CR) 10Btullis: dse-k8s-services: Add net-new chart for Airflow (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041759 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [16:53:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P64732 and previous config saved to /var/cache/conftool/dbconfig/20240612-165329-marostegui.json [16:54:07] ottomata: you can use https://w.wiki/ manually [16:54:58] (03Merged) 10jenkins-bot: otelcol: filter common healthcheck spans [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039297 (https://phabricator.wikimedia.org/T366750) (owner: 10CDanis) [16:54:59] 10ops-codfw, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel codfw wikikube worker nodes - https://phabricator.wikimedia.org/T367286#9885733 (10VRiley-WMF) [16:55:20] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [16:55:46] FIRING: [4x] ProbeDown: Service sessionstore1004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:56:04] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [16:57:48] (03PS7) 10Ottomata: Configurably remove varnish handling of /beacon/event [puppet] - 10https://gerrit.wikimedia.org/r/1042278 (https://phabricator.wikimedia.org/T238230) [16:58:03] (03CR) 10Ottomata: "I am guessing at the best way to do this. Please advise on better ways." 
[puppet] - 10https://gerrit.wikimedia.org/r/1042278 (https://phabricator.wikimedia.org/T238230) (owner: 10Ottomata) [16:58:45] RESOLVED: [4x] ProbeDown: Service sessionstore1004-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T1700) [17:01:05] (03CR) 10Volans: wdqs.data-reload: various fixes (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: 10Ryan Kemper) [17:01:46] (03PS1) 10CDanis: bump quota for otelcol [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042318 (https://phabricator.wikimedia.org/T364907) [17:01:49] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9885755 (10elukey) We have sretest2001 racked and connected to mgmt network, and it is a Supermicro node. I tried to... 
[17:03:08] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9885761 (10BCornwall) [17:03:45] FIRING: [4x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:04:36] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9885762 (10BCornwall) [17:06:55] (03CR) 10CDanis: [V:03+2 C:03+2] bump quota for otelcol [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042318 (https://phabricator.wikimedia.org/T364907) (owner: 10CDanis) [17:07:33] (03CR) 10Alexandros Kosiaris: bump quota for otelcol (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042318 (https://phabricator.wikimedia.org/T364907) (owner: 10CDanis) [17:08:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190', diff saved to https://phabricator.wikimedia.org/P64733 and previous config saved to /var/cache/conftool/dbconfig/20240612-170837-marostegui.json [17:08:45] FIRING: [4x] ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:08:48] (03PS1) 10CDanis: otelcol: set no quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042319 [17:09:21] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:sessionstore [17:09:43] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet [17:10:12] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: 
name=cp4037.ulsfo.wmnet [17:10:46] RESOLVED: [4x] ProbeDown: Service sessionstore1005-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:10:56] (03CR) 10Dzahn: [C:03+2] acme_chief: add replica-a and replica-b to gitlab cert names [puppet] - 10https://gerrit.wikimedia.org/r/1041749 (owner: 10Dzahn) [17:12:56] (03CR) 10CDanis: [V:03+2 C:03+2] otelcol: set no quota [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042319 (owner: 10CDanis) [17:13:20] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9885810 (10CDobbins) [17:13:22] !log cdanis@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [17:13:39] (03CR) 10Dzahn: [C:03+1] idp/gitlab: add gitlab-replica-a and -b to regex [puppet] - 10https://gerrit.wikimedia.org/r/1041768 (owner: 10Dzahn) [17:13:47] (03PS2) 10Dzahn: idp/gitlab: add gitlab-replica-a and -b to regex [puppet] - 10https://gerrit.wikimedia.org/r/1041768 [17:13:58] !log cdanis@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[17:15:47] (03PS3) 10Dzahn: idp/gitlab: add gitlab-replica-a and -b to regex [puppet] - 10https://gerrit.wikimedia.org/r/1041768 [17:16:26] (03CR) 10Dzahn: [C:03+2] idp/gitlab: add gitlab-replica-a and -b to regex [puppet] - 10https://gerrit.wikimedia.org/r/1041768 (owner: 10Dzahn) [17:18:24] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9885829 (10CDobbins) [17:19:09] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9885830 (10BCornwall) [17:21:42] 10ops-ulsfo, 06SRE, 06DC-Ops: PowerSupplyFailure - https://phabricator.wikimedia.org/T366863#9885839 (10RobH) 05Open→03Resolved a:03RobH had been slightly unseated over time even though velcroed in place as it was fully online when i checked today, but pushed in cables. [17:22:25] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9885854 (10BCornwall) [17:22:41] (03CR) 10Giuseppe Lavagetto: [C:03+2] statsd-exporter: the service should listen to UDP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042312 (owner: 10Giuseppe Lavagetto) [17:23:29] (03Merged) 10jenkins-bot: statsd-exporter: the service should listen to UDP [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042312 (owner: 10Giuseppe Lavagetto) [17:23:38] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9885856 (10BCornwall) [17:23:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2190 (T367261)', diff saved to https://phabricator.wikimedia.org/P64734 and previous config saved to /var/cache/conftool/dbconfig/20240612-172344-marostegui.json [17:23:45] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs 
into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9885860 (10CDobbins) [17:23:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2194.codfw.wmnet with reason: Maintenance [17:23:48] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [17:24:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2194.codfw.wmnet with reason: Maintenance [17:24:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2194 (T367261)', diff saved to https://phabricator.wikimedia.org/P64735 and previous config saved to /var/cache/conftool/dbconfig/20240612-172406-marostegui.json [17:24:25] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:24:28] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:25:23] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [17:25:47] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [17:25:57] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [17:26:07] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [17:27:08] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 13 UTC late backport window](https://wikitech.wikimedia.org/wiki/User:BryanDavis/Sandbox/D" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041168 (owner: 10BryanDavis) [17:29:51] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9885889 (10CDobbins) [17:30:43] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: cluster=cache_text,dc=ulsfo [17:33:33] RECOVERY - PyBal IPVS diff check on lvs4008 is OK: OK: no difference between hosts in IPVS/PyBal 
https://wikitech.wikimedia.org/wiki/PyBal [17:33:33] RECOVERY - PyBal IPVS diff check on lvs4010 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [17:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:37:52] FIRING: GitLabCIJobErrors: GitLab - High CI job error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIJobErrors [17:38:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T367261)', diff saved to https://phabricator.wikimedia.org/P64736 and previous config saved to /var/cache/conftool/dbconfig/20240612-173759-marostegui.json [17:38:04] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [17:39:19] !log Remove downtime of cache_text/cp text servers in ulsfo - T364891 [17:39:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:39:23] T364891: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891 [17:40:26] RESOLVED: [8x] ProbeDown: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:42:52] RESOLVED: GitLabCIJobErrors: GitLab - High CI job error rate - https://wikitech.wikimedia.org/wiki/GitLab/Runbook - https://grafana.wikimedia.org/d/Chb-gC07k/gitlab-ci-overview - https://alerts.wikimedia.org/?q=alertname%3DGitLabCIJobErrors [17:43:26] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - 
https://phabricator.wikimedia.org/T364891#9885932 (10BCornwall) [17:46:18] (03PS1) 10BCornwall: Revert "depool ulsfo for text cluster drive upgrade" [dns] - 10https://gerrit.wikimedia.org/r/1042326 [17:46:25] (03PS2) 10BCornwall: Revert "depool ulsfo for text cluster drive upgrade" [dns] - 10https://gerrit.wikimedia.org/r/1042326 [17:49:57] (03CR) 10Ssingh: [C:03+1] Revert "depool ulsfo for text cluster drive upgrade" [dns] - 10https://gerrit.wikimedia.org/r/1042326 (owner: 10BCornwall) [17:49:58] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [17:50:02] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 [17:50:33] (03CR) 10BCornwall: [C:03+2] Revert "depool ulsfo for text cluster drive upgrade" [dns] - 10https://gerrit.wikimedia.org/r/1042326 (owner: 10BCornwall) [17:51:50] !log Repool ulsfo as A:cp-text nvme upgrades are complete (T364891) [17:51:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:51:54] T364891: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891 [17:52:00] !log authdns-update run on dns1004 (T364891) [17:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:53:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P64737 and previous config saved to /var/cache/conftool/dbconfig/20240612-175306-marostegui.json [17:53:15] (03PS2) 10Dzahn: rename gitlab-replica-old to gitlab-replica-b [dns] - 10https://gerrit.wikimedia.org/r/1041740 [17:53:23] (03PS2) 10Dzahn: gitlab: change service name on gitlab1003 to gitlab-replica-b [puppet] - 
10https://gerrit.wikimedia.org/r/1041751 [17:53:28] 06SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for Audrey Penven - https://phabricator.wikimedia.org/T367184#9885965 (10KFrancis) I have sent the NDA out for signatures. I'll confirm when it's complete. Thanks! [17:56:20] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [17:56:25] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 [17:57:42] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab1003.wikimedia.org with reason: renaming gitlab-replica [17:57:55] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gitlab1003.wikimedia.org with reason: renaming gitlab-replica [17:58:08] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [17:58:15] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab-replica-old.wikimedia.org with reason: renaming gitlab-replica [17:58:16] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 1:00:00 on gitlab-replica-old.wikimedia.org with reason: renaming gitlab-replica [17:59:31] !log gitlab-replica-old - downtime, renaming to gitlab-replica-b [17:59:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] brennen and dduvall: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T1800). 
[18:00:10] (03CR) 10Dzahn: [C:03+2] rename gitlab-replica-old to gitlab-replica-b [dns] - 10https://gerrit.wikimedia.org/r/1041740 (owner: 10Dzahn) [18:00:43] o/ [18:01:07] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 213, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:01:09] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:01:35] (03CR) 10Dzahn: [C:03+2] "downtimed, masked gitlab-exporter service, disabled puppet, merging DNS change, merging config change" [puppet] - 10https://gerrit.wikimedia.org/r/1041751 (owner: 10Dzahn) [18:01:47] !log 1.43.0-wmf.9 train (T361403): currently blocked on T367334, holding at group0 until resolved. [18:01:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:53] T361403: 1.43.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T361403 [18:01:53] T367334: EntitySchema conditional namespace registration errors (NS_ENTITYSCHEMA_JSON, NamespaceRegistrationHandler::registerNamespace() TypeError) - https://phabricator.wikimedia.org/T367334 [18:03:35] (03PS5) 10Snwachukwu: Update Eventgate-Wikimedia and Eventstreams repository to Gitlab source and version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) [18:04:27] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [18:04:31] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 
[18:05:07] (03CR) 10Snwachukwu: "Acknowledged" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [18:05:46] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:08:02] 06SRE, 10SRE-Access-Requests: Requesting access to private data-based dashboards for Jsn.sherman - https://phabricator.wikimedia.org/T367295#9886019 (10herron) [18:08:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194', diff saved to https://phabricator.wikimedia.org/P64738 and previous config saved to /var/cache/conftool/dbconfig/20240612-180814-marostegui.json [18:08:45] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:12:43] 06SRE, 10SRE-Access-Requests: Requesting access to private data-based dashboards for Jsn.sherman - https://phabricator.wikimedia.org/T367295#9886034 (10herron) Hi @odimitrijevic @Milimetric @WDoranWMF @Ahoelzl could one of you please approve this request for analytics-privatedata-users? Thanks in advance! 
[18:15:53] (03PS1) 10Herron: admin: add ldap_only entry for gonyeahialam [puppet] - 10https://gerrit.wikimedia.org/r/1042331 (https://phabricator.wikimedia.org/T367053) [18:17:19] (03PS5) 10Brouberol: helmfile: don't schedule admin-ng diff check jobs for the staging k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1042285 (https://phabricator.wikimedia.org/T331894) [18:17:30] (03PS1) 10Lucas Werkmeister: python: Also look for ~/www/python/src/uwsgi.ini [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1042332 (https://phabricator.wikimedia.org/T367345) [18:17:43] (03CR) 10Snwachukwu: "@akosiaris@wikimedia.org @ltoscano@wikimedia.org Please can you +1 on the change again. I did a rebase." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [18:17:54] (03Abandoned) 10Brouberol: helmfile: remove temporary else block once resources were absented [puppet] - 10https://gerrit.wikimedia.org/r/1042286 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [18:18:49] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042333 [18:20:32] brennen: o/ (oof, doesn't sound good) [18:21:04] (03PS6) 10Brouberol: helmfile: don't schedule admin-ng diff check jobs for aliases of k8s clusters [puppet] - 10https://gerrit.wikimedia.org/r/1042285 (https://phabricator.wikimedia.org/T331894) [18:23:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2194 (T367261)', diff saved to https://phabricator.wikimedia.org/P64739 and previous config saved to /var/cache/conftool/dbconfig/20240612-182321-marostegui.json [18:23:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2205.codfw.wmnet with reason: Maintenance [18:23:25] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [18:23:37] !log marostegui@cumin1002 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2205.codfw.wmnet with reason: Maintenance [18:23:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2205 (T367261)', diff saved to https://phabricator.wikimedia.org/P64740 and previous config saved to /var/cache/conftool/dbconfig/20240612-182343-marostegui.json [18:24:20] (03CR) 10Brouberol: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2910/co" [puppet] - 10https://gerrit.wikimedia.org/r/1042285 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [18:24:50] (03PS1) 10Brouberol: helmfile: remove temporary else block once resources were absented [puppet] - 10https://gerrit.wikimedia.org/r/1042336 (https://phabricator.wikimedia.org/T331894) [18:27:23] (03CR) 10Brouberol: [C:03+2] helmfile: set HELM environment variables for the admin-ng systemd jobs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1042224 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [18:27:33] (03CR) 10Snwachukwu: [C:03+1] Update Eventgate-Wikimedia and Eventstreams repository to Gitlab source and version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [18:27:52] dduvall: i'm not entirely clear that that one should actually block, but i guess i will trust the process here and wait for someone who knows to weigh in. [18:29:12] from a heuristic approach, does the error appear for wmf.8 as well and do the rates differ? 
[18:29:50] that's one measure i use, but of course this is all discretionary and it makes sense to get a confirmation like you're doing already [18:30:58] (03CR) 10Brouberol: [C:03+1] wdqs graph-split: add final svcs [dns] - 10https://gerrit.wikimedia.org/r/1042160 (https://phabricator.wikimedia.org/T364364) (owner: 10Stevemunene) [18:31:23] (03CR) 10Jforrester: [C:03+1] Update Eventgate-Wikimedia and Eventstreams repository to Gitlab source and version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [18:31:30] dduvall: https://logstash.wikimedia.org/goto/566ad848a6721d93d8e1e83ee2ffa2e8 - it shows up in wmf.8 but maybe only during L.ucas_WMDE's attempted backport? [18:32:27] (03CR) 10Ottomata: [C:03+1] Update Eventgate-Wikimedia and Eventstreams repository to Gitlab source and version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [18:32:54] (03PS3) 10Jdlrobson: Disable quick surveys using deprecated configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041748 (https://phabricator.wikimedia.org/T367128) [18:35:34] well, anyway, patch has a +2. i'll backport and go ahead. 
[18:35:55] (03PS1) 10Brennen Bearnes: Call NamespaceRegistrationHandler::setConstants() earlier [extensions/EntitySchema] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1042343 (https://phabricator.wikimedia.org/T367334) [18:36:56] (03PS1) 10Dzahn: rename gitlab-replica to gitlab-replica-a [dns] - 10https://gerrit.wikimedia.org/r/1042344 [18:37:11] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:37:21] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:37:30] brennen: that's really confusing but blocking made/makes sense to me [18:37:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T367261)', diff saved to https://phabricator.wikimedia.org/P64741 and previous config saved to /var/cache/conftool/dbconfig/20240612-183741-marostegui.json [18:37:48] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [18:37:53] (03CR) 10Ottomata: [C:03+2] Update Eventgate-Wikimedia and Eventstreams repository to Gitlab source and version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [18:39:14] (03Merged) 10jenkins-bot: Update Eventgate-Wikimedia and Eventstreams repository to Gitlab source and version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [18:39:28] !log ebysans@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [18:39:51] !log ebysans@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [18:40:12] (03PS1) 10CDanis: otelcol: 
Auto-generate useful operation names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042350 (https://phabricator.wikimedia.org/T367342) [18:40:33] !log ebysans@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [18:40:50] !log ebysans@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [18:41:38] (03CR) 10Scott French: "@tklausmann@wikimedia.org - Could I ask you to take a look at this, and let me whether you think this would be desirable for k8s-mlserve a" [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [18:41:55] !log ebysans@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams-internal: apply [18:42:37] !log ebysans@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams-internal: apply [18:46:35] (03PS2) 10CDanis: otelcol: Auto-generate useful operation names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042350 (https://phabricator.wikimedia.org/T367342) [18:46:35] (03CR) 10Jsn.sherman: [C:03+1] "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041748 (https://phabricator.wikimedia.org/T367128) (owner: 10Jdlrobson) [18:48:12] !log ebysans@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams-internal: apply [18:48:58] (03PS3) 10CDanis: otelcol: Auto-generate useful operation names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042350 (https://phabricator.wikimedia.org/T367342) [18:49:15] !log ebysans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams-internal: apply [18:51:36] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[18:52:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P64742 and previous config saved to /var/cache/conftool/dbconfig/20240612-185248-marostegui.json [18:52:59] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [18:53:51] (03PS4) 10CDanis: otelcol: Auto-generate useful operation names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042350 (https://phabricator.wikimedia.org/T367342) [18:55:50] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [18:56:47] (03PS5) 10CDanis: otelcol: Auto-generate useful operation names [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042350 (https://phabricator.wikimedia.org/T367342) [18:57:43] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [18:57:45] (03PS1) 10Ssingh: [WIP]: dnsbox: advertise ntp-[abc].anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1042354 (https://phabricator.wikimedia.org/T366360) [18:58:04] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [18:58:48] !log ebysans@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams-internal: apply [18:58:57] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2911/console" [puppet] - 10https://gerrit.wikimedia.org/r/1042354 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [18:59:40] !log ebysans@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams-internal: apply [18:59:48] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [18:59:56] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[19:01:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by brennen@deploy1002 using scap backport" [extensions/EntitySchema] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1042343 (https://phabricator.wikimedia.org/T367334) (owner: 10Brennen Bearnes) [19:01:57] (03PS2) 10Ssingh: [WIP]: dnsbox: advertise ntp-[abc].anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1042354 (https://phabricator.wikimedia.org/T366360) [19:02:11] !log ebysans@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply [19:02:32] !log ebysans@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply [19:02:55] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [19:02:56] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [19:03:10] (03CR) 10Ssingh: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2912/console" [puppet] - 10https://gerrit.wikimedia.org/r/1042354 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) [19:06:06] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [19:06:16] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [19:07:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205', diff saved to https://phabricator.wikimedia.org/P64744 and previous config saved to /var/cache/conftool/dbconfig/20240612-190755-marostegui.json [19:08:21] (03PS3) 10Ssingh: [WIP]: dnsbox: advertise ntp-[abc].anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1042354 (https://phabricator.wikimedia.org/T366360) [19:08:35] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad wikikube worker nodes - https://phabricator.wikimedia.org/T367285#9886282 (10VRiley-WMF) [19:08:44] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[19:08:55] (03CR) 10Scott French: aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [19:09:11] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [19:09:46] (03PS4) 10Ssingh: [WIP]: dnsbox: advertise ntp-[abc].anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1042354 (https://phabricator.wikimedia.org/T366360) [19:10:22] 10ops-eqiad, 06SRE, 06DC-Ops, 10Prod-Kubernetes, and 2 others: Relabel eqiad wikikube worker nodes - https://phabricator.wikimedia.org/T367285#9886284 (10VRiley-WMF) 05Open→03Resolved a:03VRiley-WMF These have been tagged as requested. This is resolved. [19:10:58] !log ebysans@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply [19:11:48] !log ebysans@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply [19:15:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance [19:15:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance [19:16:46] (03PS5) 10Ssingh: [WIP]: dnsbox: advertise ntp-[abc].anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1042354 (https://phabricator.wikimedia.org/T366360) [19:17:34] !log ebysans@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply [19:17:58] (03CR) 10Scott French: [C:03+2] aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [19:18:04] !log ebysans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply [19:19:13] !log ebysans@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-logging-external: 
apply [19:19:29] (03Merged) 10jenkins-bot: aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [19:19:41] !log ebysans@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-logging-external: apply [19:21:27] (03PS6) 10Ssingh: [WIP]: dnsbox: advertise ntp-[abc].anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1042354 (https://phabricator.wikimedia.org/T366360) [19:22:05] !log cdanis@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [19:22:17] !log cdanis@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [19:22:27] !log ebysans@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-logging-external: apply [19:23:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2205 (T367261)', diff saved to https://phabricator.wikimedia.org/P64745 and previous config saved to /var/cache/conftool/dbconfig/20240612-192303-marostegui.json [19:23:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2209.codfw.wmnet with reason: Maintenance [19:23:09] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [19:23:09] !log ebysans@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-logging-external: apply [19:23:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2209.codfw.wmnet with reason: Maintenance [19:23:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2209 (T367261)', diff saved to https://phabricator.wikimedia.org/P64746 and previous config saved to /var/cache/conftool/dbconfig/20240612-192327-marostegui.json [19:24:26] (03Abandoned) 10Ssingh: [WIP]: dnsbox: advertise ntp-[abc].anycast.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/1042354 (https://phabricator.wikimedia.org/T366360) (owner: 10Ssingh) 
[19:25:51] !log ebysans@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-logging-external: apply [19:26:33] !log ebysans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-logging-external: apply [19:27:40] !log ebysans@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [19:28:05] !log ebysans@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [19:29:29] !log ebysans@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [19:29:59] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [19:30:12] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [19:30:22] !log ebysans@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [19:30:51] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply [19:31:33] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply [19:31:58] !log ebysans@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [19:32:01] (03Merged) 10jenkins-bot: Call NamespaceRegistrationHandler::setConstants() earlier [extensions/EntitySchema] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1042343 (https://phabricator.wikimedia.org/T367334) (owner: 10Brennen Bearnes) [19:32:09] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply [19:32:31] !log brennen@deploy1002 Started scap: Backport for [[gerrit:1042343|Call NamespaceRegistrationHandler::setConstants() earlier (T367334 T363153)]] [19:32:41] T367334: EntitySchema conditional namespace registration errors (NS_ENTITYSCHEMA_JSON, NamespaceRegistrationHandler::registerNamespace() TypeError) - https://phabricator.wikimedia.org/T367334 [19:32:41] T363153: [ES-M2]: Load EntitySchema data type registration for WikibaseClient on client wikis - 
https://phabricator.wikimedia.org/T363153 [19:32:52] !log ebysans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [19:32:53] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply [19:33:22] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [19:34:07] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [19:34:39] !log ebysans@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics-external: apply [19:34:53] !log ebysans@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics-external: apply [19:35:03] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [19:35:12] !log brennen@deploy1002 brennen: Backport for [[gerrit:1042343|Call NamespaceRegistrationHandler::setConstants() earlier (T367334 T363153)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [19:35:24] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [19:36:08] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply [19:36:16] !log ebysans@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics-external: apply [19:36:47] !log ebysans@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics-external: apply [19:36:57] !log brennen@deploy1002 brennen: Continuing with sync [19:37:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T367261)', diff saved to https://phabricator.wikimedia.org/P64747 and previous config saved to /var/cache/conftool/dbconfig/20240612-193712-marostegui.json [19:37:16] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [19:37:52] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [19:38:09] (03CR) 
10BryanDavis: [C:03+2] "LGTM." [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1042332 (https://phabricator.wikimedia.org/T367345) (owner: 10Lucas Werkmeister) [19:38:23] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/page-analytics: apply [19:38:41] (03Merged) 10jenkins-bot: python: Also look for ~/www/python/src/uwsgi.ini [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/1042332 (https://phabricator.wikimedia.org/T367345) (owner: 10Lucas Werkmeister) [19:39:14] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [19:39:39] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/data-gateway: apply [19:40:11] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/data-gateway: apply [19:40:16] !log ebysans@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics-external: apply [19:41:11] !log ebysans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics-external: apply [19:43:15] !log ebysans@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-main: apply [19:43:27] !log ebysans@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-main: apply [19:45:37] !log ebysans@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-main: apply [19:45:37] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:1042343|Call NamespaceRegistrationHandler::setConstants() earlier (T367334 T363153)]] (duration: 13m 06s) [19:45:44] T367334: EntitySchema conditional namespace registration errors (NS_ENTITYSCHEMA_JSON, NamespaceRegistrationHandler::registerNamespace() TypeError) - https://phabricator.wikimedia.org/T367334 [19:45:44] T363153: [ES-M2]: Load EntitySchema data type registration for WikibaseClient on client wikis - https://phabricator.wikimedia.org/T363153 [19:46:18] !log ebysans@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-main: apply [19:47:22] (03PS1) 
10TrainBranchBot: group1 wikis to 1.43.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042359 (https://phabricator.wikimedia.org/T361403) [19:47:24] (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.43.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042359 (https://phabricator.wikimedia.org/T361403) (owner: 10TrainBranchBot) [19:47:58] (03CR) 10Hashar: [C:03+2] wm-patch-demo: silently ignore errors [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1042153 (https://phabricator.wikimedia.org/T367155) (owner: 10Hashar) [19:48:02] (03Merged) 10jenkins-bot: group1 wikis to 1.43.0-wmf.9 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1042359 (https://phabricator.wikimedia.org/T361403) (owner: 10TrainBranchBot) [19:48:07] !log 1.43.0-wmf.9 train (T361403): blockers (hopefully) resolved, rolling to group1 [19:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:11] T361403: 1.43.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T361403 [19:48:21] !log ebysans@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-main: apply [19:48:39] (03Merged) 10jenkins-bot: wm-patch-demo: silently ignore errors [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1042153 (https://phabricator.wikimedia.org/T367155) (owner: 10Hashar) [19:48:48] !log ebysans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-main: apply [19:49:10] !log hashar@deploy1002 Started deploy [gerrit/gerrit@e4c49f9]: wm-patch-demo: silently ignore errors - T367155 [19:49:14] T367155: Gerrit error: Error while fetching results for wm-patch-demo - https://phabricator.wikimedia.org/T367155 [19:49:17] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@e4c49f9]: wm-patch-demo: silently ignore errors - T367155 (duration: 00m 07s) [19:49:46] FIRING: Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - 
https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25 [19:52:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P64748 and previous config saved to /var/cache/conftool/dbconfig/20240612-195219-marostegui.json [19:58:57] !log brennen@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.43.0-wmf.9 refs T361403 [19:59:02] T361403: 1.43.0-wmf.9 deployment blockers - https://phabricator.wikimedia.org/T361403 [19:59:37] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: gettimeofday() says it's time for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T2000) [20:00:05] jdlrobson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:47] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [20:01:38] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9886475 (10BCornwall) a:05RobH→03BCornwall [20:02:03] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9886473 (10BCornwall) [20:02:15] 10ops-ulsfo, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9886476 (10BCornwall) 05Open→03In progress [20:03:54] o/ present [20:07:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209', diff saved to https://phabricator.wikimedia.org/P64749 and previous config saved to /var/cache/conftool/dbconfig/20240612-200726-marostegui.json [20:09:00] cjming: TheresNoTime urbanecm RoanKattouw - are any of you free to help with a backport? [20:09:20] yes! in a meeting going long - give me a few mins [20:09:42] thanks cjming i appreciate it <3 [20:09:49] np! [20:10:41] !log fabfur@cumin1002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_codfw [20:11:13] (03PS32) 10Ryan Kemper: wdqs.data-reload: various fixes [cookbooks] - 10https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) [20:12:50] (03PS4) 10Jdlrobson: Disable quick surveys using deprecated configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041748 (https://phabricator.wikimedia.org/T367128) [20:12:58] (03PS3) 10Jdlrobson: Don't squish images in non-responsive skins e.g. Vector 2010 [core] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1041674 (https://phabricator.wikimedia.org/T113101) [20:13:10] ok - here for real [20:13:17] (03CR) 10Cwhite: "This patch is serviceable as-is. Other comments are non-blocking." 
[puppet] - 10https://gerrit.wikimedia.org/r/1042273 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans) [20:13:22] Jdlrobson: i'll do your config patch first [20:14:18] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041748 (https://phabricator.wikimedia.org/T367128) (owner: 10Jdlrobson) [20:14:43] (03CR) 10Clare Ming: [C:03+2] Don't squish images in non-responsive skins e.g. Vector 2010 [core] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1041674 (https://phabricator.wikimedia.org/T113101) (owner: 10Jdlrobson) [20:16:47] (03Merged) 10jenkins-bot: Disable quick surveys using deprecated configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041748 (https://phabricator.wikimedia.org/T367128) (owner: 10Jdlrobson) [20:17:23] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1041748|Disable quick surveys using deprecated configuration (T367128)]] [20:17:27] T367128: PHP Deprecated: Use of QuickSurveys survey with link parameter was deprecated in MediaWiki 1.43. [Called from QuickSurveys\SurveyFactory::factoryExternal] - https://phabricator.wikimedia.org/T367128 [20:19:57] !log cjming@deploy1002 jdlrobson, cjming: Backport for [[gerrit:1041748|Disable quick surveys using deprecated configuration (T367128)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:20:05] Jdlrobson: shall i sync? [20:20:53] just doing some testing. 
[20:21:02] config patch is good to sync [20:21:07] great - syncing [20:21:11] !log cjming@deploy1002 jdlrobson, cjming: Continuing with sync [20:22:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2209 (T367261)', diff saved to https://phabricator.wikimedia.org/P64750 and previous config saved to /var/cache/conftool/dbconfig/20240612-202233-marostegui.json [20:22:38] T367261: Rebuild recentchanges table everywhere - https://phabricator.wikimedia.org/T367261 [20:29:22] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1041748|Disable quick surveys using deprecated configuration (T367128)]] (duration: 11m 59s) [20:29:26] T367128: PHP Deprecated: Use of QuickSurveys survey with link parameter was deprecated in MediaWiki 1.43. [Called from QuickSurveys\SurveyFactory::factoryExternal] - https://phabricator.wikimedia.org/T367128 [20:30:07] (03PS1) 10BCornwall: Set cp4037 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1042366 (https://phabricator.wikimedia.org/T364891) [20:30:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by cjming@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1041674 (https://phabricator.wikimedia.org/T113101) (owner: 10Jdlrobson) [20:30:32] Jdlrobson: config patch is live - just waiting for backport to merge (7 mins) [20:31:38] sounds good! Thanks! 
[20:32:01] (03CR) 10Krinkle: Disable quick surveys using deprecated configuration (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041748 (https://phabricator.wikimedia.org/T367128) (owner: 10Jdlrobson) [20:34:27] (03CR) 10Eevans: cassandra: alternate logging hostname definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1042273 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans) [20:34:43] (03PS2) 10Eevans: cassandra: alternate logging hostname definition [puppet] - 10https://gerrit.wikimedia.org/r/1042273 (https://phabricator.wikimedia.org/T350567) [20:39:11] (03Merged) 10jenkins-bot: Don't squish images in non-responsive skins e.g. Vector 2010 [core] (wmf/1.43.0-wmf.9) - 10https://gerrit.wikimedia.org/r/1041674 (https://phabricator.wikimedia.org/T113101) (owner: 10Jdlrobson) [20:39:19] wow - that took a whopping 26 mins to merge [20:39:43] !log cjming@deploy1002 Started scap: Backport for [[gerrit:1041674|Don't squish images in non-responsive skins e.g. Vector 2010 (T113101)]] [20:39:47] T113101: Images should be responsive in Vector and restrained to a max-size. - https://phabricator.wikimedia.org/T113101 [20:40:44] lol [20:41:24] (03CR) 10Gergő Tisza: [C:04-1] "Needs an update now that the URL schema uses `/$serverName`, not `/$site/$lang`." 
[puppet] - 10https://gerrit.wikimedia.org/r/1036230 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [20:41:42] (03CR) 10Gergő Tisza: [C:04-1] [POC] Handle sso.wikimedia.org domain (033 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1036245 (https://phabricator.wikimedia.org/T365162) (owner: 10Gergő Tisza) [20:41:50] (03CR) 10Eevans: [C:03+2] sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1041231 (owner: 10Eevans) [20:42:01] (03CR) 10Ssingh: [C:03+1] Set cp4037 hieradata to use dual NVMe disks [puppet] - 10https://gerrit.wikimedia.org/r/1042366 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall) [20:42:21] !log cjming@deploy1002 cjming, jdlrobson: Backport for [[gerrit:1041674|Don't squish images in non-responsive skins e.g. Vector 2010 (T113101)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:42:32] Jdlrobson: good to sync? [20:43:32] cjming: yep good to sync! thanks! [20:44:10] !log cjming@deploy1002 cjming, jdlrobson: Continuing with sync [20:46:01] (03Merged) 10jenkins-bot: sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1041231 (owner: 10Eevans) [20:47:15] !log brett@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp4037.ulsfo.wmnet [20:47:43] Thanks cjming for your help! [20:47:54] you're welcome! [20:49:35] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1042273 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans) [20:52:35] !log cjming@deploy1002 Finished scap: Backport for [[gerrit:1041674|Don't squish images in non-responsive skins e.g. Vector 2010 (T113101)]] (duration: 12m 52s) [20:52:40] T113101: Images should be responsive in Vector and restrained to a max-size. 
- https://phabricator.wikimedia.org/T113101 [20:52:42] 10SRE-tools, 10Observability-Logging: Provide an utility script to replace a failed device in raid 0 array - https://phabricator.wikimedia.org/T350492#9886699 (10colewhite) [20:53:03] Jdlrobson: everything should be live [20:53:11] !log end of UTC late backport window [20:53:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:55:13] thank you! [21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240612T2100) [21:00:11] (03CR) 10BCornwall: [C:03+2] "y" [puppet] - 10https://gerrit.wikimedia.org/r/1042366 (https://phabricator.wikimedia.org/T364891) (owner: 10BCornwall) [21:03:57] (03PS1) 10Eevans: data-gateway: Upgrade production to v1.0.7 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042372 [21:04:32] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS bullseye [21:04:39] 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9886731 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye [21:05:11] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [21:05:17] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 [21:05:25] 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9886733 (10BCornwall) [21:05:50] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.reboot-workers for Hadoop analytics 
cluster [21:07:59] (03CR) 10Eevans: [C:03+2] cassandra: alternate logging hostname definition [puppet] - 10https://gerrit.wikimedia.org/r/1042273 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans) [21:11:29] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.wdqs.data-reload (exit_code=0) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [21:11:34] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 [21:13:04] (03PS1) 10Scott French: Revert "aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042375 (https://phabricator.wikimedia.org/T366851) [21:13:26] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs1010.eqiad.wmnet: Apply remote logging fix (r1042273) - eevans@cumin1002 [21:13:51] (03CR) 10Eevans: [C:03+1] Revert "aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042375 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [21:15:02] (03CR) 10Scott French: "Thanks, Eric!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1042375 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [21:15:17] (03CR) 10Scott French: [C:03+2] Revert "aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042375 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [21:16:42] (03Merged) 10jenkins-bot: Revert "aqs-http-gateway: allow cross-DC Cassandra client connection / fix settings" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1042375 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French) [21:17:41] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply [21:17:52] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply [21:18:38] !log brett@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host cp4037.ulsfo.wmnet with OS bullseye [21:19:17] !log brett@cumin2002 START - Cookbook sre.hosts.reimage for host cp4037.ulsfo.wmnet with OS bullseye [21:19:48] 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9886776 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye execu... 
[21:19:58] 10ops-ulsfo, 06DC-Ops, 06Traffic, 13Patch-For-Review: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9886778 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by brett@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye [21:20:51] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs1010.eqiad.wmnet: Apply remote logging fix (r1042273) - eevans@cumin1002 [21:21:56] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching A:cassandra-dev: Apply remote logging fix (r1042273) - eevans@cumin1002 [21:22:08] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: sync [21:22:55] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: sync [21:24:15] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/page-analytics: apply [21:25:38] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/page-analytics: apply [21:26:01] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/media-analytics: apply [21:27:00] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/media-analytics: apply [21:28:14] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/image-suggestion: apply [21:28:20] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/image-suggestion: apply [21:28:40] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/image-suggestion: sync [21:30:40] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply [21:30:53] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply [21:31:05] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: sync [21:31:35] (03CR) 10Eevans: [C:03+2] data-gateway: Upgrade production to v1.0.7 [deployment-charts] - 
https://gerrit.wikimedia.org/r/1042372 (owner: Eevans)
[21:31:41] (CR) CI reject: [V:-1] data-gateway: Upgrade production to v1.0.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1042372 (owner: Eevans)
[21:31:50] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: sync
[21:32:12] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/editor-analytics: apply
[21:33:05] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/editor-analytics: apply
[21:33:55] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/edit-analytics: apply
[21:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[21:34:46] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/edit-analytics: apply
[21:34:46] (PS2) Eevans: data-gateway: Upgrade production to v1.0.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1042372
[21:35:54] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply
[21:36:01] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply
[21:36:11] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: sync
[21:36:56] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: sync
[21:41:27] !log brett@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on cp4037.ulsfo.wmnet with reason: host reimage
[21:42:01] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:cassandra-dev: Apply remote logging fix (r1042273) - eevans@cumin1002
[21:44:23] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cp4037.ulsfo.wmnet with reason:
host reimage
[21:47:46] (CR) Ryan Kemper: [C:+1] wdqs.data-reload: various fixes [cookbooks] - https://gerrit.wikimedia.org/r/1038904 (https://phabricator.wikimedia.org/T349069) (owner: Ryan Kemper)
[21:53:55] (CR) Scott French: [C:+2] "For folks who might be interested, a summary of why this was reverted appears in https://phabricator.wikimedia.org/T366851#9886824." [deployment-charts] - https://gerrit.wikimedia.org/r/1042375 (https://phabricator.wikimedia.org/T366851) (owner: Scott French)
[21:55:17] * Krinkle checks if mw deploy coast is clear
[21:56:12] jouncebot: next
[21:56:12] In 8 hour(s) and 3 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T0600)
[21:56:12] In 8 hour(s) and 3 minute(s): Primary database switchover (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240613T0600)
[21:56:21] (CR) Scott French: [C:+1] data-gateway: Upgrade production to v1.0.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1042372 (owner: Eevans)
[21:56:46] (PS2) Krinkle: password: Document wmgPasswordSecretKey [mediawiki-config] - https://gerrit.wikimedia.org/r/1034905 (https://phabricator.wikimedia.org/T150647)
[21:56:49] (CR) Krinkle: [C:+2] password: Document wmgPasswordSecretKey [mediawiki-config] - https://gerrit.wikimedia.org/r/1034905 (https://phabricator.wikimedia.org/T150647) (owner: Krinkle)
[21:56:56] (PS5) Krinkle: Move etcd.php from wmf-config/ to src/ [mediawiki-config] - https://gerrit.wikimedia.org/r/891733 (https://phabricator.wikimedia.org/T308932)
[21:57:30] (Merged) jenkins-bot: password: Document wmgPasswordSecretKey [mediawiki-config] - https://gerrit.wikimedia.org/r/1034905 (https://phabricator.wikimedia.org/T150647) (owner: Krinkle)
[21:58:39] (CR) TrainBranchBot: [C:+2] "Approved by krinkle@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/891733
(https://phabricator.wikimedia.org/T308932) (owner: Krinkle)
[21:59:15] (Merged) jenkins-bot: Move etcd.php from wmf-config/ to src/ [mediawiki-config] - https://gerrit.wikimedia.org/r/891733 (https://phabricator.wikimedia.org/T308932) (owner: Krinkle)
[21:59:48] !log krinkle@deploy1002 Started scap: Backport for [[gerrit:891733|Move etcd.php from wmf-config/ to src/ (T308932)]]
[21:59:53] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932
[22:02:51] (CR) Eevans: [C:+2] data-gateway: Upgrade production to v1.0.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1042372 (owner: Eevans)
[22:03:04] !log krinkle@deploy1002 krinkle: Backport for [[gerrit:891733|Move etcd.php from wmf-config/ to src/ (T308932)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[22:03:45] (Merged) jenkins-bot: data-gateway: Upgrade production to v1.0.7 [deployment-charts] - https://gerrit.wikimedia.org/r/1042372 (owner: Eevans)
[22:04:45] !log eevans@deploy1002 helmfile [codfw] START helmfile.d/services/data-gateway: apply
[22:04:47] !log krinkle@deploy1002 krinkle: Continuing with sync
[22:06:16] !log eevans@deploy1002 helmfile [codfw] DONE helmfile.d/services/data-gateway: apply
[22:07:20] !log brett@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp4037.ulsfo.wmnet with OS bullseye
[22:07:30] ops-ulsfo, DC-Ops, Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9886850 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by brett@cumin2002 for host cp4037.ulsfo.wmnet with OS bullseye completed: - cp4037 (**PASS...
[22:07:46] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[22:08:08] !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/data-gateway: apply
[22:08:39] (CR) Gergő Tisza: password: Document wmgPasswordSecretKey (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1034905 (https://phabricator.wikimedia.org/T150647) (owner: Krinkle)
[22:10:17] !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/data-gateway: apply
[22:10:46] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:11:25] FIRING: SystemdUnitFailed: httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:13:31] !log krinkle@deploy1002 Finished scap: Backport for [[gerrit:891733|Move etcd.php from wmf-config/ to src/ (T308932)]] (duration: 13m 42s)
[22:13:35] T308932: Iteratively clean up wmf-config to be less dynamic and with smaller settings files (2022) - https://phabricator.wikimedia.org/T308932
[22:14:51] ops-ulsfo, DC-Ops, Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9886858 (BCornwall)
[22:17:00] !log brett@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet
[22:19:13] ops-ulsfo, DC-Ops, Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9886862 (BCornwall)
[22:21:36] (PS1) Bking: team-search-platform:
Add kafka topic alerts for new search pipeline [alerts] - https://gerrit.wikimedia.org/r/1042396 (https://phabricator.wikimedia.org/T349772)
[22:24:10] (CR) ScheduleDeploymentBot: "Scheduled for deployment in the [Thursday, June 13 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/User:BryanDavis/Sand" [mediawiki-config] - https://gerrit.wikimedia.org/r/1041168 (owner: BryanDavis)
[22:26:49] (PS2) BryanDavis: [DNM] Testing things in Gerrit UI [mediawiki-config] - https://gerrit.wikimedia.org/r/1041168 (https://phabricator.wikimedia.org/T366763)
[22:35:12] PROBLEM - Host an-worker1085 is DOWN: PING CRITICAL - Packet loss = 100%
[23:01:56] (Abandoned) BryanDavis: [DNM] Testing things in Gerrit UI [mediawiki-config] - https://gerrit.wikimedia.org/r/1041168 (https://phabricator.wikimedia.org/T366763) (owner: BryanDavis)
[23:07:46] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[23:11:25] RESOLVED: SystemdUnitFailed: httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:21:45] (PS1) Ncmonitor: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1042412
[23:21:48] (PS1) Ncmonitor: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1042413
[23:21:51] (PS1) Ncmonitor: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1042414
[23:22:40] (Abandoned) BCornwall: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1042413 (owner: Ncmonitor)
[23:23:25] (Abandoned) BCornwall: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1042412 (owner: Ncmonitor)
[23:23:29] (Abandoned) BCornwall: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1042414 (owner: Ncmonitor)
[23:30:28] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 51 probes of 791 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:35:28] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 30 probes of 791 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas
[23:38:23] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1042419
[23:38:23] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1042419 (owner: TrainBranchBot)
[23:49:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T352010)', diff saved to https://phabricator.wikimedia.org/P64751 and previous config saved to /var/cache/conftool/dbconfig/20240612-234923-ladsgroup.json
[23:49:28] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[23:49:46] FIRING: Storage /var over 50%: Alert for device lsw1-f5-eqiad.mgmt.eqiad.wmnet - Storage /var over 50% - https://alerts.wikimedia.org/?q=alertname%3DStorage+%2Fvar+over+50%25