[00:01:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:04:28] ^ working on HelmReleaseBadStatus, just for the avoidance of doubt please no deploys [00:06:25] FIRING: [6x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:06:43] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1057029 (owner: 10TrainBranchBot) [00:18:26] (03PS1) 10Ladsgroup: Update UI classes and CSS for review notices [extensions/FlaggedRevs] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057031 (https://phabricator.wikimedia.org/T191156) [00:23:12] (03PS1) 10Superzerocool: enwiki, commonswiki: lift IP cap for edit-a-thon [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057033 (https://phabricator.wikimedia.org/T371026) [00:24:32] (03CR) 10Superzerocool: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057033 (https://phabricator.wikimedia.org/T371026) (owner: 10Superzerocool) [00:27:32] (03CR) 10Ladsgroup: [C:03+2] Update UI classes and CSS for review notices [extensions/FlaggedRevs] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057031 (https://phabricator.wikimedia.org/T191156) (owner: 10Ladsgroup) [00:27:56] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [extensions/FlaggedRevs] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057031 (https://phabricator.wikimedia.org/T191156) (owner: 10Ladsgroup) [00:28:47] RESOLVED: HelmReleaseBadStatus: Helm release mw-api-ext/main on 
k8s@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [00:33:25] (03PS1) 10Zabe: WIP: Move defining sections to separate file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057034 [00:34:03] (03CR) 10CI reject: [V:04-1] WIP: Move defining sections to separate file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057034 (owner: 10Zabe) [00:36:54] (03PS2) 10Zabe: WIP: Move defining sections to separate file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057034 [00:37:23] (03Merged) 10jenkins-bot: Update UI classes and CSS for review notices [extensions/FlaggedRevs] (wmf/1.43.0-wmf.15) - 10https://gerrit.wikimedia.org/r/1057031 (https://phabricator.wikimedia.org/T191156) (owner: 10Ladsgroup) [00:37:51] !log ladsgroup@deploy1002 Started scap sync-world: Backport for [[gerrit:1057031|Update UI classes and CSS for review notices (T191156)]], [[gerrit:1057016|Add CSS class to watchlist pending notice (T191156)]] [00:37:56] T191156: Convert FlaggedRevisions to Codex - https://phabricator.wikimedia.org/T191156 [00:40:05] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1057031|Update UI classes and CSS for review notices (T191156)]], [[gerrit:1057016|Add CSS class to watchlist pending notice (T191156)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [00:42:51] !log ladsgroup@deploy1002 ladsgroup: Continuing with sync [00:47:41] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1057031|Update UI classes and CSS for review notices (T191156)]], [[gerrit:1057016|Add CSS class to watchlist pending notice (T191156)]] (duration: 09m 49s) [00:47:46] T191156: Convert FlaggedRevisions to Codex - https://phabricator.wikimedia.org/T191156 [00:54:53] deploys back to normal 👍 [01:07:09] (03PS1) 
10Zabe: Automatically set db section to s5 for new wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057037 [01:07:49] (03PS2) 10Zabe: Automatically set db section to s5 for new wiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057037 [01:17:30] (03PS3) 10Zabe: WIP: Move defining sections to separate file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057034 [01:54:21] FIRING: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:02:43] (03PS1) 10Jdlrobson: Preserve existing responsive skin behaviour for community members [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057041 [02:04:21] RESOLVED: SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [03:06:25] FIRING: [6x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:21:25] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240726T0600) [06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - 
https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues [06:56:52] !log continue rolling out "LVS-and-NS-service-ips" prefix-list rename to network device [06:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:57:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T367856)', diff saved to https://phabricator.wikimedia.org/P66933 and previous config saved to /var/cache/conftool/dbconfig/20240726-065747-marostegui.json [06:57:51] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240726T0700) [07:06:38] (03CR) 10Ayounsi: [C:03+2] Netbox 4 breaking change (choices is now in netbox) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1056972 (owner: 10Ayounsi) [07:07:35] (03Merged) 10jenkins-bot: Netbox 4 breaking change (choices is now in netbox) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1056972 (owner: 10Ayounsi) [07:12:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P66934 and previous config saved to /var/cache/conftool/dbconfig/20240726-071254-marostegui.json [07:28:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190', diff saved to https://phabricator.wikimedia.org/P66935 and previous config saved to /var/cache/conftool/dbconfig/20240726-072801-marostegui.json [07:43:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1190 (T367856)', diff saved to https://phabricator.wikimedia.org/P66936 and previous config saved to /var/cache/conftool/dbconfig/20240726-074308-marostegui.json [07:43:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [07:43:13] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [07:43:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1199.eqiad.wmnet with reason: Maintenance [07:43:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1199 (T367856)', diff saved to https://phabricator.wikimedia.org/P66937 and previous config saved to /var/cache/conftool/dbconfig/20240726-074330-marostegui.json [07:48:20] 06SRE, 06serviceops: Memcached, mcrouter in MediaWiki on Kubernetes - https://phabricator.wikimedia.org/T277711#10017109 (10jijiki) 05Open→03Resolved Closing since T346690 is done, daemonset 
is working [07:52:41] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: SmartNotHealthy (instance an-worker1085:9100) - https://phabricator.wikimedia.org/T371077 (10LSobanski) 03NEW [07:57:51] (03CR) 10Filippo Giunchedi: [C:03+1] "Nice! LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1056579 (https://phabricator.wikimedia.org/T366573) (owner: 10Andrea Denisse) [08:09:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T367856)', diff saved to https://phabricator.wikimedia.org/P66938 and previous config saved to /var/cache/conftool/dbconfig/20240726-080945-marostegui.json [08:09:50] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [08:16:27] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [08:16:48] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [08:18:23] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [08:18:25] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [08:21:40] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:24:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P66939 and previous config saved to /var/cache/conftool/dbconfig/20240726-082452-marostegui.json [08:25:52] !log jelto@cumin1002 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [08:32:01] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host 
reimage [08:34:26] 06SRE, 10Charts, 06serviceops, 10Shellbox: Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10017209 (10akosiaris) The situation is indeed known, see also T309772, T357950. Some efforts did happen to modernize the codebase, however, as far as... [08:34:31] 06SRE, 10Wikimedia-Mailing-lists: Create Mailing List: Wikidata for Wikimedia Projects (wikidata-4-wikimedia) - https://phabricator.wikimedia.org/T371078#10017212 (10Peachey88) [08:35:16] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [08:36:32] (03CR) 10Alexandros Kosiaris: [C:03+2] mesh: Patch faultinjection config stanza mistake [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056953 (owner: 10Alexandros Kosiaris) [08:38:58] (03Merged) 10jenkins-bot: mesh: Patch faultinjection config stanza mistake [deployment-charts] - 10https://gerrit.wikimedia.org/r/1056953 (owner: 10Alexandros Kosiaris) [08:39:49] FIRING: HelmReleaseBadStatus: Helm release growthbook/production on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=growthbook - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:40:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176', diff saved to https://phabricator.wikimedia.org/P66940 and previous config saved to /var/cache/conftool/dbconfig/20240726-083959-marostegui.json [08:42:40] (03CR) 10Alexandros Kosiaris: [C:03+2] "> Is the expectation that I deploy it myself? 
Tbh I would prefer to not deploy myself but I am also just wondering what the expected or "n" [puppet] - 10https://gerrit.wikimedia.org/r/1053400 (https://phabricator.wikimedia.org/T367014) (owner: 10Dzahn) [08:43:52] (03CR) 10Alexandros Kosiaris: [C:03+2] redirects.dat: delete integration.mediawiki.org [puppet] - 10https://gerrit.wikimedia.org/r/1054919 (https://phabricator.wikimedia.org/T361250) (owner: 10Dzahn) [08:52:13] (03CR) 10Jelto: [C:03+2] "I saw a few entries in /var/log/messages (mostly from public clouds) and triggered a few entries manually using curl. So logging works" [puppet] - 10https://gerrit.wikimedia.org/r/1056581 (https://phabricator.wikimedia.org/T366882) (owner: 10Dzahn) [08:52:36] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [08:52:48] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [08:54:49] RESOLVED: HelmReleaseBadStatus: Helm release growthbook/production on k8s-dse@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-dse&var-namespace=growthbook - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [08:55:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2176 (T367856)', diff saved to https://phabricator.wikimedia.org/P66941 and previous config saved to /var/cache/conftool/dbconfig/20240726-085507-marostegui.json [08:55:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 6:00:00 on db2188.codfw.wmnet with reason: Maintenance [08:55:12] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [08:55:18] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/developer-portal: sync [08:55:22] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 
6:00:00 on db2188.codfw.wmnet with reason: Maintenance [08:55:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2188 (T367856)', diff saved to https://phabricator.wikimedia.org/P66942 and previous config saved to /var/cache/conftool/dbconfig/20240726-085529-marostegui.json [08:55:34] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/developer-portal: sync [08:55:35] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/developer-portal: sync [08:55:54] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/developer-portal: sync [08:55:55] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/developer-portal: sync [08:56:02] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/developer-portal: sync [08:58:31] 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#10017292 (10Jelto) [09:01:31] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/linkrecommendation: sync [09:01:46] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: sync [09:01:47] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/linkrecommendation: sync [09:02:34] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: sync [09:02:35] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/linkrecommendation: sync [09:02:42] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/linkrecommendation: sync [09:05:42] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/machinetranslation: sync [09:06:01] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/echostore: sync [09:06:03] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/echostore: sync [09:06:04] !log 
akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/echostore: sync [09:06:09] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/sessionstore: sync [09:06:10] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/echostore: sync [09:06:11] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/echostore: sync [09:06:12] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/sessionstore: sync [09:06:13] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/sessionstore: sync [09:06:14] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/echostore: sync [09:06:19] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/sessionstore: sync [09:06:20] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/sessionstore: sync [09:07:41] 06SRE, 06collaboration-services, 06Traffic, 13Patch-For-Review, 10Release-Engineering-Team (Radar): implement anti-abuse features for GitLab (Move GitLab behind the CDN) - https://phabricator.wikimedia.org/T366882#10017310 (10Jelto) >>! In T366882#10012178, @Dzahn wrote: > gitlab1003 and gitlab1004 are u... 
[09:07:43] (03PS1) 10Elukey: sre.hosts.reimage: fix tftp feature [cookbooks] - 10https://gerrit.wikimedia.org/r/1057180 [09:09:02] !log akosiaris@deploy1003 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync [09:09:28] !log akosiaris@deploy1003 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync [09:09:29] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/recommendation-api: sync [09:09:56] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/recommendation-api: sync [09:09:57] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/recommendation-api: sync [09:10:04] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/recommendation-api: sync [09:11:25] FIRING: [10x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:15:41] (03PS1) 10Brouberol: growthbook: small fixes to the values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057183 (https://phabricator.wikimedia.org/T365839) [09:16:22] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/sessionstore: sync [09:17:34] (03CR) 10Btullis: [C:03+1] growthbook: small fixes to the values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057183 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [09:18:01] (03CR) 10Brouberol: [C:03+2] growthbook: small fixes to the values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057183 (https://phabricator.wikimedia.org/T365839) (owner: 10Brouberol) [09:20:02] akosiaris: running puppet on cumin to clear up httpbb following sep11 deploy [09:20:08] (03PS2) 10Elukey: sre.hosts.reimage: fix tftp feature [cookbooks] - 10https://gerrit.wikimedia.org/r/1057180 [09:21:08] !log akosiaris@deploy1003 helmfile [eqiad] DONE 
helmfile.d/services/machinetranslation: sync [09:21:09] Ah no, you just merged it and didn't deploy mw-on-k8s yet [09:21:10] !log akosiaris@deploy1003 helmfile [codfw] START helmfile.d/services/machinetranslation: sync [09:21:11] mb [09:21:17] !log akosiaris@deploy1003 helmfile [codfw] DONE helmfile.d/services/machinetranslation: sync [09:21:18] !log akosiaris@deploy1003 helmfile [staging] START helmfile.d/services/machinetranslation: sync [09:21:20] that's why it's failing [09:21:21] !log akosiaris@deploy1003 helmfile [staging] DONE helmfile.d/services/machinetranslation: sync [09:26:25] FIRING: [15x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:27:33] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host analytics1072.eqiad.wmnet [09:31:25] FIRING: [16x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:33:15] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host analytics1072.eqiad.wmnet [09:35:34] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host analytics1073.eqiad.wmnet [09:38:12] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host analytics1073.eqiad.wmnet [09:44:36] 06SRE, 06Infrastructure-Foundations: Port defs_from_etcd logic to nftables - https://phabricator.wikimedia.org/T348734#10017359 (10cmooney) >>! In T348734#9967318, @Dzahn wrote: > This is because `profile::firewall` pulls in confd if the firewall provider is set to nftables and `if $defs_from_etcd and $defs_fr... 
[09:51:00] (03CR) 10Alexandros Kosiaris: [C:03+2] tox: pin style dependencies to avoid CI failures [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1043748 (owner: 10Hashar) [09:53:21] (03Merged) 10jenkins-bot: tox: pin style dependencies to avoid CI failures [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/1043748 (owner: 10Hashar) [10:00:05] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1055492 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [10:00:36] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/growthbook: apply [10:00:48] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/growthbook: apply [10:00:49] (03CR) 10Jelto: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1056603 (https://phabricator.wikimedia.org/T370677) (owner: 10Dzahn) [10:01:50] (03PS1) 10Cathal Mooney: Move confd::file definition for requestctl ferm rules to case block [puppet] - 10https://gerrit.wikimedia.org/r/1057185 (https://phabricator.wikimedia.org/T348734) [10:03:14] !log elukey@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host sretest1001.eqiad.wmnet with OS bullseye [10:06:03] (03CR) 10Elukey: "Git blame reports https://gerrit.wikimedia.org/r/c/operations/puppet/+/981288" [puppet] - 10https://gerrit.wikimedia.org/r/1057185 (https://phabricator.wikimedia.org/T348734) (owner: 10Cathal Mooney) [10:06:18] (03PS1) 10Alexandros Kosiaris: Add a proper mailmap for my personal account [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057186 [10:07:23] (03CR) 10Alexandros Kosiaris: [C:03+2] Add a proper mailmap for my personal account [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057186 (owner: 10Alexandros Kosiaris) [10:07:54] (03Abandoned) 10Cathal Mooney: Move confd::file definition for requestctl ferm rules to case block [puppet] - 10https://gerrit.wikimedia.org/r/1057185 
(https://phabricator.wikimedia.org/T348734) (owner: 10Cathal Mooney) [10:07:59] (03Merged) 10jenkins-bot: Add a proper mailmap for my personal account [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057186 (owner: 10Alexandros Kosiaris) [10:12:14] (03PS1) 10Filippo Giunchedi: prometheus: remove absented resources [puppet] - 10https://gerrit.wikimedia.org/r/1057187 (https://phabricator.wikimedia.org/T371087) [10:12:15] (03PS1) 10Filippo Giunchedi: prometheus: clean up legacy parameters [puppet] - 10https://gerrit.wikimedia.org/r/1057188 (https://phabricator.wikimedia.org/T371087) [10:12:46] (03CR) 10CI reject: [V:04-1] prometheus: clean up legacy parameters [puppet] - 10https://gerrit.wikimedia.org/r/1057188 (https://phabricator.wikimedia.org/T371087) (owner: 10Filippo Giunchedi) [10:13:24] (03CR) 10Ayounsi: [C:03+1] sre.hosts.reimage: fix tftp feature [cookbooks] - 10https://gerrit.wikimedia.org/r/1057180 (owner: 10Elukey) [10:13:43] (03CR) 10Elukey: [C:03+2] sre.hosts.reimage: fix tftp feature [cookbooks] - 10https://gerrit.wikimedia.org/r/1057180 (owner: 10Elukey) [10:15:44] (03PS1) 10Elukey: sretest1001: disable defs_from_etcd_nft [puppet] - 10https://gerrit.wikimedia.org/r/1057189 [10:16:19] (03PS2) 10Filippo Giunchedi: prometheus: clean up legacy parameters [puppet] - 10https://gerrit.wikimedia.org/r/1057188 (https://phabricator.wikimedia.org/T371087) [10:17:16] 10ops-eqiad, 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on dumpsdata1007 - https://phabricator.wikimedia.org/T369829#10017476 (10BTullis) Hi @Jclark-ctr - Yes, this drive can be replaced at any time. Thanks. [10:19:55] 06SRE, 10Wikimedia-Mailing-lists: Create Mailing List: Wikidata for Wikimedia Projects (wikidata-4-wikimedia) - https://phabricator.wikimedia.org/T371078#10017479 (10Ladsgroup) a:03Ladsgroup Overall looks good to go. But name of the mailing list is basically impossible to change in mm3. Can you pick a better... 
[10:20:53] (03PS1) 10Jelto: gitlab: fix port definition in firewall:service [puppet] - 10https://gerrit.wikimedia.org/r/1057190 (https://phabricator.wikimedia.org/T366882) [10:21:24] (03PS1) 10Effie Mouzeli: mw-on-k8s: update latency expression [alerts] - 10https://gerrit.wikimedia.org/r/1057191 [10:21:51] 06SRE, 06Wikidata Integrations Team, 10Wikimedia-Mailing-lists: Create Mailing List: Wikidata for Wikimedia Projects (wikidata-4-wikimedia) - https://phabricator.wikimedia.org/T371078#10017515 (10Ladsgroup) [10:22:07] (03CR) 10Effie Mouzeli: "Example: https://grafana.wikimedia.org/goto/Um6mz_uSg?orgId=1" [alerts] - 10https://gerrit.wikimedia.org/r/1057191 (owner: 10Effie Mouzeli) [10:26:09] (03PS2) 10Alexandros Kosiaris: mediawiki-image-download: Drop to 66% [puppet] - 10https://gerrit.wikimedia.org/r/1039622 (https://phabricator.wikimedia.org/T366778) [10:28:16] (03PS1) 10Jelto: gitlab: fix port definition in firewall:service [puppet] - 10https://gerrit.wikimedia.org/r/1057190 (https://phabricator.wikimedia.org/T366882) [10:28:16] (03CR) 10Jelto: "@dzahn Do you think this fixes the puppet error on the gitlab test instance?" [puppet] - 10https://gerrit.wikimedia.org/r/1057190 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [10:29:17] (03CR) 10Clément Goubert: "Yeah I think you're right actually, I got confused by the fact that the `mwdebug` servers are part of the `testserver` pool in `conftool` " [puppet] - 10https://gerrit.wikimedia.org/r/1056889 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [10:29:54] (03CR) 10Alexandros Kosiaris: [C:03+2] mediawiki-image-download: Drop to 66% [puppet] - 10https://gerrit.wikimedia.org/r/1039622 (https://phabricator.wikimedia.org/T366778) (owner: 10Alexandros Kosiaris) [10:30:12] question: if I file a task for a production error, should I also move it to the current column of https://phabricator.wikimedia.org/project/board/1055/ directly, or leave the triage to someone else?
(whoever owns the board?) [10:30:32] (03CR) 10Clément Goubert: [C:03+1] P:mediawiki::php::restarts: fix no-LVS case [puppet] - 10https://gerrit.wikimedia.org/r/1057010 (owner: 10Scott French) [10:38:22] (03CR) 10Clément Goubert: [C:03+1] mw-on-k8s: update latency expression [alerts] - 10https://gerrit.wikimedia.org/r/1057191 (owner: 10Effie Mouzeli) [10:38:58] (03CR) 10Cathal Mooney: [C:03+1] sretest1001: disable defs_from_etcd_nft [puppet] - 10https://gerrit.wikimedia.org/r/1057189 (owner: 10Elukey) [10:40:53] !log akosiaris@deploy1003 Synchronized .mailmap: Testing a noop deploy from deploy1003 (duration: 20m 28s) [10:47:53] (03PS1) 10Effie Mouzeli: kubernetes-prod: add KubernetesContainerReachingMemoryLimit exception [alerts] - 10https://gerrit.wikimedia.org/r/1057192 [10:51:14] jelto@cumin1002 jelto: The backup on gitlab2002 is complete, ready to proceed with upgrade. [10:51:25] 06SRE, 06Infrastructure-Foundations: Port defs_from_etcd logic to nftables - https://phabricator.wikimedia.org/T348734#10017568 (10cmooney) Regarding the issue we stumbled on today on sretest1001, it seems that there is a problem with the current puppetization of the requestctl networks. Specifically networks ar... [10:59:12] (03CR) 10Clément Goubert: Deploy MetricsPlatform to beta cluster (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [11:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240726T0700) [11:00:04] eoghan, jelto, arnoldokoth, and mutante: #bothumor I � Unicode. All rise for GitLab version upgrades deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240726T1100).
[11:03:18] (03CR) 10Clément Goubert: [C:03+1] kubernetes-prod: add KubernetesContainerReachingMemoryLimit exception [alerts] - 10https://gerrit.wikimedia.org/r/1057192 (owner: 10Effie Mouzeli) [11:05:38] !log jelto@cumin1002 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: Upgrade GitLab to new version [11:06:25] FIRING: [16x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:06:43] (03CR) 10Alexandros Kosiaris: Deploy MetricsPlatform to beta cluster (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1046732 (https://phabricator.wikimedia.org/T366234) (owner: 10Clare Ming) [11:21:25] FIRING: [16x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:22:28] (03PS1) 10GergesShamon: Enable VisualEditor at Spanish Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057196 (https://phabricator.wikimedia.org/T355336) [11:23:07] (03CR) 10CI reject: [V:04-1] Enable VisualEditor at Spanish Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057196 (https://phabricator.wikimedia.org/T355336) (owner: 10GergesShamon) [11:31:25] FIRING: [12x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:39:16] 06SRE, 06Wikidata Integrations Team, 10Wikimedia-Mailing-lists: Create Mailing List: Wikidata for Wikimedia Projects (wikidata-4-wikimedia) 
- https://phabricator.wikimedia.org/T371078#10017660 (10Danny_Benjafield_WMDE) [11:40:02] 06SRE, 06Wikidata Integrations Team, 10Wikimedia-Mailing-lists: Create Mailing List: Wikidata for Wikimedia Projects (wikidata-4-wikimedia) - https://phabricator.wikimedia.org/T371078#10017661 (10Danny_Benjafield_WMDE) Name updated. Please go with: wikidata-for-wikimedia@lists.wikimedia.org [11:45:52] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host pc1017.eqiad.wmnet with OS bookworm [11:46:00] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10017666 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm [11:48:17] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on pc1017.eqiad.wmnet with reason: host reimage [11:49:35] 06SRE, 06serviceops: confd setup left without configuration doesn't stop confd - https://phabricator.wikimedia.org/T356296#10017678 (10cmooney) I'm a little confused about this one. We have //defs_from_etcd_nft// set to false by default in hiera for the firewall profile: ` hieradata/common/profile/firewall.ya...
[11:51:25] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on pc1017.eqiad.wmnet with reason: host reimage [11:52:05] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon1006.eqiad.wmnet with OS bullseye [11:52:07] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon1005.eqiad.wmnet with OS bullseye [11:52:19] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10017681 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephmon1006.eqiad.wmnet with OS bullseye [11:52:21] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10017682 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephmon1005.eqiad.wmnet with OS bullseye [11:55:01] (03PS2) 10GergesShamon: Enable VisualEditor at Spanish Wikiquote [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057196 (https://phabricator.wikimedia.org/T355336) [12:00:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host pc1017.eqiad.wmnet with OS bookworm [12:00:54] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10017689 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host pc1017.eqiad.wmnet with OS bookworm completed: - pc1017 (**PASS**)... 
[12:02:43] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Monday, July 29 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal-" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057196 (https://phabricator.wikimedia.org/T355336) (owner: 10GergesShamon) [12:14:15] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10017709 (10Jclark-ctr) [12:14:26] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10017710 (10Jclark-ctr) 05Open→03Resolved [12:36:19] (03PS4) 10Zabe: Move section mapping to separate file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057034 [12:37:20] (03CR) 10Elukey: [C:03+2] sretest1001: disable defs_from_etcd_nft [puppet] - 10https://gerrit.wikimedia.org/r/1057189 (owner: 10Elukey) [12:37:31] (03PS5) 10Zabe: Move section mapping to separate file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057034 [12:38:53] (03PS6) 10Zabe: Move section mapping to separate file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057034 [12:40:33] (03PS7) 10Zabe: Move section mapping to separate file [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1057034 [12:42:17] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host sretest1001.eqiad.wmnet with OS bullseye [12:42:41] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephmon1005.eqiad.wmnet with OS bullseye [12:42:51] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10017736 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephmon1005.eqiad.wmnet with OS bullseye executed with error... 
[12:45:14] (03PS1) 10Alexandros Kosiaris: maps: Add wikidata.pl and vikidia.org to whitelist [puppet] - 10https://gerrit.wikimedia.org/r/1057202 (https://phabricator.wikimedia.org/T344678) [12:45:38] (03CR) 10CI reject: [V:04-1] maps: Add wikidata.pl and vikidia.org to whitelist [puppet] - 10https://gerrit.wikimedia.org/r/1057202 (https://phabricator.wikimedia.org/T344678) (owner: 10Alexandros Kosiaris) [12:49:53] (03PS2) 10Alexandros Kosiaris: maps: Add wikidata.pl and vikidia.org to whitelist [puppet] - 10https://gerrit.wikimedia.org/r/1057202 (https://phabricator.wikimedia.org/T344678) [12:50:00] (03Abandoned) 10Alexandros Kosiaris: maps: Allow usage by vikidia.org [puppet] - 10https://gerrit.wikimedia.org/r/937463 (https://phabricator.wikimedia.org/T339102) (owner: 10Alexandros Kosiaris) [12:52:21] (03CR) 10Alexandros Kosiaris: [C:03+2] maps: Add wikidata.pl and vikidia.org to whitelist [puppet] - 10https://gerrit.wikimedia.org/r/1057202 (https://phabricator.wikimedia.org/T344678) (owner: 10Alexandros Kosiaris) [12:56:41] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cloudcephmon1006.eqiad.wmnet with OS bullseye [12:56:47] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10017767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephmon1006.eqiad.wmnet with OS bullseye executed with error... 
[12:58:44] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [13:02:50] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest1001.eqiad.wmnet with reason: host reimage [13:04:56] (03PS1) 10Klausman: hiera/manifest/partman: Add DSE node with GPU [puppet] - 10https://gerrit.wikimedia.org/r/1057205 (https://phabricator.wikimedia.org/T368978) [13:10:25] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations: Broadcom NICs with recent firmware fail to reimage - https://phabricator.wikimedia.org/T363576#10017772 (10elukey) >>! In T363576#10007492, @elukey wrote: > Next steps: > > - Immediate: I/F is going to add code to Spicerack and the reimage cookbo... [13:15:28] (03PS1) 10Ilias Sarantopoulos: ml-services: deploy aya-23-8B in exp-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057207 [13:17:02] (03PS2) 10Ilias Sarantopoulos: ml-services: deploy aya-23-8B in exp-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057207 [13:19:22] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest1001.eqiad.wmnet with OS bullseye [13:21:42] (03CR) 10Elukey: netbox.netbox-extra: trigger syncdatasource (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1056989 (https://phabricator.wikimedia.org/T336275) (owner: 10Ayounsi) [13:23:05] !log dcausse@deploy1002 Started deploy [airflow-dags/search@d09039f]: search: fix drop dailies and bump discolitycs to fix numpy & pyarrow version conflict [13:23:51] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@d09039f]: search: fix drop dailies and bump discolitycs to fix numpy & pyarrow version conflict (duration: 00m 45s) [13:25:34] (03PS1) 10Elukey: Revert "Move the dump_cloud_ip_ranges etcd upload to puppetserver" [puppet] - 10https://gerrit.wikimedia.org/r/1057210 [13:25:53] (03CR) 10CDanis: [C:03+1] admin: add dcops to the system adm 
POSIX group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [13:26:05] (03CR) 10JHathaway: [C:03+1] Revert "Move the dump_cloud_ip_ranges etcd upload to puppetserver" [puppet] - 10https://gerrit.wikimedia.org/r/1057210 (owner: 10Elukey) [13:26:12] (03CR) 10Dzahn: [C:03+1] gitlab: fix port definition in firewall:service [puppet] - 10https://gerrit.wikimedia.org/r/1057190 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [13:26:50] (03PS7) 10Elukey: admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) [13:28:09] (03CR) 10CI reject: [V:04-1] Revert "Move the dump_cloud_ip_ranges etcd upload to puppetserver" [puppet] - 10https://gerrit.wikimedia.org/r/1057210 (owner: 10Elukey) [13:34:03] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10017816 (10Papaul) [13:34:42] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10017817 (10elukey) Something very weird happened today: * Balthazar committed a change from puppetmaster1001's /srv/private,... 
[13:34:46] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10017818 (10Papaul) [13:35:56] (03PS2) 10Elukey: Revert "Move the dump_cloud_ip_ranges etcd upload to puppetserver" [puppet] - 10https://gerrit.wikimedia.org/r/1057210 [13:36:32] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10017819 (10Jhancock.wm) [13:41:39] (03CR) 10Elukey: [C:03+2] Revert "Move the dump_cloud_ip_ranges etcd upload to puppetserver" [puppet] - 10https://gerrit.wikimedia.org/r/1057210 (owner: 10Elukey) [13:42:23] !log move dump_cloud_ip_ranges's write to /srv/private capabilities back to puppetmaster1001 - T368023 [13:42:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:27] T368023: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023 [13:43:11] (03PS8) 10Elukey: admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) [13:44:32] (03PS9) 10Elukey: admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) [13:46:18] (03CR) 10Elukey: "rebased and moved user tappof from sre-admins to ops-limited." 
[puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [13:47:35] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3428/co" [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [13:49:35] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2240.codfw.wmnet with OS bookworm [13:49:41] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10017851 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2240.codfw.wmnet with OS bookworm [13:52:08] Hi [13:52:40] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [13:53:18] Does Task T355336 need to be reviewed by a Editing-team? [13:53:18] T355336: Enable the visual editor at Spanish Wikiquote - https://phabricator.wikimedia.org/T355336 [13:55:07] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2226 to codfw - jhancock@cumin2002" [13:56:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2226 to codfw - jhancock@cumin2002" [13:56:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:58:13] (03CR) 10CDanis: [C:03+1] admin: add dcops to the system adm POSIX group [puppet] - 10https://gerrit.wikimedia.org/r/1054894 (https://phabricator.wikimedia.org/T360356) (owner: 10Elukey) [13:59:04] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2226.mgmt.codfw.wmnet with reboot policy FORCED [14:00:26] 06SRE, 06Infrastructure-Foundations: Port defs_from_etcd logic to nftables - 
https://phabricator.wikimedia.org/T348734#10017879 (10CDanis) @Jelto @Dzahn FYI ^ about nft and requestctl support without Ferm. [14:03:43] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2240.codfw.wmnet with reason: host reimage [14:06:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2240.codfw.wmnet with reason: host reimage [14:07:28] !log dcausse@deploy1002 Started deploy [airflow-dags/search@fb00e94]: search: process_sparql_query_hourly tune the number of partitions to prevent OOM [14:07:49] !log dcausse@deploy1002 Finished deploy [airflow-dags/search@fb00e94]: search: process_sparql_query_hourly tune the number of partitions to prevent OOM (duration: 00m 21s) [14:13:59] (03CR) 10ScheduleDeploymentBot: "Scheduled for deployment in the [Tuesday, July 30 UTC afternoon backport window](https://wikitech.wikimedia.org/wiki/Deployments#deploycal" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1055633 (https://phabricator.wikimedia.org/T370605) (owner: 10XXBlackburnXx) [14:17:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2226.mgmt.codfw.wmnet with reboot policy FORCED [14:20:50] (03CR) 10Kevin Bazira: [C:03+1] ml-services: deploy aya-23-8B in exp-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057207 (owner: 10Ilias Sarantopoulos) [14:23:29] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:24:02] (03CR) 10AikoChou: [C:03+1] ml-services: deploy aya-23-8B in exp-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057207 (owner: 10Ilias Sarantopoulos) [14:30:11] (03CR) 10Scott French: [C:03+1] "Nice find! 
It looks like panels in the following dashboards need updated:" [alerts] - 10https://gerrit.wikimedia.org/r/1057191 (owner: 10Effie Mouzeli) [14:34:01] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [14:34:02] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2240.codfw.wmnet with OS bookworm [14:34:12] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10018035 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2240.codfw.wmnet with OS bookworm completed: - db2240 (**PASS*... [14:34:16] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review: Move the private Puppet repository to puppetserver1001 - https://phabricator.wikimedia.org/T368023#10018036 (10elukey) When I ran puppet on puppetserver1001, the issue appeared again (basically it was the same change applied... [14:36:19] (03CR) 10Dzahn: "thank you for deploying!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1054919 (https://phabricator.wikimedia.org/T361250) (owner: 10Dzahn) [14:39:21] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:46] (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: deploy aya-23-8B in exp-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057207 (owner: 10Ilias Sarantopoulos) [14:40:41] (03Merged) 10jenkins-bot: ml-services: deploy aya-23-8B in exp-staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057207 (owner: 10Ilias Sarantopoulos) [14:41:29] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2226'] [14:41:38] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['db2226'] [14:41:53] !log jhancock@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['db2226'] [14:42:04] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['db2226'] [14:42:36] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:43:56] (03PS1) 10Milimetric: analytics.wikimedia.org: improve caching and redirects [puppet] - 10https://gerrit.wikimedia.org/r/1057223 [14:44:57] 10ops-eqiad, 06DC-Ops: PDU sensor over limit - https://phabricator.wikimedia.org/T371100 (10phaultfinder) 03NEW [14:48:18] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [14:48:36] (03CR) 10Scott French: "Sounds good. I'll merge that one shortly." 
[puppet] - 10https://gerrit.wikimedia.org/r/1056889 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [14:50:44] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2227 to codfw - jhancock@cumin2002" [14:51:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2227 to codfw - jhancock@cumin2002" [14:51:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:53:05] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2227.mgmt.codfw.wmnet with reboot policy FORCED [14:59:21] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:09:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2227.mgmt.codfw.wmnet with reboot policy FORCED [15:09:35] (03PS3) 10Scott French: P:mediawiki::php::restarts: fix no-LVS case [puppet] - 10https://gerrit.wikimedia.org/r/1057010 (https://phabricator.wikimedia.org/T367949) [15:11:59] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2227.codfw.wmnet with OS bookworm [15:12:07] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10018220 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2227.codfw.wmnet with OS bookworm [15:12:53] (03CR) 10JHathaway: [V:03+1 C:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet7-compiler-node/3429/console" [puppet] - 
10https://gerrit.wikimedia.org/r/1056508 (https://phabricator.wikimedia.org/T368023) (owner: 10Elukey) [15:14:59] (03PS1) 10Ssingh: Release 9.2.5-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1057231 (https://phabricator.wikimedia.org/T339134) [15:15:05] (03CR) 10CI reject: [V:04-1] Release 9.2.5-1wm1 [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1057231 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [15:15:31] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10018248 (10Papaul) [15:16:26] (03CR) 10Scott French: [C:03+2] P:mediawiki::php::restarts: fix no-LVS case [puppet] - 10https://gerrit.wikimedia.org/r/1057010 (https://phabricator.wikimedia.org/T367949) (owner: 10Scott French) [15:18:01] (03CR) 10Ssingh: "Failure is because this repository was archived. There is a successful build with this patch on buildhost." [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1057231 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [15:26:25] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:31:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance [15:31:38] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1163.eqiad.wmnet with reason: Maintenance [15:31:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1163 (T352010)', diff saved to https://phabricator.wikimedia.org/P66945 and previous config saved to /var/cache/conftool/dbconfig/20240726-153145-ladsgroup.json [15:31:50] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 
[15:36:25] FIRING: [6x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:41:22] (03CR) 10Ssingh: "dpkg-deb: building package 'trafficserver' in '../trafficserver_9.2.5-1wm1_amd64.deb'. [/var/cache/pbuilder/result/bullseye-amd64/]" [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/1057231 (https://phabricator.wikimedia.org/T339134) (owner: 10Ssingh) [15:54:29] (03CR) 10Scott French: "Merged the change and poked the restart service on all the mwdebug hosts, which should clear the alerts." [puppet] - 10https://gerrit.wikimedia.org/r/1056889 (https://phabricator.wikimedia.org/T367949) (owner: 10Clément Goubert) [15:55:19] !log mfossati@deploy1002 Started deploy [airflow-dags/platform_eng@845502d]: (no justification provided) [15:55:57] !log mfossati@deploy1002 Finished deploy [airflow-dags/platform_eng@845502d]: (no justification provided) (duration: 00m 37s) [15:56:25] FIRING: [5x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:00:54] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack, 13Patch-For-Review: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#10018565 (10elukey) Next steps: * Add mac address field to https://netbox.wikimedia.org/extras/scripts/provision_ser... 
[16:04:39] (03CR) 10Andrea Denisse: [C:03+2] burrow: Create a runtime directory in the service definition [puppet] - 10https://gerrit.wikimedia.org/r/1056579 (https://phabricator.wikimedia.org/T366573) (owner: 10Andrea Denisse) [16:08:03] (03PS1) 10Clare Ming: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057235 (https://phabricator.wikimedia.org/T369856) [16:09:25] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057235 (https://phabricator.wikimedia.org/T369856) (owner: 10Clare Ming) [16:11:21] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057235 (https://phabricator.wikimedia.org/T369856) (owner: 10Clare Ming) [16:13:11] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10018651 (10Jhancock.wm) [16:20:54] !log jhancock@cumin2002 START - Cookbook sre.dns.netbox [16:21:30] FIRING: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:23:21] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2229 to codfw - jhancock@cumin2002" [16:24:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: adding db2229 to codfw - jhancock@cumin2002" [16:24:26] !log jhancock@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:25:29] !log 
jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2228.mgmt.codfw.wmnet with reboot policy FORCED [16:26:30] RESOLVED: ProbeDown: Service wdqs2009:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip6) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2009:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:26:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2229.mgmt.codfw.wmnet with reboot policy FORCED [16:31:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2230.mgmt.codfw.wmnet with reboot policy FORCED [16:32:30] (03PS1) 10Andrea Denisse: burrow: Ensure burrow's configuration stores pidfiles correctly [puppet] - 10https://gerrit.wikimedia.org/r/1057239 (https://phabricator.wikimedia.org/T366573) [16:32:37] (03CR) 10Andrea Denisse: [C:03+2] burrow: Ensure burrow's configuration stores pidfiles correctly [puppet] - 10https://gerrit.wikimedia.org/r/1057239 (https://phabricator.wikimedia.org/T366573) (owner: 10Andrea Denisse) [16:33:38] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2231.mgmt.codfw.wmnet with reboot policy FORCED [16:34:56] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2232.mgmt.codfw.wmnet with reboot policy FORCED [16:35:54] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2233.mgmt.codfw.wmnet with reboot policy FORCED [16:36:48] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2228.mgmt.codfw.wmnet with reboot policy FORCED [16:37:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2234.mgmt.codfw.wmnet with reboot policy FORCED [16:38:03] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2229.mgmt.codfw.wmnet with reboot policy FORCED [16:38:16] (03PS1) 10Andrea 
Denisse: Revert "burrow: Create a runtime directory in the service definition" [puppet] - 10https://gerrit.wikimedia.org/r/1057240 [16:38:29] (03PS1) 10Andrea Denisse: Revert "burrow: Ensure burrow's configuration stores pidfiles correctly" [puppet] - 10https://gerrit.wikimedia.org/r/1057241 [16:38:36] (03CR) 10Andrea Denisse: [C:03+2] Revert "burrow: Create a runtime directory in the service definition" [puppet] - 10https://gerrit.wikimedia.org/r/1057240 (owner: 10Andrea Denisse) [16:38:40] (03CR) 10Andrea Denisse: [V:03+2 C:03+2] Revert "burrow: Create a runtime directory in the service definition" [puppet] - 10https://gerrit.wikimedia.org/r/1057240 (owner: 10Andrea Denisse) [16:38:51] (03CR) 10Andrea Denisse: [C:03+2] Revert "burrow: Ensure burrow's configuration stores pidfiles correctly" [puppet] - 10https://gerrit.wikimedia.org/r/1057241 (owner: 10Andrea Denisse) [16:38:53] (03CR) 10Andrea Denisse: [V:03+2 C:03+2] Revert "burrow: Ensure burrow's configuration stores pidfiles correctly" [puppet] - 10https://gerrit.wikimedia.org/r/1057241 (owner: 10Andrea Denisse) [16:38:55] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2235.mgmt.codfw.wmnet with reboot policy FORCED [16:39:01] (03PS1) 10Clare Ming: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057242 (https://phabricator.wikimedia.org/T369856) [16:40:11] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2236.mgmt.codfw.wmnet with reboot policy FORCED [16:41:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2237.mgmt.codfw.wmnet with reboot policy FORCED [16:41:39] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2230.mgmt.codfw.wmnet with reboot policy FORCED [16:42:18] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2238.mgmt.codfw.wmnet with reboot policy FORCED [16:43:42] !log 
jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2239.mgmt.codfw.wmnet with reboot policy FORCED [16:44:31] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2231.mgmt.codfw.wmnet with reboot policy FORCED [16:45:07] (03CR) 10Scott French: [C:03+1] "All done. Opted for a `LOCAL_(mediawiki|mw).*` to keep it fairly precise but still allow for historical data." [alerts] - 10https://gerrit.wikimedia.org/r/1057191 (owner: 10Effie Mouzeli) [16:45:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2232.mgmt.codfw.wmnet with reboot policy FORCED [16:46:45] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2233.mgmt.codfw.wmnet with reboot policy FORCED [16:47:42] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2234.mgmt.codfw.wmnet with reboot policy FORCED [16:50:16] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2235.mgmt.codfw.wmnet with reboot policy FORCED [16:51:08] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2236.mgmt.codfw.wmnet with reboot policy FORCED [16:51:54] (03CR) 10Santiago Faci: [C:03+2] Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1057242 (https://phabricator.wikimedia.org/T369856) (owner: 10Clare Ming) [16:52:04] !log cjming@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [16:52:10] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2237.mgmt.codfw.wmnet with reboot policy FORCED [16:52:24] !log cjming@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [16:52:47] (03Merged) 10jenkins-bot: Metrics Platform Instrument Configuration: Deploying to production [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1057242 (https://phabricator.wikimedia.org/T369856) (owner: 10Clare Ming) [16:53:06] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q1:rack/setup/install wikikube-worker1240 to wikikube-worker1304 - https://phabricator.wikimedia.org/T369743#10018850 (10VRiley-WMF) [16:53:50] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2238.mgmt.codfw.wmnet with reboot policy FORCED [16:54:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2239.mgmt.codfw.wmnet with reboot policy FORCED [17:16:18] !log cjming@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic: apply [17:16:32] !log cjming@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic: apply [17:19:18] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10018917 (10Jhancock.wm) [17:33:48] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon1006.eqiad.wmnet with OS bullseye [17:33:50] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host cloudcephmon1005.eqiad.wmnet with OS bullseye [17:33:55] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10018966 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephmon1006.eqiad.wmnet with OS bullseye [17:33:57] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10018967 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host cloudcephmon1005.eqiad.wmnet with OS bullseye [17:35:40] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon1006.eqiad.wmnet with reason: host reimage [17:35:48] !log jclark@cumin1002 START - 
Cookbook sre.hosts.downtime for 2:00:00 on cloudcephmon1005.eqiad.wmnet with reason: host reimage [17:38:43] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon1006.eqiad.wmnet with reason: host reimage [17:41:06] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cloudcephmon1005.eqiad.wmnet with reason: host reimage [17:52:39] 10ops-eqiad, 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06DC-Ops: Degraded RAID on dumpsdata1007 - https://phabricator.wikimedia.org/T369829#10019010 (10Jclark-ctr) 05Open→03Resolved Replaced Failed Drive [17:53:29] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:56:09] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [17:56:10] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon1006.eqiad.wmnet with OS bullseye [17:56:17] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10019018 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephmon1006.eqiad.wmnet with OS bullseye completed: - cloudc... 
[17:57:49] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:02:13] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jclark@cumin1002" [18:02:15] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cloudcephmon1005.eqiad.wmnet with OS bullseye [18:02:23] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install pc1017 - https://phabricator.wikimedia.org/T369661#10019023 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host cloudcephmon1005.eqiad.wmnet with OS bullseye completed: - cloudc... [18:03:42] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10019024 (10Jclark-ctr) [18:03:57] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#10019025 (10Jclark-ctr) 05Open→03Resolved [18:17:15] 06SRE, 10Charts, 06serviceops, 10Shellbox: Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10019055 (10Legoktm) Unless it's too slow performance wise, I do think using Shellbox is the easiest path to production in that it's already compliant... 
[18:28:49] (03PS1) 10Dzahn: site: add new hardware gerrit2003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1057253 (https://phabricator.wikimedia.org/T369670) [18:29:19] (03CR) 10CI reject: [V:04-1] site: add new hardware gerrit2003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1057253 (https://phabricator.wikimedia.org/T369670) (owner: 10Dzahn) [18:32:10] (03PS2) 10Dzahn: site: add new hardware gerrit2003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1057253 (https://phabricator.wikimedia.org/T369670) [18:32:31] (03CR) 10Dzahn: "V-1 from CI but the word "error" isn't even in output?" [puppet] - 10https://gerrit.wikimedia.org/r/1057253 (https://phabricator.wikimedia.org/T369670) (owner: 10Dzahn) [18:32:35] (03CR) 10CI reject: [V:04-1] site: add new hardware gerrit2003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1057253 (https://phabricator.wikimedia.org/T369670) (owner: 10Dzahn) [18:34:04] (03PS3) 10Dzahn: site: add new hardware gerrit2003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1057253 (https://phabricator.wikimedia.org/T369670) [18:37:28] (03CR) 10Dzahn: [C:03+2] site: add new hardware gerrit2003 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1057253 (https://phabricator.wikimedia.org/T369670) (owner: 10Dzahn) [18:38:28] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10019113 (10Dzahn) [18:39:47] 10ops-codfw, 06SRE, 06collaboration-services, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install gerrit2003 - https://phabricator.wikimedia.org/T369670#10019115 (10Dzahn) >>! In T369670#10008996, @Jhancock.wm wrote: > @Dzahn could you update the puppet repo for us when you have a moment? thanks in advan... 
[18:44:02] (03PS1) 10Dzahn: site: add new hardware gerrit1004 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1057254 (https://phabricator.wikimedia.org/T369671) [18:44:17] (03CR) 10CI reject: [V:04-1] site: add new hardware gerrit1004 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1057254 (https://phabricator.wikimedia.org/T369671) (owner: 10Dzahn) [18:45:42] (03PS2) 10Dzahn: site: add new hardware gerrit1004 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1057254 (https://phabricator.wikimedia.org/T369671) [18:48:29] (03CR) 10Dzahn: [C:03+2] site: add new hardware gerrit1004 with insetup role [puppet] - 10https://gerrit.wikimedia.org/r/1057254 (https://phabricator.wikimedia.org/T369671) (owner: 10Dzahn) [18:51:01] (03PS1) 10Scott French: switchdc: mediawiki cache warmup now targets k8s [cookbooks] - 10https://gerrit.wikimedia.org/r/1057255 (https://phabricator.wikimedia.org/T369921) [18:52:19] !log [deploy1002:~] $ echo 'https://sep11.wikipedia.org' | mwscript purgeList.php --wiki=aawiki - T367014 [18:52:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:24] T367014: Change redirect target of sep11.wikipedia.org - https://phabricator.wikimedia.org/T367014 [18:56:53] (03PS1) 10Andrea Denisse: burrow: Create a runtime directory in the service definition [puppet] - 10https://gerrit.wikimedia.org/r/1057256 (https://phabricator.wikimedia.org/T366573) [18:56:53] (03CR) 10Andrea Denisse: "The previous patch was missing a change in Burrow's configuration to specify where to store pidfiles. This commit fixes that issue." 
[puppet] - 10https://gerrit.wikimedia.org/r/1057256 (https://phabricator.wikimedia.org/T366573) (owner: 10Andrea Denisse) [18:57:02] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install gerrit1004 - https://phabricator.wikimedia.org/T369671#10019185 (10Dzahn) [18:57:09] (03CR) 10Andrea Denisse: [C:03+2] burrow: Create a runtime directory in the service definition [puppet] - 10https://gerrit.wikimedia.org/r/1057256 (https://phabricator.wikimedia.org/T366573) (owner: 10Andrea Denisse) [18:59:30] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install frqueue2003, pay-lb2001, pay-lb2002 - https://phabricator.wikimedia.org/T369566#10019218 (10Dwisehaupt) Awaiting deployment of pfw changes to add the new mgmt subnet to pfw config (T371137). Then we can build the hosts. [19:00:32] (03PS1) 10Dzahn: installserver: add gerrit1004 to partman regex [puppet] - 10https://gerrit.wikimedia.org/r/1057257 (https://phabricator.wikimedia.org/T369671) [19:01:12] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install gerrit1004 - https://phabricator.wikimedia.org/T369671#10019232 (10Dzahn) [19:01:36] (03CR) 10Dzahn: [C:03+2] installserver: add gerrit1004 to partman regex [puppet] - 10https://gerrit.wikimedia.org/r/1057257 (https://phabricator.wikimedia.org/T369671) (owner: 10Dzahn) [19:06:30] 10ops-eqiad, 06SRE, 06collaboration-services, 06DC-Ops, 13Patch-For-Review: Q1:rack/setup/install gerrit1004 - https://phabricator.wikimedia.org/T369671#10019270 (10Dzahn) @Jclark-ctr Sorry for the delay. Done! Added to site.pp and installserver/partman. [19:11:12] (03CR) 10Scott French: "Thanks for the review, Daniel! Adding Reuven for thoughts as well." 
[puppet] - 10https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [19:23:34] (03CR) 10RLazarus: [C:03+1] deployment_server: install the cache warmup script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1055999 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [19:29:27] (03CR) 10Scott French: "@rcoccioli@wikimedia.org: This goes the route of using the same idiom as other cookbooks for remote execution on the active deployment hos" [cookbooks] - 10https://gerrit.wikimedia.org/r/1057255 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [19:45:06] 06SRE, 06Infrastructure-Foundations, 10Mail: postfix mx puppetry - https://phabricator.wikimedia.org/T325395#10019371 (10jhathaway) [19:45:50] (03PS1) 10JHathaway: postfix: prometheus ops config for mx-in boxes [puppet] - 10https://gerrit.wikimedia.org/r/1057260 (https://phabricator.wikimedia.org/T325395) [19:46:22] (03PS2) 10JHathaway: postfix: prometheus ops config for mx-in boxes [puppet] - 10https://gerrit.wikimedia.org/r/1057260 (https://phabricator.wikimedia.org/T325395) [19:46:32] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1057260 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [19:51:36] (03CR) 10JHathaway: [C:03+2] postfix: prometheus ops config for mx-in boxes [puppet] - 10https://gerrit.wikimedia.org/r/1057260 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [19:55:57] (03CR) 10RLazarus: [C:03+1] switchdc: mediawiki cache warmup now targets k8s (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1057255 (https://phabricator.wikimedia.org/T369921) (owner: 10Scott French) [19:56:40] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - 
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:03:45] (03PS2) 10JHathaway: remove mx{1001,2001) as MX servers [dns] - 10https://gerrit.wikimedia.org/r/1057020 (https://phabricator.wikimedia.org/T325409) [20:12:31] (03CR) 10JHathaway: [C:03+2] remove mx{1001,2001) as MX servers [dns] - 10https://gerrit.wikimedia.org/r/1057020 (https://phabricator.wikimedia.org/T325409) (owner: 10JHathaway) [20:29:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 12.38% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:29:30] 06SRE, 06serviceops: confd setup left without configuration doesn't stop confd - https://phabricator.wikimedia.org/T356296#10019459 (10Dzahn) @cmooney In `profile::firewall` there is a `if $defs_from_etcd and $defs_from_etcd_nft`. So if both are true that installs `confd::file { '/etc/nftables/sets/requestct... [20:33:49] 06SRE, 06serviceops: confd setup left without configuration doesn't stop confd - https://phabricator.wikimedia.org/T356296#10019478 (10Dzahn) Yea, so this: ` if $defs_from_etcd { confd::file { '/etc/ferm/conf.d/00_defs_requestctl': ensure => stdlib::ensure($provider == 'ferm'... 
[20:36:15] FIRING: MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-int - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:39:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 9.753% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [20:40:51] (03PS1) 10Dzahn: firewal: if provider is nft and not pulling requestctl, remove confd [puppet] - 10https://gerrit.wikimedia.org/r/1057264 [20:41:15] RESOLVED: [6x] MediaWikiHighErrorRate: Elevated rate of MediaWiki errors - kube-mw-api-ext - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:44:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2239.codfw.wmnet with OS bookworm [20:44:17] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019508 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2239.codfw.wmnet with OS bookworm [20:50:56] (03PS2) 10Dzahn: firewal: if provider is nft and not pulling requestctl, remove confd [puppet] - 10https://gerrit.wikimedia.org/r/1057264 (https://phabricator.wikimedia.org/T356296) [20:53:24] (03PS3) 10Dzahn: firewall: if provider is nft and not pulling requestctl, remove confd [puppet] - 10https://gerrit.wikimedia.org/r/1057264 (https://phabricator.wikimedia.org/T356296) [20:54:53] 06SRE, 06serviceops, 13Patch-For-Review: confd setup left 
without configuration doesn't stop confd - https://phabricator.wikimedia.org/T356296#10019540 (10Dzahn) This is an attempt to fix it per logic "**if the provider is nft and we do NOT pull requestctl data.. THEN ... remove confd**". https://gerrit.wik... [21:03:42] (03PS2) 10Dzahn: gerrit: use list of replicas from hiera again, don't do puppet DB lookup [puppet] - 10https://gerrit.wikimedia.org/r/1056998 [21:04:17] (03CR) 10CI reject: [V:04-1] gerrit: use list of replicas from hiera again, don't do puppet DB lookup [puppet] - 10https://gerrit.wikimedia.org/r/1056998 (owner: 10Dzahn) [21:04:28] (03CR) 10Dzahn: "This would let us compile and deploy unrelated changes such as Ic16199cda82fca1." [puppet] - 10https://gerrit.wikimedia.org/r/1056998 (owner: 10Dzahn) [21:06:09] (03CR) 10Dzahn: [C:03+2] "yea, can't reproduce the issue so far.." [puppet] - 10https://gerrit.wikimedia.org/r/988001 (owner: 10EpicPupper) [21:07:28] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/1056998" [puppet] - 10https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) (owner: 10EoghanGaffney) [21:07:46] (03CR) 10Dzahn: [C:03+2] gitlab: fix port definition in firewall:service [puppet] - 10https://gerrit.wikimedia.org/r/1057190 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [21:14:24] (03CR) 10Dzahn: [C:03+2] "This change is correct and needed but I think it doesn't actually explain the error we see. If it was just this we would get an error abou" [puppet] - 10https://gerrit.wikimedia.org/r/1057190 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [21:32:03] (03CR) 10Dzahn: [C:03+2] "I have tried to find out why ferm code is still pulled in but only in cloud but I didn't find the reason yet. 
This change doesn't fix that" [puppet] - 10https://gerrit.wikimedia.org/r/1057190 (https://phabricator.wikimedia.org/T366882) (owner: 10Jelto) [21:40:25] (03PS1) 10Dzahn: planet: disable wikimedia.pt feed causing update errors [puppet] - 10https://gerrit.wikimedia.org/r/1057270 [21:41:31] (03PS2) 10Dzahn: planet: disable wikimedia.pt feed causing update errors [puppet] - 10https://gerrit.wikimedia.org/r/1057270 [21:41:55] (03CR) 10CI reject: [V:04-1] planet: disable wikimedia.pt feed causing update errors [puppet] - 10https://gerrit.wikimedia.org/r/1057270 (owner: 10Dzahn) [21:42:54] (03PS3) 10Dzahn: planet: disable wikimedia.pt feed causing update errors [puppet] - 10https://gerrit.wikimedia.org/r/1057270 [21:44:37] (03CR) 10Dzahn: [C:03+2] planet: disable wikimedia.pt feed causing update errors [puppet] - 10https://gerrit.wikimedia.org/r/1057270 (owner: 10Dzahn) [21:45:40] (03CR) 10Dzahn: [C:03+2] "nevermind, it does happen again and I could reproduce it:" [puppet] - 10https://gerrit.wikimedia.org/r/988001 (owner: 10EpicPupper) [22:26:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:35:13] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host db2239.codfw.wmnet with OS bookworm [22:35:25] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019826 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2239.codfw.wmnet with OS bookworm executed with errors: - db22... 
[22:35:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2239.codfw.wmnet with OS bookworm [22:35:54] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019827 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2239.codfw.wmnet with OS bookworm [22:44:43] 06SRE, 10fundraising-tech-ops: Q1:rack/setup/install frqueue2003, pay-lb2001, pay-lb2002 - https://phabricator.wikimedia.org/T369566#10019829 (10Dwisehaupt) iptables changes applied for frqueue host. No iptables changes needed for pay-lb hosts. [22:50:22] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2239.codfw.wmnet with reason: host reimage [22:50:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T367856)', diff saved to https://phabricator.wikimedia.org/P66946 and previous config saved to /var/cache/conftool/dbconfig/20240726-225058-marostegui.json [22:51:03] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [22:52:58] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2239.codfw.wmnet with reason: host reimage [23:02:08] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2238.codfw.wmnet with OS bookworm [23:02:22] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019838 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2238.codfw.wmnet with OS bookworm [23:02:24] 06SRE, 10Charts, 06serviceops, 10Shellbox: Figure out how a shellbox instance for the Chart extension would work - https://phabricator.wikimedia.org/T370739#10019839 (10Catrope) @akosiaris I'm trying to figure out how we should proceed based on your comment. 
Should we develop a service based on (an up-to-d... [23:06:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P66947 and previous config saved to /var/cache/conftool/dbconfig/20240726-230605-marostegui.json [23:09:36] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:10:59] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:11:00] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2239.codfw.wmnet with OS bookworm [23:11:07] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019848 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2239.codfw.wmnet with OS bookworm completed: - db2239 (**PASS*... 
[23:15:49] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2238.codfw.wmnet with reason: host reimage [23:18:50] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2238.codfw.wmnet with reason: host reimage [23:21:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199', diff saved to https://phabricator.wikimedia.org/P66948 and previous config saved to /var/cache/conftool/dbconfig/20240726-232112-marostegui.json [23:21:25] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:28:04] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2237.codfw.wmnet with OS bookworm [23:28:17] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019875 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2237.codfw.wmnet with OS bookworm [23:34:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163 (T352010)', diff saved to https://phabricator.wikimedia.org/P66949 and previous config saved to /var/cache/conftool/dbconfig/20240726-233454-ladsgroup.json [23:35:00] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:36:02] !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:36:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1199 (T367856)', diff saved to https://phabricator.wikimedia.org/P66950 and previous config saved to 
/var/cache/conftool/dbconfig/20240726-233619-marostegui.json [23:36:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [23:36:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1221.eqiad.wmnet with reason: Maintenance [23:36:24] T367856: Cleanup revision table schema - https://phabricator.wikimedia.org/T367856 [23:36:26] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [23:36:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on an-redacteddb1001.eqiad.wmnet,clouddb[1015,1019].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [23:36:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1221 (T367856)', diff saved to https://phabricator.wikimedia.org/P66951 and previous config saved to /var/cache/conftool/dbconfig/20240726-233648-marostegui.json [23:38:15] !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002" [23:38:16] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2238.codfw.wmnet with OS bookworm [23:38:25] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019885 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host db2238.codfw.wmnet with OS bookworm completed: - db2238 (**PASS*... 
[23:38:44] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1057305 [23:38:44] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1057305 (owner: 10TrainBranchBot) [23:41:20] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019887 (10Papaul) [23:42:27] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2237.codfw.wmnet with reason: host reimage [23:44:51] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2237.codfw.wmnet with reason: host reimage [23:46:50] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host db2236.codfw.wmnet with OS bookworm [23:47:02] 10ops-codfw, 06SRE, 06Data-Persistence, 06DBA, 06DC-Ops: Q1:rack/setup/install db22[21-40] - https://phabricator.wikimedia.org/T369654#10019888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host db2236.codfw.wmnet with OS bookworm [23:50:02] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1163', diff saved to https://phabricator.wikimedia.org/P66952 and previous config saved to /var/cache/conftool/dbconfig/20240726-235001-ladsgroup.json