[00:00:36] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [00:00:49] PROBLEM - Disk space on mw2281 is CRITICAL: DISK CRITICAL - free space: / 1556 MB (1% inode=98%): /tmp 1556 MB (1% inode=98%): /var/tmp 1556 MB (1% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2281&var-datasource=codfw+prometheus/ops [00:00:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2202.codfw.wmnet with OS bookworm [00:01:06] (03CR) 10RLazarus: [C: 03+2] k8s-controller-sidecars: Add the other missing namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006606 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [00:01:09] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9582136 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2202.codfw.wmnet with OS bookworm completed: - db2202 (**PASS**) -... [00:01:49] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [00:01:56] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2199.codfw.wmnet with OS bookworm [00:02:05] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9582137 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2199.codfw.wmnet with OS bookworm completed: - db2199 (**PASS**) -... 
[00:02:25] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host contint1003.eqiad.wmnet with OS bullseye [00:02:25] !log dzahn@cumin1002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host contint1003.eqiad.wmnet [00:02:33] 06SRE, 10Continuous-Integration-Infrastructure, 06collaboration-services, 10vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9582138 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host contint1003.eqiad.wmnet with OS bullse... [00:03:31] (03Merged) 10jenkins-bot: k8s-controller-sidecars: Add the other missing namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006606 (https://phabricator.wikimedia.org/T348284) (owner: 10RLazarus) [00:06:24] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 15:00:00 on wdqs1011.eqiad.wmnet with reason: T355617 [00:06:37] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [00:06:39] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 15:00:00 on wdqs1011.eqiad.wmnet with reason: T355617 [00:07:57] !log rzl@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [00:08:02] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:08:10] !log rzl@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [00:08:12] (03PS1) 10Dzahn: site: add ci role to contint1003 [puppet] - 10https://gerrit.wikimedia.org/r/1007017 (https://phabricator.wikimedia.org/T358237) [00:08:26] !log rzl@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [00:08:39] !log rzl@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. 
[03:10:39] if I'd gotten to pick, I wouldn't have scheduled an incident in which you were the subject matter expert, just to put you on the spot and see how you do, BUT, given the opportunity, nailed it [03:12:57] !log slyngshede@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host idp-test1003.wikimedia.org with OS bookworm [03:12:58] !log slyngshede@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host idp-test1003.wikimedia.org [03:27:02] RECOVERY - Disk space on mw2278 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2278&var-datasource=codfw+prometheus/ops [03:29:06] https://phabricator.wikimedia.org/T358636 for the etcdmirror issue we saw this evening [04:17:02] PROBLEM - Disk space on mw2278 is CRITICAL: DISK CRITICAL - free space: / 3097 MB (2% inode=98%): /tmp 3097 MB (2% inode=98%): /var/tmp 3097 MB (2% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2278&var-datasource=codfw+prometheus/ops [04:17:25] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:17:56] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [04:18:02] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:18:30] (ProbeDown) firing: (2) 
Service wdqs2008:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [04:21:46] ha, just for the record -- that mystery etcd-mirror restart at 02:07:23 was *not* somebody messing around and touching stuff without speaking up, just puppet happening to run at the instant we were talking about it! as of course was the next restart exactly 30 minutes later (that I didn't notice because I wasn't tailing the log at the time) [04:23:02] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [04:23:20] PROBLEM - Disk space on mw2266 is CRITICAL: DISK CRITICAL - free space: / 33 MB (0% inode=98%): /tmp 33 MB (0% inode=98%): /var/tmp 33 MB (0% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2266&var-datasource=codfw+prometheus/ops [04:40:48] RECOVERY - Disk space on mw2281 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2281&var-datasource=codfw+prometheus/ops [05:03:20] PROBLEM - Disk space on mw2266 is CRITICAL: DISK CRITICAL - free space: / 3811 MB (3% inode=98%): /tmp 3811 MB (3% inode=98%): /var/tmp 3811 MB (3% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2266&var-datasource=codfw+prometheus/ops [05:17:25] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-web_hourly.service on cumin1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:17:58] RECOVERY - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [05:48:11] (03PS1) 10Kevin Bazira: ml-services: increase article-descriptions memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006870 (https://phabricator.wikimedia.org/T358467) [06:23:20] RECOVERY - Disk space on mw2266 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2266&var-datasource=codfw+prometheus/ops [06:42:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1232 - optimizing revision table T354015', diff saved to https://phabricator.wikimedia.org/P58014 and previous config saved to /var/cache/conftool/dbconfig/20240228-064210-root.json [06:42:17] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015 [06:44:03] !log marostegui@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s3 [06:44:07] !log marostegui@cumin1002 conftool action : set/pooled=no; selector: name=clouddb1013.eqiad.wmnet,service=s1 [06:44:19] (03CR) 10Marostegui: [C: 03+2] clouddb1013: Migrate to MariaDB 10.6 [puppet] - 10https://gerrit.wikimedia.org/r/1006963 (https://phabricator.wikimedia.org/T356838) (owner: 10Marostegui) [06:47:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es2027 T358180', diff saved to https://phabricator.wikimedia.org/P58015 and previous config saved to /var/cache/conftool/dbconfig/20240228-064731-root.json [06:47:38] T358180: Upgrade es3 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358180 [06:49:18] (03PS1) 
10Marostegui: es2027: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1007200 (https://phabricator.wikimedia.org/T358180) [06:50:38] (03CR) 10Marostegui: [C: 03+2] es2027: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1007200 (https://phabricator.wikimedia.org/T358180) (owner: 10Marostegui) [06:51:15] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es2027.codfw.wmnet with OS bookworm [06:51:27] !log marostegui@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=s3 [06:51:30] !log marostegui@cumin1002 conftool action : set/pooled=yes; selector: name=clouddb1013.eqiad.wmnet,service=s1 [06:57:02] RECOVERY - Disk space on mw2278 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2278&var-datasource=codfw+prometheus/ops [06:57:13] (03PS1) 10Marostegui: db2186: Remove blank line [puppet] - 10https://gerrit.wikimedia.org/r/1007201 [06:58:48] (03CR) 10Marostegui: [C: 03+2] db2186: Remove blank line [puppet] - 10https://gerrit.wikimedia.org/r/1007201 (owner: 10Marostegui) [06:58:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2186.codfw.wmnet with OS bookworm [07:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240228T0700) [07:01:18] (03PS1) 10Marostegui: Revert "es2027: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1006782 [07:08:02] (JobUnavailable) firing: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:08:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es2027.codfw.wmnet with reason: host reimage [07:09:45] !log 
marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2187.codfw.wmnet with OS bookworm [07:11:08] 06SRE, 06DBA, 07Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9582477 (10Marostegui) The data looks correct. I am not going to repool this host for now, I am going to wait until its replacement in T355350 gets installed and simply clone that one and dec... [07:11:17] 06SRE, 06DBA, 07Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9582478 (10Marostegui) p:05High→03Medium [07:11:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es2027.codfw.wmnet with reason: host reimage [07:11:38] 06SRE, 06DBA, 07Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9582480 (10Marostegui) [07:17:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2186.codfw.wmnet with reason: host reimage [07:20:37] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2186.codfw.wmnet with reason: host reimage [07:23:47] 10ops-codfw: lsw1-b7-codfw - FPC0: PEM 0 Not Powered - https://phabricator.wikimedia.org/T358639 (10ayounsi) p:05Triage→03High [07:27:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2187.codfw.wmnet with reason: host reimage [07:27:25] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es2027.codfw.wmnet with OS bookworm [07:27:28] (03CR) 10Marostegui: [C: 03+2] Revert "es2027: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1006782 (owner: 10Marostegui) [07:27:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 5%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58016 and previous config saved to /var/cache/conftool/dbconfig/20240228-072757-root.json 
[07:31:24] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2187.codfw.wmnet with reason: host reimage [07:43:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2156 T358640', diff saved to https://phabricator.wikimedia.org/P58017 and previous config saved to /var/cache/conftool/dbconfig/20240228-074259-root.json [07:43:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 10%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58018 and previous config saved to /var/cache/conftool/dbconfig/20240228-074302-root.json [07:43:14] T358640: Reclone db2186:3313 (sanitarium) - https://phabricator.wikimedia.org/T358640 [07:43:52] (03PS1) 10Marostegui: db2156: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1007257 (https://phabricator.wikimedia.org/T358640) [07:43:57] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2186.codfw.wmnet with OS bookworm [07:48:02] (JobUnavailable) resolved: Reduced availability for job mysql-labs in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [07:48:18] (03CR) 10Slyngshede: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1006928 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [07:49:25] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:pki::multirootca::monitoring Collect metrics from intermediate. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006907 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [07:50:03] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1006945 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [07:51:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2187.codfw.wmnet with OS bookworm [07:56:01] (03PS1) 10Muehlenhoff: idp-test: Align acmechief setting to the role, not via host records [puppet] - 10https://gerrit.wikimedia.org/r/1007258 [07:58:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 25%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58020 and previous config saved to /var/cache/conftool/dbconfig/20240228-075807-root.json [08:00:05] Amir1 and Urbanecm: #bothumor I � Unicode. All rise for UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240228T0800). [08:00:05] kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:17] * kart_ is here; will deploy.. 
[08:00:43] (03CR) 10KartikMistry: [C: 03+2] Enable SectionTranslation for Wikipedias where ContentTranslation is in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004613 (https://phabricator.wikimedia.org/T353734) (owner: 10KartikMistry) [08:02:15] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.makevm for new host idp-test2003.wikimedia.org [08:02:16] !log slyngshede@cumin1002 START - Cookbook sre.dns.netbox [08:02:18] (03PS3) 10KartikMistry: Enable SectionTranslation for Wikipedias where ContentTranslation is in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004613 (https://phabricator.wikimedia.org/T353734) [08:04:12] (03CR) 10KartikMistry: [V: 03+2 C: 03+2] Enable SectionTranslation for Wikipedias where ContentTranslation is in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004613 (https://phabricator.wikimedia.org/T353734) (owner: 10KartikMistry) [08:05:07] ah rebase. [08:06:22] (03PS3) 10Slyngshede: C:tomcat Allow users to specify which version of Tomcat to install. 
[puppet] - 10https://gerrit.wikimedia.org/r/1006926 (https://phabricator.wikimedia.org/T357748) [08:06:55] (03PS15) 10KartikMistry: Enable Section Translation on newly created Wikipedias by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995176 (https://phabricator.wikimedia.org/T298235) [08:06:57] (03PS4) 10KartikMistry: Enable SectionTranslation for Wikipedias where ContentTranslation is in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004613 (https://phabricator.wikimedia.org/T353734) [08:08:40] (03CR) 10KartikMistry: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004613 (https://phabricator.wikimedia.org/T353734) (owner: 10KartikMistry) [08:09:46] (03CR) 10Marostegui: [C: 03+2] db2156: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1007257 (https://phabricator.wikimedia.org/T358640) (owner: 10Marostegui) [08:10:31] (03CR) 10KartikMistry: [C: 03+2] Enable Section Translation on newly created Wikipedias by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995176 (https://phabricator.wikimedia.org/T298235) (owner: 10KartikMistry) [08:11:19] (03Merged) 10jenkins-bot: Enable Section Translation on newly created Wikipedias by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995176 (https://phabricator.wikimedia.org/T298235) (owner: 10KartikMistry) [08:11:21] (03Merged) 10jenkins-bot: Enable SectionTranslation for Wikipedias where ContentTranslation is in beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1004613 (https://phabricator.wikimedia.org/T353734) (owner: 10KartikMistry) [08:13:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 50%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58021 and previous config saved to /var/cache/conftool/dbconfig/20240228-081312-root.json [08:14:49] No log from scap? 
[08:15:35] !log kartik@deploy2002 Started scap: Backport for [[gerrit:995176|Enable Section Translation on newly created Wikipedias by default (T298235)]], [[gerrit:1004613|Enable SectionTranslation for Wikipedias where ContentTranslation is in beta (T353734)]] [08:16:35] Now :) [08:17:13] !log kartik@deploy2002 kartik: Backport for [[gerrit:995176|Enable Section Translation on newly created Wikipedias by default (T298235)]], [[gerrit:1004613|Enable SectionTranslation for Wikipedias where ContentTranslation is in beta (T353734)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:18:30] (ProbeDown) firing: (2) Service wdqs2008:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:20:29] !log kartik@deploy2002 kartik: Continuing with sync [08:23:30] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:28:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 75%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58022 and previous config saved to /var/cache/conftool/dbconfig/20240228-082817-root.json [08:28:35] !log kartik@deploy2002 Finished scap: Backport for [[gerrit:995176|Enable Section Translation on newly created Wikipedias by default (T298235)]], [[gerrit:1004613|Enable SectionTranslation for Wikipedias where ContentTranslation is in beta (T353734)]] (duration: 12m 59s) [08:28:41] T298235: Enable Section Translation on newly created Wikipedias by default - https://phabricator.wikimedia.org/T298235 [08:28:42] T353734: Fix the 
mobile experience for a group of Wikipedias where Content Translation is in beta - https://phabricator.wikimedia.org/T353734 [08:29:48] * kart_ is done. [08:33:28] 06SRE, 10Wikimedia-Etherpad, 06collaboration-services, 13Patch-For-Review, 07User-notice: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421#9582637 (10Jelto) [08:36:52] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [08:43:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'es2027 (re)pooling @ 100%: After migration to 10.6', diff saved to https://phabricator.wikimedia.org/P58023 and previous config saved to /var/cache/conftool/dbconfig/20240228-084322-root.json [08:46:53] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance [08:47:07] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1156.eqiad.wmnet with reason: Maintenance [08:47:09] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:47:25] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [08:47:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T357189)', diff saved to https://phabricator.wikimedia.org/P58024 and previous config saved to /var/cache/conftool/dbconfig/20240228-084731-arnaudb.json [08:47:46] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [08:50:39] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp-test2003.wikimedia.org - 
slyngshede@cumin1002" [08:51:25] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM idp-test2003.wikimedia.org - slyngshede@cumin1002" [08:51:26] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:51:26] !log slyngshede@cumin1002 START - Cookbook sre.dns.wipe-cache idp-test2003.wikimedia.org on all recursors [08:51:29] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) idp-test2003.wikimedia.org on all recursors [08:51:55] !log slyngshede@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp-test2003.wikimedia.org - slyngshede@cumin1002" [08:52:45] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM idp-test2003.wikimedia.org - slyngshede@cumin1002" [08:53:02] (JobUnavailable) firing: Reduced availability for job etherpad in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [08:55:12] !log slyngshede@cumin1002 START - Cookbook sre.hosts.reimage for host idp-test2003.wikimedia.org with OS bookworm [08:55:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T357189)', diff saved to https://phabricator.wikimedia.org/P58025 and previous config saved to /var/cache/conftool/dbconfig/20240228-085523-arnaudb.json [08:55:29] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [08:55:50] PROBLEM - Disk space on mw2281 is CRITICAL: DISK CRITICAL - free space: / 3741 MB (3% inode=98%): /tmp 3741 MB (3% inode=98%): /var/tmp 3741 MB (3% inode=98%): 
https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2281&var-datasource=codfw+prometheus/ops [09:06:52] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:10:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P58026 and previous config saved to /var/cache/conftool/dbconfig/20240228-091029-arnaudb.json [09:12:37] !log slyngshede@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on idp-test2003.wikimedia.org with reason: host reimage [09:13:02] !log temporary disabling puppet on cumin1002 [09:13:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:49] !log installing perl security updates on bullseye [09:14:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:15:36] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on idp-test2003.wikimedia.org with reason: host reimage [09:15:45] (03CR) 10Volans: [C: 03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1003490 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [09:23:38] !log slyngshede@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host idp-test2003.wikimedia.org with OS bookworm [09:23:38] !log slyngshede@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host idp-test2003.wikimedia.org [09:24:02] (03CR) 10Majavah: profile::base: Allow running without cron installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006917 (https://phabricator.wikimedia.org/T358343) (owner: 10Muehlenhoff) [09:24:48] (03CR) 10Klausman: [C: 03+1] ml-services: increase article-descriptions memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006870 
(https://phabricator.wikimedia.org/T358467) (owner: 10Kevin Bazira) [09:25:25] !log installed spicerack 8.4.0 on cumin2002 [09:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:25:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P58027 and previous config saved to /var/cache/conftool/dbconfig/20240228-092535-arnaudb.json [09:25:55] (03CR) 10Ayounsi: [C: 03+2] makevm: pass the v6 IP to GntInstance.add [cookbooks] - 10https://gerrit.wikimedia.org/r/1003490 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [09:26:25] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM (untested)" [puppet] - 10https://gerrit.wikimedia.org/r/1007263 (https://phabricator.wikimedia.org/T3506947) (owner: 10Slyngshede) [09:27:02] PROBLEM - Disk space on mw2278 is CRITICAL: DISK CRITICAL - free space: / 2933 MB (2% inode=98%): /tmp 2933 MB (2% inode=98%): /var/tmp 2933 MB (2% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2278&var-datasource=codfw+prometheus/ops [09:27:51] (03CR) 10Muehlenhoff: profile::base: Allow running without cron installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006917 (https://phabricator.wikimedia.org/T358343) (owner: 10Muehlenhoff) [09:28:29] !log joal@deploy2002 Started deploy [analytics/refinery@dba67fd]: Additional analytics weekly train [analytics/refinery@dba67fd6] [09:29:53] (03CR) 10Majavah: profile::base: Allow running without cron installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006917 (https://phabricator.wikimedia.org/T358343) (owner: 10Muehlenhoff) [09:31:05] (03PS1) 10STran: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007290 (https://phabricator.wikimedia.org/T354758) [09:31:59] (03Merged) 10jenkins-bot: makevm: pass the v6 IP to GntInstance.add [cookbooks] - 
10https://gerrit.wikimedia.org/r/1003490 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [09:32:58] (03PS1) 10Muehlenhoff: standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/1007291 [09:33:00] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] C:tomcat Allow users to specify which version of Tomcat to install. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006926 (https://phabricator.wikimedia.org/T357748) (owner: 10Slyngshede) [09:33:04] (03CR) 10Kosta Harlan: ipoid: Bump version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007290 (https://phabricator.wikimedia.org/T354758) (owner: 10STran) [09:34:10] !log installing monitoring-plugins bugfix updates from Bookworm point update [09:34:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:35:52] PROBLEM - Disk space on mw2281 is CRITICAL: DISK CRITICAL - free space: / 1754 MB (1% inode=98%): /tmp 1754 MB (1% inode=98%): /var/tmp 1754 MB (1% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2281&var-datasource=codfw+prometheus/ops [09:38:12] (03CR) 10Filippo Giunchedi: [C: 03+1] "Thank you! 
I'll merge/deploy" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006996 (https://phabricator.wikimedia.org/T320555) (owner: 10CDanis) [09:38:19] (03CR) 10Filippo Giunchedi: [C: 03+2] [jaeger] oauth2-proxy doesn't need to authorize [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006996 (https://phabricator.wikimedia.org/T320555) (owner: 10CDanis) [09:39:14] jouncebot: nowandnext [09:39:14] No deployments scheduled for the next 1 hour(s) and 20 minute(s) [09:39:14] In 1 hour(s) and 20 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240228T1100) [09:39:57] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [09:40:00] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [09:40:03] (03CR) 10Ladsgroup: [C: 03+2] Set three more wikis to read new on pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006853 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [09:40:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T357189)', diff saved to https://phabricator.wikimedia.org/P58028 and previous config saved to /var/cache/conftool/dbconfig/20240228-094041-arnaudb.json [09:40:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [09:40:50] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [09:40:52] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [09:40:52] !log ayounsi@cumin2002 START - Cookbook sre.ganeti.makevm for new host testvm2006.codfw.wmnet [09:40:54] !log ayounsi@cumin2002 START - Cookbook sre.dns.netbox [09:40:55] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [09:40:58] !log arnaudb@cumin1002 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1182.eqiad.wmnet with reason: Maintenance [09:41:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T357189)', diff saved to https://phabricator.wikimedia.org/P58029 and previous config saved to /var/cache/conftool/dbconfig/20240228-094103-arnaudb.json [09:41:29] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [09:41:32] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [09:41:45] !log joal@deploy2002 Finished deploy [analytics/refinery@dba67fd]: Additional analytics weekly train [analytics/refinery@dba67fd6] (duration: 13m 16s) [09:42:06] !log joal@deploy2002 Started deploy [analytics/refinery@dba67fd] (thin): Additional analytics weekly train - THIN [analytics/refinery@dba67fd6] [09:42:11] !log joal@deploy2002 Finished deploy [analytics/refinery@dba67fd] (thin): Additional analytics weekly train - THIN [analytics/refinery@dba67fd6] (duration: 00m 05s) [09:42:30] !log joal@deploy2002 Started deploy [analytics/refinery@dba67fd] (hadoop-test): Additional analytics weekly train - TEST [analytics/refinery@dba67fd6] [09:42:48] !log ayounsi@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - ayounsi@cumin2002" [09:43:04] (03PS2) 10Ladsgroup: Set three more wikis to read new on pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006853 (https://phabricator.wikimedia.org/T351237) [09:43:08] (03CR) 10Ladsgroup: "." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006853 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [09:43:11] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006853 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [09:43:25] (03CR) 10TrainBranchBot: "Approved by ladsgroup@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006853 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [09:44:05] (03Merged) 10jenkins-bot: Set three more wikis to read new on pagelinks migration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006853 (https://phabricator.wikimedia.org/T351237) (owner: 10Ladsgroup) [09:44:27] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:1006853|Set three more wikis to read new on pagelinks migration (T351237)]] [09:44:34] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [09:44:48] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - ayounsi@cumin2002" [09:44:48] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:44:49] !log ayounsi@cumin2002 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [09:44:52] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [09:45:12] !log ayounsi@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - ayounsi@cumin2002" [09:45:20] (03CR) 10Muehlenhoff: profile::base: Allow running without cron installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006917 
(https://phabricator.wikimedia.org/T358343) (owner: 10Muehlenhoff) [09:45:55] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:1006853|Set three more wikis to read new on pagelinks migration (T351237)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [09:46:03] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - ayounsi@cumin2002" [09:46:03] !log joal@deploy2002 Finished deploy [analytics/refinery@dba67fd] (hadoop-test): Additional analytics weekly train - TEST [analytics/refinery@dba67fd6] (duration: 03m 33s) [09:46:28] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [09:46:40] !log ayounsi@cumin2002 START - Cookbook sre.hosts.reimage for host testvm2006.codfw.wmnet with OS bookworm [09:48:02] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:49:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T357189)', diff saved to https://phabricator.wikimedia.org/P58030 and previous config saved to /var/cache/conftool/dbconfig/20240228-094900-arnaudb.json [09:49:06] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [09:49:10] (03PS7) 10Fabfur: cache: start using benthos on single host for haproxy log parsing [puppet] - 10https://gerrit.wikimedia.org/r/1006976 (https://phabricator.wikimedia.org/T358109) [09:49:17] (03PS4) 10Majavah: P:openstack: rabbitmq: use firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/998419 [09:49:19] (03PS1) 10Majavah: P:openstack: rabbitmq: remove designate_hosts entirely [puppet] - 10https://gerrit.wikimedia.org/r/1007292 
(https://phabricator.wikimedia.org/T350995) [09:49:21] (03PS1) 10Majavah: P:openstack: rabbitmq: remove cloudcontrol term [puppet] - 10https://gerrit.wikimedia.org/r/1007293 [09:49:23] (03PS1) 10Majavah: P:openstack: rabbitmq: remove cloud-hosts term [puppet] - 10https://gerrit.wikimedia.org/r/1007294 [09:49:25] (03PS1) 10Majavah: P:openstack: rabbitmq: remove cinder-backups term [puppet] - 10https://gerrit.wikimedia.org/r/1007295 [09:50:48] (03CR) 10Majavah: profile::base: Allow running without cron installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006917 (https://phabricator.wikimedia.org/T358343) (owner: 10Muehlenhoff) [09:51:32] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1500/co" [puppet] - 10https://gerrit.wikimedia.org/r/998419 (owner: 10Majavah) [09:53:02] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:53:47] (03PS3) 10Slyngshede: P:pki::multirootca::monitoring Fix script path. [puppet] - 10https://gerrit.wikimedia.org/r/1007263 (https://phabricator.wikimedia.org/T3506947) [09:53:58] (03CR) 10Slyngshede: P:pki::multirootca::monitoring Fix script path. 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007263 (https://phabricator.wikimedia.org/T3506947) (owner: 10Slyngshede) [09:54:30] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:1006853|Set three more wikis to read new on pagelinks migration (T351237)]] (duration: 10m 03s) [09:54:37] T351237: Set beta and production to read new for pagelinks migration - https://phabricator.wikimedia.org/T351237 [09:55:24] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1006976 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [09:56:20] (03CR) 10Muehlenhoff: profile::base: Allow running without cron installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006917 (https://phabricator.wikimedia.org/T358343) (owner: 10Muehlenhoff) [09:56:51] (03CR) 10Majavah: [C: 03+1] profile::base: Allow running without cron installed [puppet] - 10https://gerrit.wikimedia.org/r/1006917 (https://phabricator.wikimedia.org/T358343) (owner: 10Muehlenhoff) [09:57:02] (03CR) 10Hashar: "> With PS3 I get:" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1004192 (owner: 10Hashar) [09:57:14] (03PS4) 10Hashar: Change build image user from root to nobody [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1004192 [09:57:36] (03CR) 10Majavah: [C: 03+1] profile::base: Allow running without cron installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1006917 (https://phabricator.wikimedia.org/T358343) (owner: 10Muehlenhoff) [09:57:51] 06SRE, 06Infrastructure-Foundations: Integrate Bookworm 12.5 point update - https://phabricator.wikimedia.org/T357133#9582773 (10MoritzMuehlenhoff) [10:00:24] !log ayounsi@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2006.codfw.wmnet with reason: host reimage [10:03:26] !log ayounsi@cumin2002 END (PASS) - Cookbook 
sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2006.codfw.wmnet with reason: host reimage [10:03:29] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1502/co" [puppet] - 10https://gerrit.wikimedia.org/r/1007263 (https://phabricator.wikimedia.org/T3506947) (owner: 10Slyngshede) [10:03:54] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:pki::multirootca::monitoring Fix script path. [puppet] - 10https://gerrit.wikimedia.org/r/1007263 (https://phabricator.wikimedia.org/T3506947) (owner: 10Slyngshede) [10:04:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P58032 and previous config saved to /var/cache/conftool/dbconfig/20240228-100406-arnaudb.json [10:04:59] !log clearing up leftover boxedcommand media files on mw2278 - sudo find . -type f \( -name '*.wav' -o -name '*.ogg' -o -name '*.webm' -o -name '*.mov' -o -name '*.mp4' \) -mmin +1200 -exec sh -c "lsof {} || rm {}" \; [10:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:33] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [10:06:57] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [10:06:59] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:07:03] RECOVERY - Disk space on mw2278 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2278&var-datasource=codfw+prometheus/ops [10:07:14] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on 
clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [10:07:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T352010)', diff saved to https://phabricator.wikimedia.org/P58033 and previous config saved to /var/cache/conftool/dbconfig/20240228-100720-ladsgroup.json [10:07:26] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [10:08:13] (03PS1) 10Marostegui: Revert "db2156: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1007277 [10:08:25] PROBLEM - Disk space on mw2266 is CRITICAL: DISK CRITICAL - free space: / 2122 MB (1% inode=98%): /tmp 2122 MB (1% inode=98%): /var/tmp 2122 MB (1% inode=98%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2266&var-datasource=codfw+prometheus/ops [10:09:48] (03CR) 10Ayounsi: Routed Ganeti: use per tap interface dhcrelay (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1003452 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [10:12:04] !log ladsgroup@cumin1002 START - Cookbook sre.mysql.clone of db2156.codfw.wmnet onto db2177.codfw.wmnet [10:12:13] 06SRE, 10Observability-Alerting: SystemdUnitFailed alert aggregation issues - https://phabricator.wikimedia.org/T358648 (10Volans) [10:12:15] !log clearing up leftover boxedcommand media files on mw2281 - sudo find . 
-type f \( -name '*.wav' -o -name '*.ogg' -o -name '*.webm' -o -name '*.mov' -o -name '*.mp4' \) -mmin +1200 -exec sh -c "lsof {} || rm {}" \; [10:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:13:03] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:13:30] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:14:27] (03PS1) 10Fabfur: hiera: minor fix for benthos env_variables structure [puppet] - 10https://gerrit.wikimedia.org/r/1007299 (https://phabricator.wikimedia.org/T358647) [10:15:53] RECOVERY - Disk space on mw2281 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2281&var-datasource=codfw+prometheus/ops [10:18:02] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:18:20] !log installed spicerack 8.4.0 on cumin1002 [10:18:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:18:42] (03CR) 10Btullis: [C: 03+1] superset: add availability monitor [alerts] - 10https://gerrit.wikimedia.org/r/1005540 (https://phabricator.wikimedia.org/T356484) (owner: 10Stevemunene) [10:19:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P58034 and 
previous config saved to /var/cache/conftool/dbconfig/20240228-101913-arnaudb.json [10:19:19] (03CR) 10Btullis: [C: 03+1] airflow::instance: Pass web server port as an integer [puppet] - 10https://gerrit.wikimedia.org/r/990060 (owner: 10Muehlenhoff) [10:20:43] (03CR) 10Ayounsi: Cookbook to renumber a host while changing its vlan (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [10:28:03] (SystemdUnitFailed) firing: (44) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:28:10] (03PS1) 10Joal: Update analytics mediawiki_dumps_import [puppet] - 10https://gerrit.wikimedia.org/r/1007301 [10:31:07] !log copy cas from bullseye-wikimedia to bookworm-wikimedia T357748 [10:31:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:31:13] T357748: Migrate CAS to Bookworm - https://phabricator.wikimedia.org/T357748 [10:31:20] !log cgoubert@cumin2002 conftool action : set/weight=15; selector: name=mw(2259|226[3-6]|2278|2279|2281).codfw.wmnet,cluster=videoscaler [10:31:54] (03CR) 10CI reject: [V: 04-1] Update analytics mediawiki_dumps_import [puppet] - 10https://gerrit.wikimedia.org/r/1007301 (owner: 10Joal) [10:32:27] !log Lowered the weight of small disk videoscalers [10:32:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:34:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T357189)', diff saved to https://phabricator.wikimedia.org/P58035 and previous config saved to /var/cache/conftool/dbconfig/20240228-103419-arnaudb.json [10:34:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance [10:34:24] (03CR) 10Cathal Mooney: Remove 
cloud_private_v4_set from cloudgw nftables definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/999004 (https://phabricator.wikimedia.org/T356850) (owner: 10Cathal Mooney) [10:34:26] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [10:34:36] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1188.eqiad.wmnet with reason: Maintenance [10:34:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T357189)', diff saved to https://phabricator.wikimedia.org/P58036 and previous config saved to /var/cache/conftool/dbconfig/20240228-103442-arnaudb.json [10:35:20] (03PS1) 10Muehlenhoff: Rebuild cas for Bookworm, and depend on tomcat 10 now [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1007302 (https://phabricator.wikimedia.org/T357749) [10:36:21] (03PS1) 10Marostegui: installserver: Do not reimage es1035 [puppet] - 10https://gerrit.wikimedia.org/r/1007303 [10:37:57] (03PS2) 10Joal: Update analytics mediawiki_dumps_import [puppet] - 10https://gerrit.wikimedia.org/r/1007301 (https://phabricator.wikimedia.org/T357859) [10:39:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T357189)', diff saved to https://phabricator.wikimedia.org/P58037 and previous config saved to /var/cache/conftool/dbconfig/20240228-103942-arnaudb.json [10:39:50] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [10:40:18] (03CR) 10Marostegui: [C: 03+2] installserver: Do not reimage es1035 [puppet] - 10https://gerrit.wikimedia.org/r/1007303 (owner: 10Marostegui) [10:45:16] btullis, brouberol: Hi folks, I have added you as reviewers of https://gerrit.wikimedia.org/r/c/operations/puppet/+/1007301 [10:45:30] The underlying change to the script has been merged and deployed [10:45:47] woops - I should have pinged in the analytics chan - sorry for the noise [10:46:16] joal: That's OK. 
I have been following along. Happy to review and deploy it any time you like. Checking now. [10:47:32] (03CR) 10Volans: [C: 03+1] "Thanks for the fixes, I think we're good, there is still a minor issue, not sure worth investigating:" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1004192 (owner: 10Hashar) [10:49:02] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1503/co" [puppet] - 10https://gerrit.wikimedia.org/r/1007301 (https://phabricator.wikimedia.org/T357859) (owner: 10Joal) [10:52:42] 06SRE, 10LDAP-Access-Requests: Grant Access to Superset for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T358091#9582950 (10cmooney) a:03cmooney Hi @Ifeatu_Nnaobi_WMDE, I think you can work with @KFRancis to get the NDA signed now. @KFrancis if you can post back here to confirm when that's done I'... [10:54:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P58038 and previous config saved to /var/cache/conftool/dbconfig/20240228-105449-arnaudb.json [10:55:53] (03CR) 10Kevin Bazira: [C: 03+2] ml-services: increase article-descriptions memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006870 (https://phabricator.wikimedia.org/T358467) (owner: 10Kevin Bazira) [10:56:11] 06SRE: Improve automation for the vendor maintenance calendar - https://phabricator.wikimedia.org/T357630#9582954 (10cmooney) p:05Triage→03Low Setting to low as it seems reasonable, @andrea.denisse feel free to change. 
[10:56:48] 06SRE, 10FY2023/2024-Q3: Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7 - https://phabricator.wikimedia.org/T358506#9582962 (10cmooney) p:05Triage→03Medium [10:56:54] (03Merged) 10jenkins-bot: ml-services: increase article-descriptions memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006870 (https://phabricator.wikimedia.org/T358467) (owner: 10Kevin Bazira) [10:58:39] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1007302 (https://phabricator.wikimedia.org/T357749) (owner: 10Muehlenhoff) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240228T1100) [11:00:06] (03PS2) 10STran: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007290 (https://phabricator.wikimedia.org/T354758) [11:02:53] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [11:02:56] (03CR) 10STran: ipoid: Bump version (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007290 (https://phabricator.wikimedia.org/T354758) (owner: 10STran) [11:02:59] (03PS2) 10Muehlenhoff: Rebuild cas for Bookworm, and depend on tomcat 10 now [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1007302 (https://phabricator.wikimedia.org/T357749) [11:03:08] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
[11:04:41] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Rebuild cas for Bookworm, and depend on tomcat 10 now [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1007302 (https://phabricator.wikimedia.org/T357749) (owner: 10Muehlenhoff) [11:05:29] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007290 (https://phabricator.wikimedia.org/T354758) (owner: 10STran) [11:06:20] (03Merged) 10jenkins-bot: ipoid: Bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007290 (https://phabricator.wikimedia.org/T354758) (owner: 10STran) [11:07:15] (03PS3) 10Btullis: Update analytics mediawiki_dumps_import [puppet] - 10https://gerrit.wikimedia.org/r/1007301 (https://phabricator.wikimedia.org/T357859) (owner: 10Joal) [11:08:25] RECOVERY - Disk space on mw2266 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=mw2266&var-datasource=codfw+prometheus/ops [11:08:29] (03PS1) 10Giuseppe Lavagetto: Rakefile: remove useless files from generated docs [puppet] - 10https://gerrit.wikimedia.org/r/1007304 [11:08:53] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1504/co" [puppet] - 10https://gerrit.wikimedia.org/r/1007301 (https://phabricator.wikimedia.org/T357859) (owner: 10Joal) [11:09:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P58039 and previous config saved to /var/cache/conftool/dbconfig/20240228-110955-arnaudb.json [11:13:19] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dbstore1007.eqiad.wmnet with OS bookworm [11:13:59] !log import cas 6.6.12+wmf12u1 to bookworm-wikimedia T357748 [11:14:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:05] T357748: 
Migrate CAS to Bookworm - https://phabricator.wikimedia.org/T357748 [11:14:32] (03PS4) 10Btullis: Update analytics mediawiki_dumps_import [puppet] - 10https://gerrit.wikimedia.org/r/1007301 (https://phabricator.wikimedia.org/T357859) (owner: 10Joal) [11:16:03] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/1007304 (owner: 10Giuseppe Lavagetto) [11:16:09] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1505/co" [puppet] - 10https://gerrit.wikimedia.org/r/1007301 (https://phabricator.wikimedia.org/T357859) (owner: 10Joal) [11:17:32] (03CR) 10Hashar: "I don't have that issue with Linux/Podman. Maybe that is related to Docker on Mac and the underlying filesystem used on the host?" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1004192 (owner: 10Hashar) [11:18:00] 06SRE, 10ops-eqiad, 10Cloud-VPS, 06DC-Ops, 10FY2023/2024-Q3-Q4: cloudcephosd1021-1034: hard drive sector errors increasing - https://phabricator.wikimedia.org/T348643#9582996 (10dcaro) [11:18:49] !log stran@deploy2002 helmfile [staging] START helmfile.d/services/ipoid: apply [11:19:34] (03PS1) 10Muehlenhoff: Add superset-admins to sensitive groups for offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1007305 (https://phabricator.wikimedia.org/T358650) [11:19:38] !log stran@deploy2002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [11:21:12] (03CR) 10Btullis: [V: 03+1 C: 03+2] "Looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/1007301 (https://phabricator.wikimedia.org/T357859) (owner: 10Joal) [11:22:02] !log stran@deploy2002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [11:22:46] !log stran@deploy2002 helmfile [eqiad] DONE helmfile.d/services/ipoid: apply [11:23:53] !log stran@deploy2002 helmfile [codfw] START helmfile.d/services/ipoid: apply [11:24:19] !log stran@deploy2002 helmfile [codfw] DONE helmfile.d/services/ipoid: apply [11:25:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T357189)', diff saved to https://phabricator.wikimedia.org/P58041 and previous config saved to /var/cache/conftool/dbconfig/20240228-112501-arnaudb.json [11:25:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [11:25:09] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [11:25:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1197.eqiad.wmnet with reason: Maintenance [11:25:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T357189)', diff saved to https://phabricator.wikimedia.org/P58042 and previous config saved to /var/cache/conftool/dbconfig/20240228-112523-arnaudb.json [11:27:02] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1007.eqiad.wmnet with reason: host reimage [11:28:43] (03CR) 10Effie Mouzeli: [V: 03+2 C: 03+2] php: add env[MCROUTER_SERVER] variable [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/994764 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:30:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T357189)', diff saved to https://phabricator.wikimedia.org/P58043 and previous config saved to /var/cache/conftool/dbconfig/20240228-113022-arnaudb.json [11:30:29] T357189: Drop iwl_prefix_from_title from 
iwlinks - https://phabricator.wikimedia.org/T357189 [11:31:21] (03PS16) 10Effie Mouzeli: mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) [11:31:38] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1007.eqiad.wmnet with reason: host reimage [11:32:27] (03PS40) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [11:32:31] (03CR) 10Effie Mouzeli: [C: 03+2] mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:33:15] (03Merged) 10jenkins-bot: mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:33:28] (03CR) 10Effie Mouzeli: [C: 03+2] mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:34:15] (03Merged) 10jenkins-bot: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:34:17] 06SRE, 10LDAP-Access-Requests: Grant Access to Superset for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T358091#9583069 (10Ifeatu_Nnaobi_WMDE) Thank you, I have sent the email requesting the NDA :) [11:37:00] (03CR) 10Effie Mouzeli: [C: 03+2] deployment_server: add mw-mcrouter service 1 [puppet] - 10https://gerrit.wikimedia.org/r/979339 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:40:50] (03PS4) 10Effie Mouzeli: Add namespace for mw-mcrouter service 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/979340 (https://phabricator.wikimedia.org/T346690) [11:41:53] (03PS1) 10Muehlenhoff: More postinst changes to cope 
with Tomcat 9->10 changes [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1007308 (https://phabricator.wikimedia.org/T357748) [11:43:36] !jouncebot next [11:43:36] a Python reminder bot for deployments. see https://wikitech.wikimedia.org/wiki/Tool:Jouncebot [11:44:00] jouncebot: next [11:44:00] In 2 hour(s) and 15 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240228T1400) [11:44:03] arg [11:44:06] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2156.codfw.wmnet onto db2177.codfw.wmnet [11:45:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P58044 and previous config saved to /var/cache/conftool/dbconfig/20240228-114529-arnaudb.json [11:46:26] (03CR) 10Effie Mouzeli: [C: 03+2] Add namespace for mw-mcrouter service 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/979340 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:49:13] (03Merged) 10jenkins-bot: Add namespace for mw-mcrouter service 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/979340 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:49:24] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Rakefile: remove useless files from generated docs [puppet] - 10https://gerrit.wikimedia.org/r/1007304 (owner: 10Giuseppe Lavagetto) [11:52:18] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1007.eqiad.wmnet with OS bookworm [11:54:49] !log jiji@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[11:56:09] (03CR) 10Slyngshede: More postinst changes to cope with Tomcat 9->10 changes (032 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1007308 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [11:57:49] !log jiji@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [11:58:29] (03CR) 10Slyngshede: More postinst changes to cope with Tomcat 9->10 changes (031 comment) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1007308 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [11:59:12] (03CR) 10Stevemunene: [C: 03+2] superset: add availability monitor [alerts] - 10https://gerrit.wikimedia.org/r/1005540 (https://phabricator.wikimedia.org/T356484) (owner: 10Stevemunene) [12:00:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P58045 and previous config saved to /var/cache/conftool/dbconfig/20240228-120035-arnaudb.json [12:00:48] (03Merged) 10jenkins-bot: superset: add availability monitor [alerts] - 10https://gerrit.wikimedia.org/r/1005540 (https://phabricator.wikimedia.org/T356484) (owner: 10Stevemunene) [12:01:52] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dbstore1007.eqiad.wmnet with OS bullseye [12:04:24] (03PS7) 10Fabfur: haproxy: initial work to support easy-ratelimiting [puppet] - 10https://gerrit.wikimedia.org/r/1005089 (https://phabricator.wikimedia.org/T306580) [12:06:14] (03PS1) 10Effie Mouzeli: admin_ng: do not create a TLS certificate for mw-mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007315 [12:07:08] (03CR) 10Muehlenhoff: More postinst changes to cope with Tomcat 9->10 changes (033 comments) [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1007308 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [12:08:38] (03CR) 10Muehlenhoff: [C: 03+2] Add superset-admins to sensitive groups for offboarding 
[puppet] - 10https://gerrit.wikimedia.org/r/1007305 (https://phabricator.wikimedia.org/T358650) (owner: 10Muehlenhoff) [12:08:44] (03PS2) 10Muehlenhoff: Add superset-admins to sensitive groups for offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1007305 (https://phabricator.wikimedia.org/T358650) [12:09:32] (03PS1) 10Jgiannelos: mobileapps: Switchover outgoing parsoid traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007317 (https://phabricator.wikimedia.org/T339865) [12:10:16] (03PS8) 10Fabfur: haproxy: initial work to support easy-ratelimiting [puppet] - 10https://gerrit.wikimedia.org/r/1005089 (https://phabricator.wikimedia.org/T306580) [12:12:27] 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658 (10KCVelaga_WMF) [12:12:42] (03PS9) 10Fabfur: haproxy: initial work to support easy-ratelimiting [puppet] - 10https://gerrit.wikimedia.org/r/1005089 (https://phabricator.wikimedia.org/T306580) [12:14:03] 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9583174 (10KCVelaga_WMF) [12:14:12] 06SRE, 10SRE-Access-Requests: Requesting access to wmf-nda, analytics-private-data, analytics-product for kcvelaga - https://phabricator.wikimedia.org/T358658#9583172 (10KCVelaga_WMF) [12:14:18] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on dbstore1007.eqiad.wmnet with reason: host reimage [12:14:33] (03PS1) 10Btullis: Use the superset-admins LDAP group to map to Admin rights [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006871 (https://phabricator.wikimedia.org/T358650) [12:15:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T357189)', diff saved to https://phabricator.wikimedia.org/P58046 and previous config saved to /var/cache/conftool/dbconfig/20240228-121541-arnaudb.json 
[12:15:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1222.eqiad.wmnet with reason: Maintenance [12:15:49] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [12:15:50] (03CR) 10Fabfur: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1005089 (https://phabricator.wikimedia.org/T306580) (owner: 10Fabfur) [12:15:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1222.eqiad.wmnet with reason: Maintenance [12:16:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T357189)', diff saved to https://phabricator.wikimedia.org/P58047 and previous config saved to /var/cache/conftool/dbconfig/20240228-121603-arnaudb.json [12:16:12] (03PS2) 10Muehlenhoff: More postinst changes to cope with Tomcat 9->10 changes [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1007308 (https://phabricator.wikimedia.org/T357748) [12:16:43] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1007.eqiad.wmnet with reason: host reimage [12:17:33] (03CR) 10Btullis: [C: 03+1] superset: add availability monitor (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/1005540 (https://phabricator.wikimedia.org/T356484) (owner: 10Stevemunene) [12:18:30] (ProbeDown) firing: (2) Service wdqs2008:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs2008:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:18:53] (03PS1) 10Klausman: APIGW: Add configuration to expose LW isvc article-descriptions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007318 
(https://phabricator.wikimedia.org/T358654) [12:20:24] (03PS1) 10Jaime Nuche: jenkins: add security patch bot token to releases instance secrets [labs/private] - 10https://gerrit.wikimedia.org/r/1007319 (https://phabricator.wikimedia.org/T350065) [12:22:34] (03PS1) 10Slyngshede: PKI: Switch alerts to use the x509 metric. [alerts] - 10https://gerrit.wikimedia.org/r/1007321 (https://phabricator.wikimedia.org/T350694) [12:22:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T357189)', diff saved to https://phabricator.wikimedia.org/P58048 and previous config saved to /var/cache/conftool/dbconfig/20240228-122252-arnaudb.json [12:23:02] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [12:23:08] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add superset-admins to sensitive groups for offboarding [puppet] - 10https://gerrit.wikimedia.org/r/1007305 (https://phabricator.wikimedia.org/T358650) (owner: 10Muehlenhoff) [12:23:25] (03Abandoned) 10Slyngshede: Silence PKI alerts until we have better data. 
[alerts] - 10https://gerrit.wikimedia.org/r/1006857 (owner: 10Slyngshede) [12:24:26] (03Abandoned) 10Slyngshede: C:ganeti::prometheus Collect process information [puppet] - 10https://gerrit.wikimedia.org/r/1004061 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:30:20] (03PS2) 10Btullis: Use the superset-admins LDAP group to map to Admin rights [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006871 (https://phabricator.wikimedia.org/T358650) [12:31:20] (03PS2) 10Jgiannelos: mobileapps: Switchover outgoing parsoid traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007317 (https://phabricator.wikimedia.org/T339865) [12:34:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P58049 and previous config saved to /var/cache/conftool/dbconfig/20240228-123448-root.json [12:36:09] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1007308 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [12:37:20] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1007.eqiad.wmnet with OS bullseye [12:37:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P58050 and previous config saved to /var/cache/conftool/dbconfig/20240228-123759-arnaudb.json [12:40:02] 06SRE, 10Data Pipelines, 06Data-Engineering, 06Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227#9583341 (10mpopov) Okay, so it's been a few years now and this bug still exists and impacts the quality of our analyses substantially (especially for Future... 
[12:40:58] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] More postinst changes to cope with Tomcat 9->10 changes [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1007308 (https://phabricator.wikimedia.org/T357748) (owner: 10Muehlenhoff) [12:41:35] (03CR) 10Brouberol: [C: 03+1] Use the superset-admins LDAP group to map to Admin rights [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006871 (https://phabricator.wikimedia.org/T358650) (owner: 10Btullis) [12:41:45] (03PS4) 10Jelto: prometheus::ops: monitor active etherpad instance only [puppet] - 10https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) [12:42:07] 06SRE, 10SRE Program Management, 07Logos: SRE needs a logo - https://phabricator.wikimedia.org/T312067#9583351 (10Ladsgroup) >>! In T312067#9576651, @kamila wrote: > Inspired by some of the above: {F42210885} [12:42:58] (03CR) 10CI reject: [V: 04-1] prometheus::ops: monitor active etherpad instance only [puppet] - 10https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [12:43:00] (03PS1) 10Jaime Nuche: jenkins: add security patch bot token to releases instance [puppet] - 10https://gerrit.wikimedia.org/r/1007323 (https://phabricator.wikimedia.org/T350065) [12:43:36] (03PS5) 10Jelto: prometheus::ops: monitor active etherpad instance only [puppet] - 10https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) [12:43:40] (03CR) 10JMeybohm: [C: 03+1] admin_ng: do not create a TLS certificate for mw-mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007315 (owner: 10Effie Mouzeli) [12:47:39] !log import cas 6.6.12+wmf12u2 to bookworm-wikimedia T357748 [12:47:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:47:45] T357748: Migrate CAS to Bookworm - https://phabricator.wikimedia.org/T357748 [12:48:58] (03CR) 10FNegri: [C: 03+2] "Agree! 
I'll merge this in the hope that it makes a small improvement over the current situation." [puppet] - 10https://gerrit.wikimedia.org/r/1006066 (https://phabricator.wikimedia.org/T356904) (owner: 10FNegri) [12:49:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P58052 and previous config saved to /var/cache/conftool/dbconfig/20240228-124953-root.json [12:50:05] (03CR) 10Jelto: "ok, let's try that in patchset 5." [puppet] - 10https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [12:51:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 5%: After running optimize', diff saved to https://phabricator.wikimedia.org/P58053 and previous config saved to /var/cache/conftool/dbconfig/20240228-125102-root.json [12:52:01] (03PS1) 10Muehlenhoff: More Tomcat 10 changes T357748 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1007324 [12:52:52] (03CR) 10Btullis: [C: 03+2] Use the superset-admins LDAP group to map to Admin rights [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006871 (https://phabricator.wikimedia.org/T358650) (owner: 10Btullis) [12:53:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P58054 and previous config saved to /var/cache/conftool/dbconfig/20240228-125305-arnaudb.json [12:53:47] (03Merged) 10jenkins-bot: Use the superset-admins LDAP group to map to Admin rights [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006871 (https://phabricator.wikimedia.org/T358650) (owner: 10Btullis) [12:54:00] (03CR) 10Muehlenhoff: [C: 03+2] airflow::instance: Pass web server port as an integer [puppet] - 10https://gerrit.wikimedia.org/r/990060 (owner: 10Muehlenhoff) [12:54:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T352010)', diff saved to 
https://phabricator.wikimedia.org/P58055 and previous config saved to /var/cache/conftool/dbconfig/20240228-125418-ladsgroup.json [12:54:25] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:57:00] !log ayounsi@cumin1002 START - Cookbook sre.hosts.decommission for hosts testvm2006.codfw.wmnet [12:58:02] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:58:30] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:58:59] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [12:59:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [13:01:18] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [13:01:19] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [13:01:50] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [13:03:12] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host dbstore1007.eqiad.wmnet with OS bookworm [13:04:21] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:04:33] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:04:47] PROBLEM - OSPF status on cr1-eqiad is 
CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:04:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P58056 and previous config saved to /var/cache/conftool/dbconfig/20240228-130457-root.json [13:05:18] !log installing bind9 security updates [13:05:21] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [13:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:33] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:05:45] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [13:06:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 10%: After running optimize', diff saved to https://phabricator.wikimedia.org/P58057 and previous config saved to /var/cache/conftool/dbconfig/20240228-130606-root.json [13:08:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T357189)', diff saved to https://phabricator.wikimedia.org/P58058 and previous config saved to /var/cache/conftool/dbconfig/20240228-130811-arnaudb.json [13:08:14] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [13:08:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1225.eqiad.wmnet with reason: Maintenance [13:08:29] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [13:09:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff 
saved to https://phabricator.wikimedia.org/P58059 and previous config saved to /var/cache/conftool/dbconfig/20240228-130925-ladsgroup.json [13:11:12] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache 10.192.0.229 on codfw recursors [13:11:13] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 10.192.0.229 on codfw recursors [13:11:34] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache 255.0.192.10.in-addr.arpa on codfw recursors [13:11:35] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 255.0.192.10.in-addr.arpa on codfw recursors [13:12:15] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2006.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [13:12:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1229.eqiad.wmnet with reason: Maintenance [13:13:05] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2006.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin1002" [13:13:05] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:13:05] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2006.codfw.wmnet [13:13:12] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1229.eqiad.wmnet with reason: Maintenance [13:13:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T357189)', diff saved to https://phabricator.wikimedia.org/P58060 and previous config saved to /var/cache/conftool/dbconfig/20240228-131318-arnaudb.json [13:13:33] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [13:16:14] !log btullis@cumin1002 START - 
Cookbook sre.hosts.downtime for 2:00:00 on dbstore1007.eqiad.wmnet with reason: host reimage [13:18:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T357189)', diff saved to https://phabricator.wikimedia.org/P58061 and previous config saved to /var/cache/conftool/dbconfig/20240228-131804-arnaudb.json [13:18:46] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbstore1007.eqiad.wmnet with reason: host reimage [13:20:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db2177 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P58062 and previous config saved to /var/cache/conftool/dbconfig/20240228-132002-root.json [13:21:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 25%: After running optimize', diff saved to https://phabricator.wikimedia.org/P58063 and previous config saved to /var/cache/conftool/dbconfig/20240228-132111-root.json [13:24:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P58064 and previous config saved to /var/cache/conftool/dbconfig/20240228-132431-ladsgroup.json [13:33:07] !log ayounsi@cumin1002 START - Cookbook sre.hosts.move-vlan for host [13:33:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P58065 and previous config saved to /var/cache/conftool/dbconfig/20240228-133311-arnaudb.json [13:36:05] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [13:36:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 50%: After running optimize', diff saved to https://phabricator.wikimedia.org/P58066 and previous config saved to /var/cache/conftool/dbconfig/20240228-133616-root.json [13:38:02] (ProbeDown) firing: (2) Service api-https:443 has failed probes (http_api-https_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:38:12] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host sretest2003 - ayounsi@cumin1002" [13:39:04] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Update records for host sretest2003 - ayounsi@cumin1002" [13:39:04] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:39:04] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache sretest2003.codfw.wmnet on all recursors [13:39:07] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest2003.codfw.wmnet on all recursors [13:39:07] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache 255.0.192.10.in-addr.arpa on all recursors [13:39:10] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 255.0.192.10.in-addr.arpa on all recursors [13:39:11] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache 5.5.2.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:39:14] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 5.5.2.0.0.0.0.0.2.9.1.0.0.1.0.0.1.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:39:27] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [13:39:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T352010)', diff saved to https://phabricator.wikimedia.org/P58068 and previous config saved to /var/cache/conftool/dbconfig/20240228-133937-ladsgroup.json [13:39:40] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:39:53] !log ladsgroup@cumin1002 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [13:39:55] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:39:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T352010)', diff saved to https://phabricator.wikimedia.org/P58069 and previous config saved to /var/cache/conftool/dbconfig/20240228-133959-ladsgroup.json [13:40:11] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbstore1007.eqiad.wmnet with OS bookworm [13:40:52] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [13:40:52] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache sretest2003.codfw.wmnet on all recursors [13:40:56] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) sretest2003.codfw.wmnet on all recursors [13:40:56] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache 6.10.192.10.in-addr.arpa on all recursors [13:40:59] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 6.10.192.10.in-addr.arpa on all recursors [13:40:59] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache 6.0.0.0.0.1.0.0.2.9.1.0.0.1.0.0.b.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:41:02] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) 6.0.0.0.0.1.0.0.2.9.1.0.0.1.0.0.b.0.1.0.0.6.8.0.0.0.0.0.0.2.6.2.ip6.arpa on all recursors [13:41:02] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.move-vlan (exit_code=99) for host [13:43:02] (ProbeDown) firing: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:47:56] 07Puppet, 10Cloud-VPS, 06Infrastructure-Foundations, 06cloud-services-team, 13Patch-For-Review: wmf_auto_restart_cron.service failing 
in Cloud VPS bookworm instances - https://phabricator.wikimedia.org/T358343#9583540 (10MoritzMuehlenhoff) 05Open→03Resolved I added a new Hiera option for this: profi... [13:48:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P58070 and previous config saved to /var/cache/conftool/dbconfig/20240228-134817-arnaudb.json [13:49:24] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [13:50:45] (03CR) 10Filippo Giunchedi: [C: 03+1] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/1007291 (owner: 10Muehlenhoff) [13:51:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 75%: After running optimize', diff saved to https://phabricator.wikimedia.org/P58071 and previous config saved to /var/cache/conftool/dbconfig/20240228-135121-root.json [13:52:55] (03PS2) 10Elukey: kserve: upgrade to upstream 0.11.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007330 (https://phabricator.wikimedia.org/T337213) [13:52:59] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sretest2003 - ayounsi@cumin1002" [13:53:02] (ProbeDown) resolved: (2) Service api-https:443 has failed probes (http_api-https_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:53:16] (03PS1) 10Jelto: passwords: update etherpad labs [labs/private] - 10https://gerrit.wikimedia.org/r/1007331 (https://phabricator.wikimedia.org/T316421) [13:53:30] (03PS2) 10Jelto: passwords: update etherpad labs [labs/private] - 10https://gerrit.wikimedia.org/r/1007331 (https://phabricator.wikimedia.org/T316421) [13:53:51] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: 
"Triggered by cookbooks.sre.dns.netbox: sretest2003 - ayounsi@cumin1002" [13:53:51] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: Time to do the UTC afternoon backport window deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240228T1400). [14:00:05] No Gerrit patches in the queue for this window AFAICS. [14:00:12] * Lucas_WMDE can’t deploy anyway [14:01:07] (03CR) 10Muehlenhoff: [C: 03+2] standard_packages: Remove more obsolete packages after buster->bullseye update [puppet] - 10https://gerrit.wikimedia.org/r/1007291 (owner: 10Muehlenhoff) [14:03:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T357189)', diff saved to https://phabricator.wikimedia.org/P58072 and previous config saved to /var/cache/conftool/dbconfig/20240228-140323-arnaudb.json [14:03:26] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1233.eqiad.wmnet with reason: Maintenance [14:03:31] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [14:03:40] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1233.eqiad.wmnet with reason: Maintenance [14:03:46] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T357189)', diff saved to https://phabricator.wikimedia.org/P58073 and previous config saved to /var/cache/conftool/dbconfig/20240228-140346-arnaudb.json [14:04:37] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 5 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9583645 (10Ganesha811) @Jdlrobson thanks for laying out this plan, and thank you to all the WMF staff who are thinking about how best to do this! It looks... 
[14:06:01] (03PS1) 10Filippo Giunchedi: thanos: revert to standard logging level [puppet] - 10https://gerrit.wikimedia.org/r/1007332 (https://phabricator.wikimedia.org/T356788) [14:06:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 100%: After running optimize', diff saved to https://phabricator.wikimedia.org/P58074 and previous config saved to /var/cache/conftool/dbconfig/20240228-140626-root.json [14:07:10] (03CR) 10CI reject: [V: 04-1] thanos: revert to standard logging level [puppet] - 10https://gerrit.wikimedia.org/r/1007332 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [14:08:25] (03PS2) 10Filippo Giunchedi: thanos: revert to standard logging level [puppet] - 10https://gerrit.wikimedia.org/r/1007332 (https://phabricator.wikimedia.org/T356788) [14:09:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T357189)', diff saved to https://phabricator.wikimedia.org/P58075 and previous config saved to /var/cache/conftool/dbconfig/20240228-140938-arnaudb.json [14:09:45] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [14:10:16] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007327 (owner: 10Muehlenhoff) [14:11:43] So, if I want to deploy a config change related to Parser Cache, who can help me keep an eye on the health of the respective backend systems? Doesn't have to be now, could be later today or tomorrow. Amir is out, so... who else knows about that stuff? arnaudb? kormat? [14:11:43] The patch in question is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/994212 [14:13:09] Amir1: is not out afaik duesen → I'd be glad to shadow the conversation though! 
[14:13:51] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance [14:14:07] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2190.codfw.wmnet with reason: Maintenance [14:14:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2190 (T352010)', diff saved to https://phabricator.wikimedia.org/P58076 and previous config saved to /var/cache/conftool/dbconfig/20240228-141413-ladsgroup.json [14:14:16] I can take care of it with arnaudb ! [14:14:19] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:14:26] (03CR) 10Daniel Kinzler: [C: 03+1] "No idea how this config works, but +1 for the intent." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007317 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [14:14:36] oh that fun one [14:15:33] Amir1: hey, I thought you were off getting married! Get off the internet! [14:15:44] I was :D [14:15:55] back starting today [14:15:59] CONGRATULATIONS!!!!1111eleven [14:16:06] <3 <3 [14:16:45] if you need me to be around, just ping me [14:17:36] Amir1: if you are up for it, we can just do it now. You have all the context. [14:17:49] sure [14:18:22] arnaudb: https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1 this is what we should keep an eye on [14:18:32] ok, here goes... [14:18:35] and mysql aggregated on pc [14:18:43] 06SRE, 06Infrastructure-Foundations: Integrate Bullseye 11.9 point update - https://phabricator.wikimedia.org/T357144#9583706 (10MoritzMuehlenhoff) [14:18:55] (03CR) 10Effie Mouzeli: [C: 03+2] admin_ng: do not create a TLS certificate for mw-mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007315 (owner: 10Effie Mouzeli) [14:19:11] Is deploy2002 good? 
[14:19:50] https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1&var-site=eqiad&var-group=parsercache&var-shard=All&var-role=All [14:19:55] duesen: Ja [14:20:28] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by daniel@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) (owner: 10Daniel Kinzler) [14:20:35] (03PS1) 10Majavah: ldap: fix sssd socket activation on bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1007335 [14:21:17] meh, merge conflict [14:21:37] just click on rebase [14:21:45] in config repos, it's like this [14:22:03] if someone has touched IS.php since the change, it shows it but it's a lie [14:22:09] (03PS7) 10Daniel Kinzler: Configure parser cache filters for parsoid-pcache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) [14:22:17] (03CR) 10TrainBranchBot: "Approved by daniel@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) (owner: 10Daniel Kinzler) [14:22:19] (03Merged) 10jenkins-bot: admin_ng: do not create a TLS certificate for mw-mcrouter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007315 (owner: 10Effie Mouzeli) [14:22:43] (03PS1) 10Andrew Bogott: wmfsink: add try/except around deletion call [puppet] - 10https://gerrit.wikimedia.org/r/1007337 [14:22:43] !log jiji@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:22:56] !log jiji@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:23:17] (03Merged) 10jenkins-bot: Configure parser cache filters for parsoid-pcache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) (owner: 10Daniel Kinzler) [14:23:28] !log jiji@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. 
[14:23:42] !log daniel@deploy2002 Started scap: Backport for [[gerrit:994212|Configure parser cache filters for parsoid-pcache (T346765 T355375)]] [14:23:49] T346765: Control ParserCache use per namespace, based on parse time and output size. - https://phabricator.wikimedia.org/T346765 [14:23:50] T355375: Removed wgTemporaryParsoidHandlerParserCacheWriteRatio - https://phabricator.wikimedia.org/T355375 [14:23:56] (03CR) 10CI reject: [V: 04-1] wmfsink: add try/except around deletion call [puppet] - 10https://gerrit.wikimedia.org/r/1007337 (owner: 10Andrew Bogott) [14:24:14] !log jiji@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. [14:24:25] !log jiji@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:24:36] Amir1: yea, I know gerrit lies on the config repo. But I thought scap wouldn't complain if the conflict wasn't real. Otoh, the rebase was clean when i did it locally. [14:24:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P58077 and previous config saved to /var/cache/conftool/dbconfig/20240228-142445-arnaudb.json [14:25:11] !log daniel@deploy2002 daniel: Backport for [[gerrit:994212|Configure parser cache filters for parsoid-pcache (T346765 T355375)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:25:13] (03PS2) 10Muehlenhoff: airflow: Add option to pass the firewall settings via firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1007327 [14:25:35] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1007328 (https://phabricator.wikimedia.org/T358343) (owner: 10Majavah) [14:25:50] !log jiji@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:26:24] duesen: just a double check, this effectively makes commons not store any PC entry for images [14:26:28] is that intentional? 
[14:26:35] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1511/co" [puppet] - 10https://gerrit.wikimedia.org/r/1007335 (owner: 10Majavah) [14:26:37] https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/6030/consoleFull [14:26:42] (03PS2) 10Andrew Bogott: wmfsink: add try/except around deletion call [puppet] - 10https://gerrit.wikimedia.org/r/1007337 [14:26:42] Amir1, arnaudb: when looking at the grafana dashboard, don't forget to enable parsoid_pcache at the top. Otherwise you may miss the fun. [14:27:04] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 6:00:00 on wdqs2008.codfw.wmnet with reason: T355617 [14:27:11] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [14:27:17] duesen: do you mean to tick "parsercache" group? [14:27:20] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on wdqs2008.codfw.wmnet with reason: T355617 [14:27:25] (03CR) 10Ssingh: "Looks good, two minor nits!" [puppet] - 10https://gerrit.wikimedia.org/r/1006976 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [14:27:27] (03CR) 10Klausman: [C: 03+1] kserve: upgrade to upstream 0.11.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007330 (https://phabricator.wikimedia.org/T337213) (owner: 10Elukey) [14:27:28] Amir1: commons should not store PC entries of file description pages for parsoid. But it should for the old parser. [14:27:36] ah okay [14:27:43] Amir1: that's how it already currently is (mostly), except that now it's hacked in [14:27:56] we can reconsider when switching to parsoid for read :P [14:28:29] arnaudb: on https://grafana.wikimedia.org/d/000000106/parser-cache, there's a "cache name" setting at the top. it needs to include parsoid_pcache.
[14:28:30] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:28:39] arnaudb: https://grafana.wikimedia.org/d/000000106/parser-cache?orgId=1 in the cache_name above [14:28:47] <3 thanks [14:29:19] it's the new parser of wikitext in mw. Fun stuff :P [14:29:22] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:29:53] I've subscribed to the beta, most uneventful subscription ever [14:30:04] which is a good thing I guess [14:30:18] Ok, confirmed that nothing explodes on mwdebug2002 [14:30:21] !log daniel@deploy2002 daniel: Continuing with sync [14:30:25] sending it [14:31:09] arnaudb: lots of complexity in the background, to juggle the two kinds of almost-the-same-but-not-quite outputs [14:31:24] oh I bet there is! [14:31:51] which data center should I be looking at for PC writes? Both of them, I suppose... [14:32:15] should not it be just on codfw? [14:32:39] we serve traffic for eqiad too [14:32:46] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[14:32:47] this can trigger read [14:32:59] read can trigger this [14:33:10] I suppose the most important graph is this one: https://grafana.wikimedia.org/goto/a9N5llTSk?orgId=1 [14:33:48] one other failure scenario is that it might start parsing a lot more which adds read everywhere, including core sections too [14:35:17] (03CR) 10Btullis: [C: 03+2] Rename victorops-analytics to victorops-data-platform [puppet] - 10https://gerrit.wikimedia.org/r/1006047 (https://phabricator.wikimedia.org/T344202) (owner: 10Btullis) [14:36:39] (03CR) 10Herron: [C: 03+1] thanos: revert to standard logging level [puppet] - 10https://gerrit.wikimedia.org/r/1007332 (https://phabricator.wikimedia.org/T356788) (owner: 10Filippo Giunchedi) [14:37:18] Amir1: in that case, I'd expect we'd see the write rate or hit rate go down [14:37:20] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:37:31] yeah [14:38:02] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:38:12] (03Abandoned) 10Bking: wdqs: remove failing blackbox check [puppet] - 10https://gerrit.wikimedia.org/r/1007014 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [14:38:32] (03PS16) 10Brouberol: external-services: define a chart referencing external kafka/zookeeper clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [14:38:38] !log daniel@deploy2002 Finished scap: Backport for [[gerrit:994212|Configure parser cache filters for parsoid-pcache (T346765 T355375)]] (duration: 14m 56s) [14:38:49] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:38:50] T346765: Control ParserCache use per namespace, based on parse time and output size. 
- https://phabricator.wikimedia.org/T346765 [14:38:50] T355375: Removed wgTemporaryParsoidHandlerParserCacheWriteRatio - https://phabricator.wikimedia.org/T355375 [14:38:57] The vast majority of writes are triggered by REST API hits, which are caused by requests originating from changeprop (via restbase). That should go away soon. [14:39:11] !log jiji@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:39:14] (03PS1) 10KartikMistry: Section Translation: Add 'nb' in target language code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007340 (https://phabricator.wikimedia.org/T353734) [14:39:18] !log jiji@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [14:39:40] scap complete [14:39:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P58079 and previous config saved to /var/cache/conftool/dbconfig/20240228-143951-arnaudb.json [14:39:52] !log jiji@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [14:40:21] I am starting to see "save_filtered" outcomes show in the graph. Good. [14:40:55] !log jiji@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [14:41:02] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9583812 (10Jhancock.wm) [14:41:14] The filtered (non)writes were previously invisible. Now we are collecting stats on them. 
[14:41:33] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:41:45] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:41:50] (03PS3) 10Andrew Bogott: wmfsink: add try/except around deletion call [puppet] - 10https://gerrit.wikimedia.org/r/1007337 (https://phabricator.wikimedia.org/T358672) [14:42:11] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:42:24] (03CR) 10Nikerabbit: "Should "no" be removed then?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007340 (https://phabricator.wikimedia.org/T353734) (owner: 10KartikMistry) [14:43:06] (03CR) 10Fabfur: [V: 03+1] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1006976 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [14:43:27] (03PS8) 10Fabfur: cache: start using benthos on single host for haproxy log parsing [puppet] - 10https://gerrit.wikimedia.org/r/1006976 (https://phabricator.wikimedia.org/T358109) [14:44:18] (03PS6) 10Effie Mouzeli: mw-mcrouter: add helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) [14:44:21] I guess we shouldn't see save_success rate go much above 40k for parsoid_pcache. If it goes beyond 50k I'd start to get worried. 
[14:44:42] (03PS7) 10Effie Mouzeli: mw-mcrouter: add helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) [14:45:30] Looks like it's fluctuating normally [14:46:01] yeah, and the lower-level graphs seem to be fine [14:46:18] I go eat, if anything happens Daniel broke it [14:46:45] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.858 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:47:05] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 15 Apr 2024 02:06:19 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:47:20] I'm seeing an increase in errors from the corresponding rest API endpoint. investigating... [14:47:29] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51452 bytes in 0.122 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:50:11] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [14:50:19] (03PS2) 10KartikMistry: Section Translation: Add 'nb' in target language code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007340 (https://phabricator.wikimedia.org/T353734) [14:51:18] The API errors are unrelated. It's a known issue, T350852. [14:51:19] T350852: Exception: Invalid ETag returned by handler: Expected """ at 69 - https://phabricator.wikimedia.org/T350852 [14:52:00] Looking good! [14:52:09] Amir1, arnaudb: thank you!
[14:53:07] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2215.codfw.wmnet with OS bookworm [14:53:12] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9583868 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2215.codfw.wmnet with OS bookworm [14:53:42] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 5 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9583869 (10Jdlrobson) [14:54:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T357189)', diff saved to https://phabricator.wikimedia.org/P58080 and previous config saved to /var/cache/conftool/dbconfig/20240228-145457-arnaudb.json [14:55:00] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2197.codfw.wmnet with OS bookworm [14:55:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [14:55:04] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [14:55:05] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9583875 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2197.codfw.wmnet with OS bookworm [14:55:14] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1239.eqiad.wmnet with reason: Maintenance [14:56:06] 06SRE, 10ops-codfw, 06DBA, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871#9583876 (10JMeybohm) [14:56:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.provision for host db2198.mgmt.codfw.wmnet with reboot policy FORCED [14:57:37] RECOVERY - Juniper 
alarms on lsw1-b7-codfw.mgmt is OK: JNX_ALARMS OK - 0 red alarms, 0 yellow alarms https://wikitech.wikimedia.org/wiki/Network_monitoring%23Juniper_alarm [14:58:02] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:26] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1246.eqiad.wmnet with reason: Maintenance [14:59:41] (03PS17) 10Brouberol: external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [14:59:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1246.eqiad.wmnet with reason: Maintenance [14:59:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1246 (T357189)', diff saved to https://phabricator.wikimedia.org/P58081 and previous config saved to /var/cache/conftool/dbconfig/20240228-145958-arnaudb.json [15:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240228T1500) [15:00:10] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [15:00:13] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [15:02:05] 06SRE, 10ops-codfw: lsw1-b7-codfw - FPC0: PEM 0 Not Powered - https://phabricator.wikimedia.org/T358639#9583905 (10Jhancock.wm) Checked the cable. Wasn't seated all the way but should be good now. Confirm alert cleared? 
before closing [15:03:27] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1007327 (owner: 10Muehlenhoff) [15:03:47] !log fab@deploy2002 Started deploy [airflow-dags/research@4bed377]: (no justification provided) [15:04:29] !log fab@deploy2002 Finished deploy [airflow-dags/research@4bed377]: (no justification provided) (duration: 00m 42s) [15:05:11] (03PS18) 10Brouberol: external-services: define a chart referencing external services clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/984819 (https://phabricator.wikimedia.org/T331894) [15:05:17] (03PS3) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-01-18-182630 to 2024-02-12-160222 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002625 (https://phabricator.wikimedia.org/T287978) [15:05:22] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Upgrade evaluators from 2024-01-18-182630 to 2024-02-12-160222 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002625 (https://phabricator.wikimedia.org/T287978) (owner: 10Jforrester) [15:05:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T357189)', diff saved to https://phabricator.wikimedia.org/P58082 and previous config saved to /var/cache/conftool/dbconfig/20240228-150554-arnaudb.json [15:06:13] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [15:06:31] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-01-18-182630 to 2024-02-12-160222 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1002625 (https://phabricator.wikimedia.org/T287978) (owner: 10Jforrester) [15:06:36] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host db2198.mgmt.codfw.wmnet with reboot policy FORCED [15:08:13] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:08:53] !log jforrester@deploy2002 helmfile [staging] DONE 
helmfile.d/services/wikifunctions: apply [15:09:20] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:09:37] 06SRE, 10ops-codfw: lsw1-b7-codfw - FPC0: PEM 0 Not Powered - https://phabricator.wikimedia.org/T358639#9583927 (10ayounsi) 05Open→03Resolved All good, thanks ! [15:10:18] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:10:23] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:11:24] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:12:20] (03PS1) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-02-12-155846 to 2024-02-26-150614 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007351 (https://phabricator.wikimedia.org/T335695) [15:12:42] (03PS3) 10KartikMistry: Section Translation: Add 'nb' in target language code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007340 (https://phabricator.wikimedia.org/T353734) [15:12:51] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Upgrade orchestrator from 2024-02-12-155846 to 2024-02-26-150614 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007351 (https://phabricator.wikimedia.org/T335695) (owner: 10Jforrester) [15:13:19] (03CR) 10KartikMistry: "Done" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007340 (https://phabricator.wikimedia.org/T353734) (owner: 10KartikMistry) [15:13:42] (03Merged) 10jenkins-bot: wikifunctions: Upgrade orchestrator from 2024-02-12-155846 to 2024-02-26-150614 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007351 (https://phabricator.wikimedia.org/T335695) (owner: 10Jforrester) [15:13:51] (03PS1) 10Mabualruz: Performance Impact Assessment for Night Mode Style Correction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007352 (https://phabricator.wikimedia.org/T358240) [15:14:40] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 
db2215.codfw.wmnet with reason: host reimage [15:14:51] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:15:31] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:15:43] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2197.codfw.wmnet with reason: host reimage [15:15:52] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:16:29] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet 7.0: Phase out cergen - https://phabricator.wikimedia.org/T357750#9583989 (10MoritzMuehlenhoff) >>! In T357750#9576803, @CDanis wrote: > Should this ticket really be "deprecate cergen"? :) Good point :-) [15:16:31] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet 7.0: Phase out cergen - https://phabricator.wikimedia.org/T357750#9583991 (10MoritzMuehlenhoff) [15:16:59] (03PS1) 10Jforrester: wikifunctions: Upgrade evaluators from 2024-02-12-160222 to 2024-02-26-150300 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007353 (https://phabricator.wikimedia.org/T296937) [15:17:00] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:17:07] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:17:13] (03PS1) 10Majavah: dynamicproxy: fix connection pooling timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1007354 (https://phabricator.wikimedia.org/T358672) [15:17:19] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host db2198.codfw.wmnet with OS bookworm [15:17:28] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9584004 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host db2198.codfw.wmnet with OS bookworm [15:17:30] 
!log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2215.codfw.wmnet with reason: host reimage [15:18:26] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:19:13] (03CR) 10Jforrester: [C: 03+2] wikifunctions: Upgrade evaluators from 2024-02-12-160222 to 2024-02-26-150300 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007353 (https://phabricator.wikimedia.org/T296937) (owner: 10Jforrester) [15:20:21] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2197.codfw.wmnet with reason: host reimage [15:21:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P58083 and previous config saved to /var/cache/conftool/dbconfig/20240228-152101-arnaudb.json [15:21:04] (03Merged) 10jenkins-bot: wikifunctions: Upgrade evaluators from 2024-02-12-160222 to 2024-02-26-150300 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007353 (https://phabricator.wikimedia.org/T296937) (owner: 10Jforrester) [15:22:22] (03PS1) 10Majavah: hieradata: enable unattended-upgrades on project-proxy [puppet] - 10https://gerrit.wikimedia.org/r/1007356 [15:22:28] 06SRE, 10observability, 10FY2023/2024-Q3, 10Incident Followup, 13Patch-For-Review: thanos-query probedown due to OOM of both eqiad titan frontends - https://phabricator.wikimedia.org/T356788#9584038 (10lmata) [15:23:33] !log jforrester@deploy2002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [15:25:03] !log jforrester@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [15:25:33] !log jforrester@deploy2002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [15:26:49] 07sre-alert-triage, 06Data-Platform-SRE: Alert in need of triage: Number of requests triggering circuit breakers due to excessive memory usage (instance graphite1005) - 
https://phabricator.wikimedia.org/T357614#9584112 (10lmata) Looking at {T355795}, this seems like something for the data platform team. Please... [15:27:31] (03Abandoned) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-02-12-155846 to 2024-02-22-165335 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005843 (https://phabricator.wikimedia.org/T335695) (owner: 10Jforrester) [15:28:03] !log jforrester@deploy2002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [15:28:05] !log jforrester@deploy2002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [15:28:57] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 92 probes of 735 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:29:13] (03CR) 10Jforrester: "Happy to deploy this if you're looking for someone to do so; I did this recently for I7de800fc4457d7cea6ef4e2664d993ebdde6a456 and a coupl" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006950 (owner: 10Clément Goubert) [15:30:35] !log jforrester@deploy2002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [15:31:53] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:33:13] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:33:15] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2215.codfw.wmnet with OS bookworm [15:33:28] (03CR) 10David Caro: dynamicproxy: fix connection pooling timeouts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007354 (https://phabricator.wikimedia.org/T358672) (owner: 10Majavah) 
[15:33:57] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 41 probes of 735 (alerts on 90) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:34:17] (03PS2) 10Majavah: dynamicproxy: fix connection pooling timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1007354 (https://phabricator.wikimedia.org/T358672) [15:34:19] (03PS2) 10Majavah: hieradata: enable unattended-upgrades on project-proxy [puppet] - 10https://gerrit.wikimedia.org/r/1007356 [15:34:33] (03CR) 10Majavah: dynamicproxy: fix connection pooling timeouts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007354 (https://phabricator.wikimedia.org/T358672) (owner: 10Majavah) [15:35:39] !log jayme@cumin1002 conftool action : set/pooled=inactive; selector: name=mw23(2[5-9]|3[0-4]).codfw.wmnet [15:35:48] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:36:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246', diff saved to https://phabricator.wikimedia.org/P58084 and previous config saved to /var/cache/conftool/dbconfig/20240228-153607-arnaudb.json [15:37:36] (03CR) 10Jelto: [V: 03+2 C: 03+2] passwords: update etherpad labs [labs/private] - 10https://gerrit.wikimedia.org/r/1007331 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [15:37:53] !log jhancock@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2198.codfw.wmnet with reason: host reimage [15:39:42] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1007354 (https://phabricator.wikimedia.org/T358672) (owner: 10Majavah) [15:40:00] (03CR) 10Jgiannelos: [C: 04-1] "Blocking this one until MW API workers are increased in order to receive more traffic:" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1007317 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [15:40:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 0:40:00 on 7 hosts with reason: Silence for maintenance T355871 [15:40:21] (03CR) 10Majavah: [C: 03+2] dynamicproxy: fix connection pooling timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1007354 (https://phabricator.wikimedia.org/T358672) (owner: 10Majavah) [15:40:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:40:00 on 7 hosts with reason: Silence for maintenance T355871 [15:40:29] T355871: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871 [15:40:29] (03CR) 10Majavah: [C: 03+2] hieradata: enable unattended-upgrades on project-proxy [puppet] - 10https://gerrit.wikimedia.org/r/1007356 (owner: 10Majavah) [15:40:38] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2198.codfw.wmnet with reason: host reimage [15:40:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T355871 - depooling db2110 db2111 db2124 db2134 db2096 db2161 db2162', diff saved to https://phabricator.wikimedia.org/P58085 and previous config saved to /var/cache/conftool/dbconfig/20240228-154043-arnaudb.json [15:41:55] (03PS1) 10PipelineBot: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006875 [15:42:59] jouncebot: nowandnext [15:42:59] For the next 0 hour(s) and 17 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240228T1500) [15:42:59] In 2 hour(s) and 17 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240228T1800) [15:43:41] (03PS2) 10Samtar: InitialiseSettings: Enable Edit Recovery on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006498 (https://phabricator.wikimedia.org/T355548) 
[15:44:07] (03PS1) 10DDesouza: miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007362 (https://phabricator.wikimedia.org/T344471) [15:44:33] (03PS1) 10Muehlenhoff: Stop including profile::configmaster in puppetmaster frontends [puppet] - 10https://gerrit.wikimedia.org/r/1007363 (https://phabricator.wikimedia.org/T341717) [15:45:12] !log jhancock@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:45:18] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2197.codfw.wmnet with OS bookworm [15:45:24] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9584220 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2197.codfw.wmnet with OS bookworm completed: - db2197 (**PASS**) -... 
[15:45:40] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by samtar@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006498 (https://phabricator.wikimedia.org/T355548) (owner: 10Samtar) [15:46:00] 06SRE, 06DBA, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870#9584222 (10cmooney) 05Open→03Resolved a:03cmooney [15:46:08] 06SRE, 10ops-codfw, 06Infrastructure-Foundations, 10netops: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544#9584224 (10cmooney) [15:46:24] (03Merged) 10jenkins-bot: InitialiseSettings: Enable Edit Recovery on arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006498 (https://phabricator.wikimedia.org/T355548) (owner: 10Samtar) [15:46:51] !log samtar@deploy2002 Started scap: Backport for [[gerrit:1006498|InitialiseSettings: Enable Edit Recovery on arwiki (T355548)]] [15:46:57] T355548: Edit Recovery deployment - https://phabricator.wikimedia.org/T355548 [15:48:12] !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=maps2009.codfw.wmnet [15:48:23] !log samtar@deploy2002 samtar: Backport for [[gerrit:1006498|InitialiseSettings: Enable Edit Recovery on arwiki (T355548)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [15:48:34] * TheresNoTime tests [15:48:50] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on asw-b-codfw,cr[1-2]-codfw,lsw1-b6-codfw.mgmt with reason: prepping for server uplink migration codfw rack b6 [15:49:07] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on asw-b-codfw,cr[1-2]-codfw,lsw1-b6-codfw.mgmt with reason: prepping for server uplink migration codfw rack b6 [15:49:17] !log samtar@deploy2002 samtar: Continuing with sync [15:49:37] 06SRE, 10ops-codfw, 06DBA, 06Infrastructure-Foundations, 10netops: Migrate servers in 
codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871#9584239 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=1f99f40e-0648-48d6-a40a-a3ebae9e7b2b) set by cmoon... [15:51:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1246 (T357189)', diff saved to https://phabricator.wikimedia.org/P58086 and previous config saved to /var/cache/conftool/dbconfig/20240228-155113-arnaudb.json [15:51:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:51:18] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [15:51:20] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [15:51:23] !log configuring lsw1-b6-codfw in advance of server migration T355871 [15:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:51:39] T355871: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871 [15:53:33] (03CR) 10Slyngshede: [C: 03+2] More Tomcat 10 changes T357748 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1007324 (owner: 10Muehlenhoff) [15:54:18] (03CR) 10Slyngshede: [C: 03+1] "Look good to me, just don't understand why Tomcat feel the need to have the version in directory names." [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1007324 (owner: 10Muehlenhoff) [15:55:00] !log jhancock@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:55:29] (03CR) 10Muehlenhoff: "Older Debian releases had two Tomcat releases in parallel, hence the separate config dirs." 
[software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1007324 (owner: 10Muehlenhoff) [15:55:33] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] More Tomcat 10 changes T357748 [software/cas-overlay-template] - 10https://gerrit.wikimedia.org/r/1007324 (owner: 10Muehlenhoff) [15:55:49] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [15:55:53] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - jhancock@cumin2002" [15:55:59] !log jhancock@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2198.codfw.wmnet with OS bookworm [15:56:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2097.codfw.wmnet with reason: Maintenance [15:56:07] 06SRE, 10ops-codfw, 10Cassandra, 10decommission-hardware: Decommission sessionstore200[1-3] - https://phabricator.wikimedia.org/T357356#9584252 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm [15:56:14] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9584262 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host db2198.codfw.wmnet with OS bookworm completed: - db2198 (**WARN**) -... 
[15:57:02] !log samtar@deploy2002 Finished scap: Backport for [[gerrit:1006498|InitialiseSettings: Enable Edit Recovery on arwiki (T355548)]] (duration: 10m 10s) [15:57:13] T355548: Edit Recovery deployment - https://phabricator.wikimedia.org/T355548 [15:58:09] (03CR) 10Elukey: [C: 03+2] kserve: upgrade to upstream 0.11.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007330 (https://phabricator.wikimedia.org/T337213) (owner: 10Elukey) [15:59:04] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9584272 (10Jhancock.wm) 05Open→03Resolved [15:59:08] (03CR) 10Muehlenhoff: [C: 04-2] "After looking at access logs on puppetmaster1001; this is still used for serve the sha1s for the puppet and labsprivate" [puppet] - 10https://gerrit.wikimedia.org/r/1007363 (https://phabricator.wikimedia.org/T341717) (owner: 10Muehlenhoff) [15:59:28] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9584278 (10Jhancock.wm) @Marostegui this is complete! 
[15:59:41] !log sudo cumin "A:dns-rec" "disable-puppet 'merging CR 1006955'" [15:59:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:00:12] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:dns::auth::update: move authdns-update state to confd [puppet] - 10https://gerrit.wikimedia.org/r/1006955 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [16:01:43] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2104.codfw.wmnet with reason: Maintenance [16:01:56] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2104.codfw.wmnet with reason: Maintenance [16:02:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2104 (T357189)', diff saved to https://phabricator.wikimedia.org/P58087 and previous config saved to /var/cache/conftool/dbconfig/20240228-160202-arnaudb.json [16:02:17] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:04:02] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 0:30:00 on 37 hosts with reason: Migrating servers in codfw rack B6 to lsw1-b6-codfw [16:04:24] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 13Patch-For-Review, 10Puppet 7.0: Phase out cergen - https://phabricator.wikimedia.org/T357750#9584295 (10MoritzMuehlenhoff) [16:04:33] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 0:30:00 on 37 hosts with reason: Migrating servers in codfw rack B6 to lsw1-b6-codfw [16:04:39] 06SRE, 10ops-codfw, 06DBA, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871#9584297 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=691919af-8b8a-4f2d-b390-eea3c6a54f5c) set by cmoon... [16:06:01] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. 
[16:06:10] (03CR) 10Andrew Bogott: [C: 03+1] dynamicproxy: fix connection pooling timeouts [puppet] - 10https://gerrit.wikimedia.org/r/1007354 (https://phabricator.wikimedia.org/T358672) (owner: 10Majavah) [16:06:16] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [16:06:21] jouncebot nowandnext [16:06:21] No deployments scheduled for the next 1 hour(s) and 53 minute(s) [16:06:21] In 1 hour(s) and 53 minute(s): MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240228T1800) [16:06:34] (03CR) 10Andrew Bogott: [C: 03+2] wmfsink: add try/except around deletion call [puppet] - 10https://gerrit.wikimedia.org/r/1007337 (https://phabricator.wikimedia.org/T358672) (owner: 10Andrew Bogott) [16:06:49] (03PS2) 10Dzahn: delete passwords::racktables [labs/private] - 10https://gerrit.wikimedia.org/r/1007008 (https://phabricator.wikimedia.org/T327405) [16:07:06] 06SRE, 10ops-codfw, 06DC-Ops, 06Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9584299 (10Marostegui) Thank you so much @Jhancock.wm! 
[16:07:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T357189)', diff saved to https://phabricator.wikimedia.org/P58088 and previous config saved to /var/cache/conftool/dbconfig/20240228-160734-arnaudb.json [16:07:52] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:08:30] (03CR) 10Nik Gkountas: [C: 03+1] Section Translation: Add 'nb' in target language code [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007340 (https://phabricator.wikimedia.org/T353734) (owner: 10KartikMistry) [16:10:45] (03PS1) 10Elukey: kserve: bump Docker image default versions to 0.11.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007391 (https://phabricator.wikimedia.org/T337213) [16:11:00] !log dancy@deploy2002 Installing scap version "4.67.0" for 445 hosts [16:11:42] (03CR) 10Klausman: [C: 03+1] kserve: bump Docker image default versions to 0.11.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007391 (https://phabricator.wikimedia.org/T337213) (owner: 10Elukey) [16:11:58] !log dancy@deploy2002 Installation of scap version "4.67.0" completed for 445 hosts [16:11:58] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6001.wikimedia.org,service=authdns-update [16:12:19] !log dancy@deploy2002 Started scap: testing new scap release [16:12:27] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org,service=authdns-update [16:12:27] 06SRE, 10ops-codfw, 06DBA, 06Infrastructure-Foundations, 10netops: Migrate servers in codfw rack B6 from asw-b6-codfw to lsw1-b6-codfw - https://phabricator.wikimedia.org/T355871#9584311 (10cmooney) Works completed, all servers moved to the new switch and back responding to ping now. No issues. 
[16:12:52] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2110 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58089 and previous config saved to /var/cache/conftool/dbconfig/20240228-161251-arnaudb.json [16:12:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58090 and previous config saved to /var/cache/conftool/dbconfig/20240228-161254-arnaudb.json [16:12:56] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps2009.codfw.wmnet [16:13:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58091 and previous config saved to /var/cache/conftool/dbconfig/20240228-161303-arnaudb.json [16:13:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2096 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58092 and previous config saved to /var/cache/conftool/dbconfig/20240228-161312-arnaudb.json [16:13:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58093 and previous config saved to /var/cache/conftool/dbconfig/20240228-161318-arnaudb.json [16:13:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 20%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58094 and previous config saved to /var/cache/conftool/dbconfig/20240228-161327-arnaudb.json [16:13:59] (03CR) 10Elukey: [C: 03+2] kserve: bump Docker image default versions to 0.11.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007391 (https://phabricator.wikimedia.org/T337213) (owner: 10Elukey) [16:14:28] 06SRE, 10ops-eqiad, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9584315 (10VRiley-WMF) a:03VRiley-WMF [16:17:09] (03CR) 
10Fabfur: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1006976 (https://phabricator.wikimedia.org/T358109) (owner: 10Fabfur) [16:17:35] !log import cas 6.6.12+wmf12u3 to bookworm-wikimedia T357748 [16:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:42] T357748: Migrate CAS to Bookworm - https://phabricator.wikimedia.org/T357748 [16:17:58] !log sudo cumin 'A:dns-rec' "run-puppet-agent --enable 'merging CR 1006955'" [16:18:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:08] (03PS1) 10Cathal Mooney: Disable CR DHCP relay and IPv6 RA generation private1-b-codfw vlan [homer/public] - 10https://gerrit.wikimedia.org/r/1007393 (https://phabricator.wikimedia.org/T355544) [16:18:20] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [16:21:00] (03CR) 10C. Scott Ananian: [C: 03+1] "Seems like after this change we don't need to block the ParserMigration extension on commons/wikidata?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) (owner: 10Daniel Kinzler) [16:21:02] (03PS3) 10RLazarus: admin: Remove the os-installers group [puppet] - 10https://gerrit.wikimedia.org/r/1007023 (https://phabricator.wikimedia.org/T358361) [16:21:32] !log dancy@deploy2002 Finished scap: testing new scap release (duration: 09m 12s) [16:21:57] (03Abandoned) 10Ssingh: depool codfw: emergency depool patch (do not merge unless required) [dns] - 10https://gerrit.wikimedia.org/r/1006929 (owner: 10Ssingh) [16:22:38] (03CR) 10CI reject: [V: 04-1] admin: Remove the os-installers group [puppet] - 10https://gerrit.wikimedia.org/r/1007023 (https://phabricator.wikimedia.org/T358361) (owner: 10RLazarus) [16:22:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P58095 and previous config saved to /var/cache/conftool/dbconfig/20240228-162240-arnaudb.json [16:23:32] (03PS1) 10C. 
Scott Ananian: Enable ParserMigration extension on commons and wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007394 [16:23:35] (03PS4) 10RLazarus: admin: Remove the os-installers group [puppet] - 10https://gerrit.wikimedia.org/r/1007023 (https://phabricator.wikimedia.org/T358361) [16:25:08] !log Disabling IPv6 RAs for private1-b-codfw vlan on codfw CR routers, moving GW to lsw/ssw T355544 [16:25:18] (03CR) 10Cathal Mooney: [C: 03+2] Disable CR DHCP relay and IPv6 RA generation private1-b-codfw vlan [homer/public] - 10https://gerrit.wikimedia.org/r/1007393 (https://phabricator.wikimedia.org/T355544) (owner: 10Cathal Mooney) [16:25:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:21] T355544: Migrate hosts from codfw row A/B ASW to new LSW devices - https://phabricator.wikimedia.org/T355544 [16:25:54] (03Merged) 10jenkins-bot: Disable CR DHCP relay and IPv6 RA generation private1-b-codfw vlan [homer/public] - 10https://gerrit.wikimedia.org/r/1007393 (https://phabricator.wikimedia.org/T355544) (owner: 10Cathal Mooney) [16:26:17] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T352010)', diff saved to https://phabricator.wikimedia.org/P58096 and previous config saved to /var/cache/conftool/dbconfig/20240228-162616-ladsgroup.json [16:26:23] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [16:27:36] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:27:41] (03CR) 10RLazarus: admin: Remove the os-installers group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1007023 (https://phabricator.wikimedia.org/T358361) (owner: 10RLazarus) [16:27:43] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:27:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2110 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58097 
and previous config saved to /var/cache/conftool/dbconfig/20240228-162756-arnaudb.json [16:28:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58098 and previous config saved to /var/cache/conftool/dbconfig/20240228-162806-arnaudb.json [16:28:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58099 and previous config saved to /var/cache/conftool/dbconfig/20240228-162807-arnaudb.json [16:28:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2096 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58100 and previous config saved to /var/cache/conftool/dbconfig/20240228-162816-arnaudb.json [16:28:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58101 and previous config saved to /var/cache/conftool/dbconfig/20240228-162823-arnaudb.json [16:28:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 50%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58102 and previous config saved to /var/cache/conftool/dbconfig/20240228-162832-arnaudb.json [16:28:34] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. 
[16:29:22] (03PS16) 10Ayounsi: Cookbook to renumber a host while changing its vlan [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) [16:31:36] !log jayme@cumin1002 conftool action : set/pooled=yes; selector: name=mw23(2[5-9]|3[0-4]).codfw.wmnet [16:37:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P58103 and previous config saved to /var/cache/conftool/dbconfig/20240228-163747-arnaudb.json [16:39:57] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [16:40:03] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [16:41:21] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, thanks for the cleanup" [puppet] - 10https://gerrit.wikimedia.org/r/1007023 (https://phabricator.wikimedia.org/T358361) (owner: 10RLazarus) [16:41:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P58104 and previous config saved to /var/cache/conftool/dbconfig/20240228-164123-ladsgroup.json [16:43:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2110 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58105 and previous config saved to /var/cache/conftool/dbconfig/20240228-164301-arnaudb.json [16:43:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58106 and previous config saved to /var/cache/conftool/dbconfig/20240228-164310-arnaudb.json [16:43:11] (03PS1) 10Majavah: P:puppetserver: git: use creates for initial deploy-code [puppet] - 10https://gerrit.wikimedia.org/r/1007396 [16:43:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58107 and previous config saved to 
/var/cache/conftool/dbconfig/20240228-164312-arnaudb.json [16:43:18] (03CR) 10RLazarus: [C: 03+2] admin: Remove the os-installers group [puppet] - 10https://gerrit.wikimedia.org/r/1007023 (https://phabricator.wikimedia.org/T358361) (owner: 10RLazarus) [16:43:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2096 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58108 and previous config saved to /var/cache/conftool/dbconfig/20240228-164321-arnaudb.json [16:43:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58109 and previous config saved to /var/cache/conftool/dbconfig/20240228-164327-arnaudb.json [16:43:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 75%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58110 and previous config saved to /var/cache/conftool/dbconfig/20240228-164337-arnaudb.json [16:44:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1169 T354015', diff saved to https://phabricator.wikimedia.org/P58111 and previous config saved to /var/cache/conftool/dbconfig/20240228-164451-root.json [16:45:09] T354015: DBQueryDisconnectedError upon editing en:Template:COVID-19 pandemic data - https://phabricator.wikimedia.org/T354015 [16:45:21] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1169.eqiad.wmnet with reason: Optimize revision table T354015 [16:45:27] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1513/co" [puppet] - 10https://gerrit.wikimedia.org/r/1007396 (owner: 10Majavah) [16:45:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1169.eqiad.wmnet with reason: Optimize revision table T354015 [16:52:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after 
maintenance db2104 (T357189)', diff saved to https://phabricator.wikimedia.org/P58112 and previous config saved to /var/cache/conftool/dbconfig/20240228-165253-arnaudb.json [16:52:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2125.codfw.wmnet with reason: Maintenance [16:53:05] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:53:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2125.codfw.wmnet with reason: Maintenance [16:53:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2125 (T357189)', diff saved to https://phabricator.wikimedia.org/P58113 and previous config saved to /var/cache/conftool/dbconfig/20240228-165315-arnaudb.json [16:56:30] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P58114 and previous config saved to /var/cache/conftool/dbconfig/20240228-165629-ladsgroup.json [16:58:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2110 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58115 and previous config saved to /var/cache/conftool/dbconfig/20240228-165806-arnaudb.json [16:58:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2111 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58116 and previous config saved to /var/cache/conftool/dbconfig/20240228-165815-arnaudb.json [16:58:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2124 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58117 and previous config saved to /var/cache/conftool/dbconfig/20240228-165823-arnaudb.json [16:58:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2096 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58118 and previous config saved to 
/var/cache/conftool/dbconfig/20240228-165832-arnaudb.json [16:58:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2162 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58119 and previous config saved to /var/cache/conftool/dbconfig/20240228-165832-arnaudb.json [16:58:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2161 (re)pooling @ 100%: Maintenance done', diff saved to https://phabricator.wikimedia.org/P58120 and previous config saved to /var/cache/conftool/dbconfig/20240228-165841-arnaudb.json [17:01:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1028 to es3 eqiad master T358180', diff saved to https://phabricator.wikimedia.org/P58121 and previous config saved to /var/cache/conftool/dbconfig/20240228-170134-marostegui.json [17:01:47] T358180: Upgrade es3 to MariaDB 10.6 - https://phabricator.wikimedia.org/T358180 [17:02:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T357189)', diff saved to https://phabricator.wikimedia.org/P58122 and previous config saved to /var/cache/conftool/dbconfig/20240228-170201-arnaudb.json [17:02:15] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [17:03:33] !log running dummy authdns-update [17:03:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:45] (03PS1) 10Elukey: kserve: upgrade all images to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1007400 [17:05:29] (03PS1) 10Marostegui: mariadb: Productionize db2218 [puppet] - 10https://gerrit.wikimedia.org/r/1007401 (https://phabricator.wikimedia.org/T355422) [17:07:40] (03CR) 10Dzahn: [V: 03+2 C: 03+2] delete passwords::racktables [labs/private] - 10https://gerrit.wikimedia.org/r/1007008 (https://phabricator.wikimedia.org/T327405) (owner: 10Dzahn) [17:07:52] (03CR) 10Marostegui: [C: 03+2] mariadb: Productionize db2218 [puppet] - 10https://gerrit.wikimedia.org/r/1007401 
(https://phabricator.wikimedia.org/T355422) (owner: 10Marostegui) [17:11:02] (03PS1) 10Sbailey: wikifeeds: upgrade to node18 from node16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007403 (https://phabricator.wikimedia.org/T358017) [17:11:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T352010)', diff saved to https://phabricator.wikimedia.org/P58123 and previous config saved to /var/cache/conftool/dbconfig/20240228-171136-ladsgroup.json [17:11:38] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [17:11:44] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:11:52] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [17:11:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T352010)', diff saved to https://phabricator.wikimedia.org/P58124 and previous config saved to /var/cache/conftool/dbconfig/20240228-171157-ladsgroup.json [17:13:07] (03PS1) 10Marostegui: instances.yaml: Add db2218 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1007404 (https://phabricator.wikimedia.org/T355422) [17:14:44] (03CR) 10Marostegui: [C: 03+2] instances.yaml: Add db2218 to dbctl [puppet] - 10https://gerrit.wikimedia.org/r/1007404 (https://phabricator.wikimedia.org/T355422) (owner: 10Marostegui) [17:15:18] 06SRE, 10ops-eqiad, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9584611 (10VRiley-WMF) [17:15:57] 06SRE, 10ops-eqiad, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9584625 (10VRiley-WMF) dbprov1005 Rack A2 U 25 CableID 4905 Port 8 dbprov1006 Rack B2 U 24 CableID 4903 Port 17 [17:16:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add db2218 depooled T355422', diff 
saved to https://phabricator.wikimedia.org/P58125 and previous config saved to /var/cache/conftool/dbconfig/20240228-171633-marostegui.json [17:16:51] T355422: Productionize db2196-db2220 - https://phabricator.wikimedia.org/T355422 [17:16:52] !log marostegui@cumin1002 START - Cookbook sre.mysql.clone of db2118.codfw.wmnet onto db2218.codfw.wmnet [17:17:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P58126 and previous config saved to /var/cache/conftool/dbconfig/20240228-171707-arnaudb.json [17:18:29] (03CR) 10Klausman: [C: 03+1] kserve: upgrade all images to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1007400 (owner: 10Elukey) [17:18:40] (03PS1) 10Marostegui: installserver: Do not reimage db2218 [puppet] - 10https://gerrit.wikimedia.org/r/1007405 [17:22:27] (03CR) 10Marostegui: [C: 03+2] installserver: Do not reimage db2218 [puppet] - 10https://gerrit.wikimedia.org/r/1007405 (owner: 10Marostegui) [17:27:20] (03PS2) 10Elukey: kserve: upgrade all images to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1007400 (https://phabricator.wikimedia.org/T337213) [17:27:55] (03PS1) 10Joal: Update analytics sqoop jobs [puppet] - 10https://gerrit.wikimedia.org/r/1007407 [17:32:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P58127 and previous config saved to /var/cache/conftool/dbconfig/20240228-173214-arnaudb.json [17:38:16] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['an-redacteddb1001'] [17:38:40] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['an-redacteddb1001'] [17:42:38] (03CR) 10Jgiannelos: [C: 03+2] wikifeeds: upgrade to node18 from node16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007403 
(https://phabricator.wikimedia.org/T358017) (owner: 10Sbailey) [17:44:17] (03Merged) 10jenkins-bot: wikifeeds: upgrade to node18 from node16 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007403 (https://phabricator.wikimedia.org/T358017) (owner: 10Sbailey) [17:45:06] (03PS17) 10Ayounsi: Cookbook to renumber a host while changing its vlan [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) [17:46:00] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [17:47:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T357189)', diff saved to https://phabricator.wikimedia.org/P58128 and previous config saved to /var/cache/conftool/dbconfig/20240228-174720-arnaudb.json [17:47:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2126.codfw.wmnet with reason: Maintenance [17:47:26] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [17:47:38] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2126.codfw.wmnet with reason: Maintenance [17:47:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [17:47:53] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [17:47:56] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt dbprov1005 - vriley@cumin1002" [17:47:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2126 (T357189)', diff saved to https://phabricator.wikimedia.org/P58129 and previous config saved to /var/cache/conftool/dbconfig/20240228-174759-arnaudb.json [17:48:47] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.dns.netbox: update mgmt dbprov1005 - vriley@cumin1002" [17:48:47] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:49:08] 06SRE, 10ops-codfw, 06DC-Ops, 06serviceops: mw2420-mw2451 do have unnecessary raid controllers (configured) - https://phabricator.wikimedia.org/T358489#9584776 (10JMeybohm) [17:49:32] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host dbprov1005.mgmt.eqiad.wmnet with reboot policy FORCED [17:51:00] 06SRE, 10LDAP-Access-Requests: Grant Access to nda, wmde for Frederik Ring - https://phabricator.wikimedia.org/T358584#9584797 (10KFrancis) Hi All, the NDA is out for signatures. I'll confirm when it's complete. [17:51:40] (03PS18) 10Ayounsi: Cookbook to renumber a host while changing its vlan [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) [17:52:16] !log sbailey@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [17:52:50] !log sbailey@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [17:53:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T357189)', diff saved to https://phabricator.wikimedia.org/P58130 and previous config saved to /var/cache/conftool/dbconfig/20240228-175333-arnaudb.json [17:53:40] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [17:59:25] 06SRE, 10LDAP-Access-Requests: Grant Access to Superset for ifeatu_nnaobi_wmde - https://phabricator.wikimedia.org/T358091#9584834 (10KFrancis) Hi all, the NDA has been sent for signatures. I'll confirm when it's complete. 
[18:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240228T1800) [18:01:40] (03PS1) 10Jgiannelos: Revert "wikifeeds: upgrade to node18 from node16" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007370 [18:01:56] (03CR) 10Jgiannelos: [C: 03+1] Revert "wikifeeds: upgrade to node18 from node16" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007370 (owner: 10Jgiannelos) [18:02:38] (03PS2) 10Dzahn: delete grafana password classes [labs/private] - 10https://gerrit.wikimedia.org/r/1007011 [18:05:37] (03CR) 10Dzahn: [C: 03+1] "still exist in private repo with real passwords and a comment "Deprecated 2017-01-18"" [labs/private] - 10https://gerrit.wikimedia.org/r/1007011 (owner: 10Dzahn) [18:06:01] (03CR) 10Btullis: [C: 03+2] Update analytics sqoop jobs [puppet] - 10https://gerrit.wikimedia.org/r/1007407 (owner: 10Joal) [18:07:02] (03CR) 10Sbailey: [C: 03+1] Revert "wikifeeds: upgrade to node18 from node16" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007370 (owner: 10Jgiannelos) [18:07:44] (03CR) 10Jgiannelos: [C: 03+2] Revert "wikifeeds: upgrade to node18 from node16" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007370 (owner: 10Jgiannelos) [18:07:49] (03CR) 10Klausman: [C: 03+1] kserve: upgrade all images to Bookworm [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1007400 (https://phabricator.wikimedia.org/T337213) (owner: 10Elukey) [18:08:00] (03PS19) 10Ayounsi: Cookbook to renumber a host while changing its vlan [cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) [18:08:04] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [18:08:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P58131 and previous config saved to /var/cache/conftool/dbconfig/20240228-180840-arnaudb.json [18:08:47] (03Merged) 
10jenkins-bot: Revert "wikifeeds: upgrade to node18 from node16" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007370 (owner: 10Jgiannelos) [18:10:40] !log sbailey@deploy2002 helmfile [staging] START helmfile.d/services/wikifeeds: apply [18:10:48] (PuppetFailure) firing: Puppet has failed on install1004:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:10:55] !log sbailey@deploy2002 helmfile [staging] DONE helmfile.d/services/wikifeeds: apply [18:13:01] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt dbprov1006 - vriley@cumin1002" [18:13:54] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt dbprov1006 - vriley@cumin1002" [18:13:54] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:14:59] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host dbprov1006.mgmt.eqiad.wmnet with reboot policy FORCED [18:15:48] (PuppetFailure) firing: (3) Puppet has failed on apt2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:16:10] (03PS6) 10Ssingh: P:dns::auth: add support for depooling services via confd/confctl [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) [18:17:09] (03CR) 10Dzahn: "thanks, yep. 
btw I also deleted the "etherpad" (not etherpad_lite) passwords class in the actually private repo" [labs/private] - 10https://gerrit.wikimedia.org/r/1007331 (https://phabricator.wikimedia.org/T316421) (owner: 10Jelto) [18:17:48] (PuppetFailure) firing: Puppet has failed on cumin2002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:19:37] checks cumin2002 and the puppet error is with the datacenter-ops group somehow? [18:20:01] rzl: ^^^ [18:20:37] thanks, looking [18:20:48] (PuppetFailure) firing: Puppet has failed on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:21:16] (03PS7) 10Ssingh: P:dns::auth: add support for depooling services via confd/confctl [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) [18:21:16] rzl: it's trying to cleanup a user that is already gone or so [18:22:48] (PuppetFailure) firing: (2) Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:23:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P58132 and previous config saved to /var/cache/conftool/dbconfig/20240228-182347-arnaudb.json [18:25:42] (03CR) 10Dzahn: [C: 03+2] "https://upload.wikimedia.org/wikipedia/commons/7/7f/MTMHP_Executive_Summary.pdf works :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007362 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [18:25:48] (PuppetFailure) firing: (5) Puppet has failed on apt2001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:26:44] (03Merged) 10jenkins-bot: miscweb(design-strategy): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1007362 (https://phabricator.wikimedia.org/T344471) (owner: 10DDesouza) [18:26:55] (03CR) 10Ssingh: "dns1004 failure is expected in PCC (and it's a good catch) because dnsbox.yaml is not yet updated. That is very much intentional." [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [18:27:23] (03PS1) 10RLazarus: admin: Remove *sre_admins_members from datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/1007415 (https://phabricator.wikimedia.org/T358361) [18:28:30] (SystemdUnitFailed) firing: (2) netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:28:59] (03CR) 10CI reject: [V: 04-1] admin: Remove *sre_admins_members from datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/1007415 (https://phabricator.wikimedia.org/T358361) (owner: 10RLazarus) [18:29:16] (03PS2) 10RLazarus: admin: Remove *sre_admins_members from datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/1007415 (https://phabricator.wikimedia.org/T358361) [18:29:24] 06SRE, 10ops-eqiad, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9584987 (10VRiley-WMF) [18:30:48] (PuppetFailure) firing: (6) Puppet has failed on apt1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:31:57] mutante, volans: can you stamp https://gerrit.wikimedia.org/r/1007415 real quick? 
[18:33:45] 06SRE, 06Data-Engineering, 06Data-Platform-SRE, 06serviceops, 10Event-Platform: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058#9585000 (10BTullis) I believe that this ticket will be invalidated by the approach that has been tested and agreed upon in {T331894}. There... [18:33:56] (03CR) 10Dzahn: [C: 03+1] admin: Remove *sre_admins_members from datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/1007415 (https://phabricator.wikimedia.org/T358361) (owner: 10RLazarus) [18:34:00] rzl: yea [18:34:28] thanks! [18:34:30] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbprov1005.mgmt.eqiad.wmnet with reboot policy FORCED [18:34:33] (03CR) 10RLazarus: [C: 03+2] admin: Remove *sre_admins_members from datacenter-ops [puppet] - 10https://gerrit.wikimedia.org/r/1007415 (https://phabricator.wikimedia.org/T358361) (owner: 10RLazarus) [18:35:48] (PuppetFailure) firing: (2) Puppet has failed on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:35:48] (PuppetFailure) firing: (7) Puppet has failed on apt1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:37:46] merged, reran puppet on cumin2002, works [18:37:54] thanks mutante and volans for the heads-up, sorry for the noise [18:37:59] rerunning on all the failed hosts now [18:38:30] thx [18:38:53] (03CR) 10Ssingh: "https://puppet-compiler.wmflabs.org/output/980427/1212/dns6001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [18:38:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T357189)', diff saved to https://phabricator.wikimedia.org/P58133 
and previous config saved to /var/cache/conftool/dbconfig/20240228-183853-arnaudb.json [18:38:55] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [18:39:00] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [18:39:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2138.codfw.wmnet with reason: Maintenance [18:39:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2138 (T357189)', diff saved to https://phabricator.wikimedia.org/P58134 and previous config saved to /var/cache/conftool/dbconfig/20240228-183915-arnaudb.json [18:40:49] (PuppetFailure) firing: (7) Puppet has failed on apt1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:40:49] cool [18:41:23] (03CR) 10Ssingh: "The above PCC is the outdated one; the correct one is https://puppet-compiler.wmflabs.org/output/980427/1515/dns6001.wikimedia.org/index.h" [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [18:45:48] (PuppetFailure) firing: (2) Puppet has failed on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:45:53] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T357189)', diff saved to https://phabricator.wikimedia.org/P58135 and previous config saved to /var/cache/conftool/dbconfig/20240228-184552-arnaudb.json [18:46:00] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [18:46:30] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbprov1006.mgmt.eqiad.wmnet with reboot policy FORCED [18:48:52] 
!log dani@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply [18:49:14] !log dani@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [18:49:15] !log dani@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [18:49:44] !log dani@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [18:49:45] !log dani@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply [18:50:09] !log dani@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [18:50:49] (PuppetFailure) firing: (7) Puppet has failed on apt1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [18:51:08] 06SRE, 10ops-eqiad, 06DC-Ops, 06Data-Persistence: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9585085 (10VRiley-WMF) [18:55:48] (PuppetFailure) firing: (7) Puppet has failed on apt1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:00:05] dduvall and jeena: How many deployers does it take to do MediaWiki train - Utc-7 Version deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240228T1900). 
[19:00:49] (PuppetFailure) resolved: (2) Puppet has failed on puppetmaster1001:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:00:53] (PuppetFailure) firing: (6) Puppet has failed on apt1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:00:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P58136 and previous config saved to /var/cache/conftool/dbconfig/20240228-190059-arnaudb.json [19:02:48] (PuppetFailure) resolved: (2) Puppet has failed on cumin1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:04:14] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007421 (https://phabricator.wikimedia.org/T354438) [19:04:16] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007421 (https://phabricator.wikimedia.org/T354438) (owner: 10TrainBranchBot) [19:04:29] (03CR) 10Dzahn: [C: 03+2] "entire module was deleted in https://gerrit.wikimedia.org/r/c/operations/puppet/+/739658 and passwords::tor is also gone and deleted in pr" [labs/private] - 10https://gerrit.wikimedia.org/r/1007010 (owner: 10Dzahn) [19:05:04] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.20 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1007421 (https://phabricator.wikimedia.org/T354438) (owner: 10TrainBranchBot) [19:05:49] (PuppetFailure) firing: (6) Puppet has failed on apt1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:08:12] (03PS8) 10Ssingh: P:dns::auth: add support for depooling services via confd/confctl [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) [19:09:28] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:09:58] (03CR) 10Ssingh: "Wrapped service depooling in the confd_enabled conditional so that it's an actual NOOP on dns1004." [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:10:24] (03PS2) 10Dzahn: delete passwords::tendril and passwords::bugzilla [labs/private] - 10https://gerrit.wikimedia.org/r/1007009 [19:11:25] (03CR) 10Dzahn: [C: 03+2] "service gone and passwords don't exist anymore in the private repo" [labs/private] - 10https://gerrit.wikimedia.org/r/1007009 (owner: 10Dzahn) [19:11:30] (03CR) 10Dzahn: [V: 03+2 C: 03+2] delete passwords::tendril and passwords::bugzilla [labs/private] - 10https://gerrit.wikimedia.org/r/1007009 (owner: 10Dzahn) [19:13:45] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/output/1007017/1517/contint1003.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/1007017 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [19:14:18] !log dduvall@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.20 refs T354438 [19:14:31] T354438: 1.42.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T354438 [19:14:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.mysql.clone (exit_code=0) of db2118.codfw.wmnet onto db2218.codfw.wmnet [19:15:49] (PuppetFailure) resolved: (3) Puppet has failed on apt1002:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - 
https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:15:51] (03CR) 10Dzahn: "Looking at the compiler output.. a lot of monitoring is added by this role. Given this is a test host might be worth it to first add a par" [puppet] - 10https://gerrit.wikimedia.org/r/1007017 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [19:16:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138', diff saved to https://phabricator.wikimedia.org/P58138 and previous config saved to /var/cache/conftool/dbconfig/20240228-191605-arnaudb.json [19:18:02] 06SRE, 10Continuous-Integration-Infrastructure, 06collaboration-services, 10vm-requests, 13Patch-For-Review: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9585216 (10Dzahn) The VM has been created and releng-roots already have shell access. The "ci" prod role is not applied... [19:22:55] !log dduvall@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.20 refs T354438 (duration: 08m 37s) [19:23:02] T354438: 1.42.0-wmf.20 deployment blockers - https://phabricator.wikimedia.org/T354438 [19:23:41] (03PS9) 10Ssingh: P:dns::auth: add support for depooling services via confd/confctl [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) [19:24:53] (03CR) 10CI reject: [V: 04-1] P:dns::auth: add support for depooling services via confd/confctl [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:24:58] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:28:04] (03PS1) 10Dzahn: contint: move hiera hosts file to correct hostname for test host [puppet] - 
10https://gerrit.wikimedia.org/r/1007426 (https://phabricator.wikimedia.org/T358237) [19:28:36] (03CR) 10Dzahn: [C: 03+2] "This also gives shell access to releng-roots who have the same on existing contint*" [puppet] - 10https://gerrit.wikimedia.org/r/1007426 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [19:31:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138 (T357189)', diff saved to https://phabricator.wikimedia.org/P58139 and previous config saved to /var/cache/conftool/dbconfig/20240228-193111-arnaudb.json [19:31:14] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [19:31:27] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [19:31:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2148.codfw.wmnet with reason: Maintenance [19:31:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T357189)', diff saved to https://phabricator.wikimedia.org/P58140 and previous config saved to /var/cache/conftool/dbconfig/20240228-193133-arnaudb.json [19:31:48] (03PS10) 10Ssingh: P:dns::auth: add support for depooling services via confd/confctl [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) [19:33:02] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:33:34] 06SRE, 10ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9585282 (10phaultfinder) [19:34:13] (03PS1) 10Dzahn: hieradata: delete hosts/contint2001 [puppet] - 10https://gerrit.wikimedia.org/r/1007427 [19:35:24] (03PS2) 10Dzahn: hieradata: delete hosts/contint2001 [puppet] - 
10https://gerrit.wikimedia.org/r/1007427 (https://phabricator.wikimedia.org/T342017) [19:35:32] (03CR) 10Dzahn: [C: 03+2] "T342017" [puppet] - 10https://gerrit.wikimedia.org/r/1007427 (https://phabricator.wikimedia.org/T342017) (owner: 10Dzahn) [19:38:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T357189)', diff saved to https://phabricator.wikimedia.org/P58141 and previous config saved to /var/cache/conftool/dbconfig/20240228-193854-arnaudb.json [19:39:01] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [19:39:03] (03PS1) 10Dzahn: contint: ensure zuul-merger is disabled on test host initially [puppet] - 10https://gerrit.wikimedia.org/r/1007428 (https://phabricator.wikimedia.org/T358237) [19:41:06] (03CR) 10Dzahn: [C: 03+2] contint: ensure zuul-merger is disabled on test host initially [puppet] - 10https://gerrit.wikimedia.org/r/1007428 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [19:47:59] (03PS11) 10Ssingh: P:dns::auth: add support for depooling services via confd/confctl [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) [19:49:07] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:50:55] (03CR) 10Ssingh: [V: 03+1] "https://puppet-compiler.wmflabs.org/output/980427/1521/dns6001.wikimedia.org/fulldiff.html watch_keys and gets is more restricted now, giv" [puppet] - 10https://gerrit.wikimedia.org/r/980427 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:52:09] (03Abandoned) 10Cathal Mooney: Add ferm rule to mark all server traffic as DSCP 0 [puppet] - 10https://gerrit.wikimedia.org/r/931263 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [19:54:01] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P58142 and previous config saved to /var/cache/conftool/dbconfig/20240228-195400-arnaudb.json [20:07:17] (03PS1) 10Dzahn: contint: allow data rsyncing to contint1003 [puppet] - 10https://gerrit.wikimedia.org/r/1007433 (https://phabricator.wikimedia.org/T358237) [20:07:50] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T352010)', diff saved to https://phabricator.wikimedia.org/P58143 and previous config saved to /var/cache/conftool/dbconfig/20240228-200748-ladsgroup.json [20:08:10] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:09:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P58144 and previous config saved to /var/cache/conftool/dbconfig/20240228-200906-arnaudb.json [20:14:08] 06SRE, 10ops-eqiad, 06DC-Ops, 06Traffic: Decommission task for old cp hosts (cp1075-1090) - https://phabricator.wikimedia.org/T352253#9585418 (10dr0ptp4kt) @bking , @RKemper , and I met today. @bking has an action on this here ticket (@bking LMK in case I need to chime in on anything!). Thanks! [20:15:42] (03PS1) 10Dzahn: contint: create ci_test role for zuul-only and apply on contint1003 [puppet] - 10https://gerrit.wikimedia.org/r/1007434 (https://phabricator.wikimedia.org/T358237) [20:16:12] 06SRE, 10ops-eqiad, 06DC-Ops, 06Traffic: Decommission task for old cp hosts (cp1075-1090) - https://phabricator.wikimedia.org/T352253#9585422 (10bking) @wiki_willy I'm going to take over this work from @dr0ptp4kt . I'll make a phab task with the data you requested shortly. 
[20:22:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P58145 and previous config saved to /var/cache/conftool/dbconfig/20240228-202256-ladsgroup.json [20:24:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T357189)', diff saved to https://phabricator.wikimedia.org/P58146 and previous config saved to /var/cache/conftool/dbconfig/20240228-202413-arnaudb.json [20:24:16] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance [20:24:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2175.codfw.wmnet with reason: Maintenance [20:24:30] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [20:24:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T357189)', diff saved to https://phabricator.wikimedia.org/P58147 and previous config saved to /var/cache/conftool/dbconfig/20240228-202435-arnaudb.json [20:32:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T357189)', diff saved to https://phabricator.wikimedia.org/P58148 and previous config saved to /var/cache/conftool/dbconfig/20240228-203241-arnaudb.json [20:32:48] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [20:33:07] 06SRE, 10SRE-swift-storage, 06Data-Persistence, 10Thumbor, and 5 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9585432 (10JScherer-WMF) I spoke with @DTorsani-WMF, @RHo and others on the WMF design team on this. Assuming the change only affects thumbnails that are... 
[20:38:03] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P58149 and previous config saved to /var/cache/conftool/dbconfig/20240228-203802-ladsgroup.json [21:18:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T357189)', diff saved to https://phabricator.wikimedia.org/P58153 and previous config saved to /var/cache/conftool/dbconfig/20240228-211801-arnaudb.json [21:18:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2189.codfw.wmnet with reason: Maintenance [21:18:08] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [21:18:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2189.codfw.wmnet with reason: Maintenance [21:18:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T357189)', diff saved to https://phabricator.wikimedia.org/P58154 and previous config saved to /var/cache/conftool/dbconfig/20240228-211823-arnaudb.json [21:34:17] (03PS1) 10Cathal Mooney: WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) [21:35:27] (03CR) 10CI reject: [V: 04-1] WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [21:39:00] (03PS2) 10Cathal Mooney: WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) [21:40:18] (03CR) 10CI reject: [V: 04-1] WIP: Add DSCP marking options to current firewall classes [puppet] - 10https://gerrit.wikimedia.org/r/1007437 (https://phabricator.wikimedia.org/T339850) (owner: 10Cathal Mooney) [21:58:10] 10ops-eqiad, 06Data-Platform-SRE, 10Wikidata-Query-Service: Reclaim 
recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727 (10bking) [21:59:58] 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 06DC-Ops: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9585646 (10ssingh) Hi folks: Just wondering if there is a path forward on this task as we hit the same issue last week while reimaging cp4052. No PXE... [22:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240228T2200) [22:00:10] 10ops-eqiad, 06Data-Platform-SRE, 10Wikidata-Query-Service: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9585647 (10bking) [22:00:28] 10SRE-swift-storage, 10ops-codfw, 10ops-eqiad, 06DC-Ops, 06Traffic: Reimage cookbook on new eqiad hosts stuck at PXE booting - https://phabricator.wikimedia.org/T350179#9585648 (10ssingh) [22:02:10] 10ops-eqiad, 06Data-Platform-SRE, 10Wikidata-Query-Service: Reclaim recently-decommed CP host for WDQS (see T352253) - https://phabricator.wikimedia.org/T358727#9585651 (10wiki_willy) ++ @VRiley-WMF and @Jclark-ctr - can one of you pick up this request? We'll be repurposing one of the previously decommissio... 
[22:08:59] (03PS1) 10Ahmon Dancy: scap.cfg.erb: Set require_security_patches: False in beta [puppet] - 10https://gerrit.wikimedia.org/r/1007441 (https://phabricator.wikimedia.org/T350070) [22:13:57] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic2043*,2044*,2079*,2080* for switch maintenance - bking@cumin2002 - T355872 [22:14:00] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic2043*,2044*,2079*,2080* for switch maintenance - bking@cumin2002 - T355872 [22:14:04] T355872: Migrate servers in codfw rack B7 from asw-b7-codfw to lsw1-b7-codfw - https://phabricator.wikimedia.org/T355872 [22:14:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T357189)', diff saved to https://phabricator.wikimedia.org/P58155 and previous config saved to /var/cache/conftool/dbconfig/20240228-221456-arnaudb.json [22:15:05] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [22:20:33] (03CR) 10Jdlrobson: [C: 04-1] "I'd suggest scoping this to a page that uses the CSS rule more frequently. 
On this page $('[style*="background"]').length has only 38 hits" [mediawiki-config] - https://gerrit.wikimedia.org/r/1007352 (https://phabricator.wikimedia.org/T358240) (owner: Mabualruz)
[22:30:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P58156 and previous config saved to /var/cache/conftool/dbconfig/20240228-223002-arnaudb.json
[22:30:12] (PS1) Bking: site.pp: fix wdqs duplicate role assignment [puppet] - https://gerrit.wikimedia.org/r/1007443 (https://phabricator.wikimedia.org/T342660)
[22:30:40] (PS2) Bking: site.pp: fix wdqs duplicate role assignment [puppet] - https://gerrit.wikimedia.org/r/1007443 (https://phabricator.wikimedia.org/T342660)
[22:32:37] (CR) Ryan Kemper: [C: +1] site.pp: fix wdqs duplicate role assignment [puppet] - https://gerrit.wikimedia.org/r/1007443 (https://phabricator.wikimedia.org/T342660) (owner: Bking)
[22:32:41] (PS1) Andrew Bogott: wmcs-puppetcertleaks: Use puppet7 syntax [puppet] - https://gerrit.wikimedia.org/r/1007444 (https://phabricator.wikimedia.org/T351455)
[22:32:44] (PS1) Andrew Bogott: wmf_sink: Use puppet7 syntax [puppet] - https://gerrit.wikimedia.org/r/1007445 (https://phabricator.wikimedia.org/T351455)
[22:32:52] (CR) Bking: [C: +2] site.pp: fix wdqs duplicate role assignment [puppet] - https://gerrit.wikimedia.org/r/1007443 (https://phabricator.wikimedia.org/T342660) (owner: Bking)
[22:45:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P58157 and previous config saved to /var/cache/conftool/dbconfig/20240228-224508-arnaudb.json
[22:52:43] (PS1) Ahmon Dancy: logstash_checker.py: Handle missing mediawiki_deployments_file [puppet] - https://gerrit.wikimedia.org/r/1007449 (https://phabricator.wikimedia.org/T357402)
[22:54:38] (CR) Ahmon Dancy: [C: +1] logstash_checker.py: Handle missing mediawiki_deployments_file [puppet] - https://gerrit.wikimedia.org/r/1007449 (https://phabricator.wikimedia.org/T357402) (owner: Ahmon Dancy)
[22:54:43] (CR) Ahmon Dancy: [C: +1] scap.cfg.erb: Set require_security_patches: False in beta [puppet] - https://gerrit.wikimedia.org/r/1007441 (https://phabricator.wikimedia.org/T350070) (owner: Ahmon Dancy)
[23:00:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T357189)', diff saved to https://phabricator.wikimedia.org/P58158 and previous config saved to /var/cache/conftool/dbconfig/20240228-230015-arnaudb.json
[23:00:31] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189
[23:08:01] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - AS6939/IPv4: Idle - HE, AS6939/IPv6: Idle - HE https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:18:20] SRE, ops-codfw: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9585781 (wiki_willy) Hey @Volans - much appreciated for your feedback and for the suggestions. I was wondering since the physical serial number listed on the chassis doesn't change (it's only fro...
[23:27:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[23:27:55] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[23:28:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T352010)', diff saved to https://phabricator.wikimedia.org/P58159 and previous config saved to /var/cache/conftool/dbconfig/20240228-232800-ladsgroup.json
[23:28:07] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[23:28:39] (CirrusSearchNodeIndexingNotIncreasing) firing: Elasticsearch instance elastic2043-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[23:29:36] (PS2) Dzahn: contint: create ci_test role for zuul-only and apply on contint1003 [puppet] - https://gerrit.wikimedia.org/r/1007434 (https://phabricator.wikimedia.org/T358237)