[00:01:14] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1031596 (owner: TrainBranchBot)
[00:01:42] (CR) Pppery: [C:+1] "Follow up at https://gitlab.wikimedia.org/repos/phabricator/arcanist/-/merge_requests/2" [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1028887 (https://phabricator.wikimedia.org/T364426) (owner: Aklapper)
[00:05:25] FIRING: SystemdUnitFailed: ncmonitor.service on ncmonitor1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:05:49] ^ignore
[00:33:01] (PS1) Pppery: Re-extract i18n to pick up latest changes [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1032094 (https://phabricator.wikimedia.org/T363188)
[00:33:02] FIRING: [3x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:12:09] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main2009.codfw.wmnet with OS bullseye
[01:12:16] ops-codfw, SRE, DC-Ops, serviceops: Q4:rack/setup/install kafka-main200[6789] & kafka-main2010 - https://phabricator.wikimedia.org/T363209#9802674 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host kafka-main2009.codfw.wmnet with OS bullseye executed...
[01:16:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T352010)', diff saved to https://phabricator.wikimedia.org/P62430 and previous config saved to /var/cache/conftool/dbconfig/20240516-011613-ladsgroup.json
[01:16:18] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[01:31:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P62431 and previous config saved to /var/cache/conftool/dbconfig/20240516-013122-ladsgroup.json
[01:43:47] (CR) Eevans: [C:-1] "I think that was —implicitly— my intention (I was basically proposing a role name and permissions, and had elided the rest). But now that" [puppet] - https://gerrit.wikimedia.org/r/1032034 (https://phabricator.wikimedia.org/T364921) (owner: Eevans)
[01:45:53] (CR) Eevans: [C:-1] cassandra: add data_gateway Cassandra role (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1032034 (https://phabricator.wikimedia.org/T364921) (owner: Eevans)
[01:46:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194', diff saved to https://phabricator.wikimedia.org/P62432 and previous config saved to /var/cache/conftool/dbconfig/20240516-014630-ladsgroup.json
[01:50:22] (PS1) BCornwall: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1032101
[01:51:15] (Abandoned) BCornwall: Automated MarkMonitor domain sync [puppet] - https://gerrit.wikimedia.org/r/1032101 (owner: BCornwall)
[02:01:38] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1194 (T352010)', diff saved to https://phabricator.wikimedia.org/P62433 and previous config saved to /var/cache/conftool/dbconfig/20240516-020137-ladsgroup.json
[02:01:40] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[02:01:43]
T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[02:01:53] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1202.eqiad.wmnet with reason: Maintenance
[02:02:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1202 (T352010)', diff saved to https://phabricator.wikimedia.org/P62434 and previous config saved to /var/cache/conftool/dbconfig/20240516-020200-ladsgroup.json
[02:03:02] FIRING: [3x] SystemdUnitFailed: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:04:18] (PS1) BCornwall: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1032103
[02:05:12] (CR) CI reject: [V:-1] Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1032103 (owner: BCornwall)
[02:07:22] (Abandoned) BCornwall: Automated MarkMonitor domain sync [dns] - https://gerrit.wikimedia.org/r/1032103 (owner: BCornwall)
[02:18:02] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:35:23] (PS1) CDobbins: purged: add Puppet overrides to use cfssl for certs in ulsfo [puppet] - https://gerrit.wikimedia.org/r/1032106 (https://phabricator.wikimedia.org/T360506)
[02:38:02] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:58:04] FIRING: [2x] HelmReleaseBadStatus: Helm release
datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[03:03:02] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:03:02] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:08:02] FIRING: [3x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:10:39] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:35:29] (CR) RLazarus: [C:+2] "Thanks!"
[puppet] - https://gerrit.wikimedia.org/r/1031505 (https://phabricator.wikimedia.org/T364880) (owner: Lucas Werkmeister (WMDE))
[03:40:49] (CR) Dzahn: ":) thanks" [puppet] - https://gerrit.wikimedia.org/r/1031505 (https://phabricator.wikimedia.org/T364880) (owner: Lucas Werkmeister (WMDE))
[03:52:31] PROBLEM - carbon-cache write error on graphite1005 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [8.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=30
[04:00:39] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:03:02] FIRING: [2x] SystemdUnitFailed: docker-reporter-k8s-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:19:31] RECOVERY - carbon-cache write error on graphite1005 is OK: OK: Less than 80.00% above the threshold [1.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=30
[04:58:23] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s6 T364523
[04:58:28] T364523: Switchover s6 master (db1173 -> db1231) - https://phabricator.wikimedia.org/T364523
[04:58:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db1231 with weight 0 T364523', diff saved to https://phabricator.wikimedia.org/P62435 and previous config saved to /var/cache/conftool/dbconfig/20240516-045831-marostegui.json
[04:58:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason:
Primary switchover s6 T364523
[04:59:26] (CR) Marostegui: [C:+2] mariadb: Promote db1231 to s6 master [puppet] - https://gerrit.wikimedia.org/r/1028935 (https://phabricator.wikimedia.org/T364523) (owner: Gerrit maintenance bot)
[05:03:57] (CR) Marostegui: "Yes, let's drop this in production first." [puppet] - https://gerrit.wikimedia.org/r/1031608 (https://phabricator.wikimedia.org/T364435) (owner: Zabe)
[05:08:42] (PS271) Marostegui: mariadb: cookbook draft to clone multiinstance [cookbooks] - https://gerrit.wikimedia.org/r/976709 (https://phabricator.wikimedia.org/T343674) (owner: Arnaudb)
[05:17:28] !log Starting s6 eqiad failover from db1173 to db1231 - T364523
[05:17:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:33] T364523: Switchover s6 master (db1173 -> db1231) - https://phabricator.wikimedia.org/T364523
[05:17:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set s6 eqiad as read-only for maintenance - T364523', diff saved to https://phabricator.wikimedia.org/P62436 and previous config saved to /var/cache/conftool/dbconfig/20240516-051746-marostegui.json
[05:17:49] marostegui@cumin1002: Failed to log message to wiki. Somebody should check the error logs.
[05:18:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db1231 to s6 primary and set section read-write T364523', diff saved to https://phabricator.wikimedia.org/P62437 and previous config saved to /var/cache/conftool/dbconfig/20240516-051808-marostegui.json
[05:18:38] (CR) Marostegui: [C:+2] wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1028936 (https://phabricator.wikimedia.org/T364523) (owner: Gerrit maintenance bot)
[05:18:42] (PS2) Gerrit maintenance bot: wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1028936 (https://phabricator.wikimedia.org/T364523)
[05:18:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1173 T364523', diff saved to Unable to send diff to phaste and previous config saved to /var/cache/conftool/dbconfig/20240516-051853-root.json
[05:19:14] (CR) Marostegui: [V:+2 C:+2] wmnet: Update s6-master alias [dns] - https://gerrit.wikimedia.org/r/1028936 (https://phabricator.wikimedia.org/T364523) (owner: Gerrit maintenance bot)
[05:22:38] (PS1) Marostegui: db1173: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1032119
[05:23:13] !log Deploy schema change dbmaint db1173 eqiad s6 T355609
[05:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:23:18] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[05:23:36] (CR) Marostegui: [C:+2] db1173: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1032119 (owner: Marostegui)
[05:27:11] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[05:27:13] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[05:33:02] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:35:36] (PS1) Marostegui: es4 mariadb: Make the hosts standalone [puppet] - https://gerrit.wikimedia.org/r/1032123 (https://phabricator.wikimedia.org/T364447)
[05:36:00] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 6 hosts with reason: Making es4 standalone T364447
[05:36:04] T364447: Make es4 and es5 RO - https://phabricator.wikimedia.org/T364447
[05:36:17] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 6 hosts with reason: Making es4 standalone T364447
[05:37:27] (PS2) Marostegui: es4 mariadb: Make the hosts standalone [puppet] - https://gerrit.wikimedia.org/r/1032123 (https://phabricator.wikimedia.org/T364447)
[05:37:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Increase es1021 weight', diff saved to https://phabricator.wikimedia.org/P62439 and previous config saved to /var/cache/conftool/dbconfig/20240516-053746-marostegui.json
[05:41:58] (CR) Marostegui: [C:+2] es4 mariadb: Make the hosts standalone [puppet] - https://gerrit.wikimedia.org/r/1032123 (https://phabricator.wikimedia.org/T364447) (owner: Marostegui)
[05:43:11] !log Make es4 standalone and disconnect replication T364447
[05:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:43:18] T364447: Make es4 and es5 RO - https://phabricator.wikimedia.org/T364447
[05:52:40] (PS1) Marostegui: site.pp: Reorganize es4 definitions [puppet] - https://gerrit.wikimedia.org/r/1032124
[05:53:22] (CR) Marostegui: [C:+2] site.pp: Reorganize es4 definitions [puppet] - https://gerrit.wikimedia.org/r/1032124 (owner: Marostegui)
[05:58:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 10%: After schema change', diff saved to
https://phabricator.wikimedia.org/P62440 and previous config saved to /var/cache/conftool/dbconfig/20240516-055759-root.json
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240516T0600)
[06:00:05] kormat, marostegui, Amir1, and arnaudb: Time to do the Primary database switchover deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240516T0600).
[06:03:02] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:05:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1020 to es4 primary master T364816', diff saved to https://phabricator.wikimedia.org/P62441 and previous config saved to /var/cache/conftool/dbconfig/20240516-060532-marostegui.json
[06:05:42] T364816: Switchover es4 master (es1021 -> es1020) - https://phabricator.wikimedia.org/T364816
[06:10:02] (PS1) Mabualruz: Correct behaviour of ConfigHelper, add tests [skins/Vector] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1032126 (https://phabricator.wikimedia.org/T365084)
[06:10:15] (PS1) Marostegui: Revert "db1173: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1032127
[06:10:43] (CR) Marostegui: [C:+2] Revert "db1173: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1032127 (owner: Marostegui)
[06:13:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: After schema change', diff saved to https://phabricator.wikimedia.org/P62442 and previous config saved to /var/cache/conftool/dbconfig/20240516-061306-root.json
[06:16:11] (PS1) Marostegui: es5: Make hosts standalone [puppet] - https://gerrit.wikimedia.org/r/1032147
(https://phabricator.wikimedia.org/T364447)
[06:16:12] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: Making es5 standalone T364447
[06:16:16] T364447: Make es4 and es5 RO - https://phabricator.wikimedia.org/T364447
[06:16:29] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Making es5 standalone T364447
[06:17:19] (CR) Marostegui: [C:+2] es5: Make hosts standalone [puppet] - https://gerrit.wikimedia.org/r/1032147 (https://phabricator.wikimedia.org/T364447) (owner: Marostegui)
[06:18:13] !log Make es5 standalone and disconnect replication T364447
[06:18:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:13] (PS1) Marostegui: es1025: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1032148 (https://phabricator.wikimedia.org/T364447)
[06:20:44] (CR) Marostegui: [C:+2] es1025: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1032148 (https://phabricator.wikimedia.org/T364447) (owner: Marostegui)
[06:23:38] (Abandoned) Mabualruz: Correct behaviour of ConfigHelper, add tests [skins/Vector] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1032126 (https://phabricator.wikimedia.org/T365084) (owner: Mabualruz)
[06:24:24] (PS1) Mabualruz: Correct behaviour of ConfigHelper, add tests [skins/Vector] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1032128 (https://phabricator.wikimedia.org/T365084)
[06:25:27] (PS1) Marostegui: site.pp: Reorganize es5 definitions [puppet] - https://gerrit.wikimedia.org/r/1032150 (https://phabricator.wikimedia.org/T364447)
[06:26:56] (PS1) Marostegui: wmnet: Update es4-master [dns] - https://gerrit.wikimedia.org/r/1032151 (https://phabricator.wikimedia.org/T365094)
[06:27:44] (PS2) Marostegui: wmnet: Update es4-master [dns] - https://gerrit.wikimedia.org/r/1032151
(https://phabricator.wikimedia.org/T364816)
[06:28:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 50%: After schema change', diff saved to https://phabricator.wikimedia.org/P62443 and previous config saved to /var/cache/conftool/dbconfig/20240516-062812-root.json
[06:28:47] (CR) Marostegui: [C:+2] wmnet: Update es4-master [dns] - https://gerrit.wikimedia.org/r/1032151 (https://phabricator.wikimedia.org/T364816) (owner: Marostegui)
[06:29:01] (CR) Marostegui: [C:+2] "This is really a NOOP" [dns] - https://gerrit.wikimedia.org/r/1032151 (https://phabricator.wikimedia.org/T364816) (owner: Marostegui)
[06:29:46] (CR) Marostegui: [C:+2] site.pp: Reorganize es5 definitions [puppet] - https://gerrit.wikimedia.org/r/1032150 (https://phabricator.wikimedia.org/T364447) (owner: Marostegui)
[06:33:46] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: Making es5 standalone T364447
[06:33:47] (CR) Mabualruz: [C:+1] Correct behaviour of ConfigHelper, add tests [skins/Vector] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1032128 (https://phabricator.wikimedia.org/T365084) (owner: Mabualruz)
[06:33:51] T364447: Make es4 and es5 RO - https://phabricator.wikimedia.org/T364447
[06:33:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Making es5 standalone T364447
[06:34:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: Making es4 standalone T364447
[06:34:14] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: Making es4 standalone T364447
[06:35:57] (Restored) Mabualruz: Correct behaviour of ConfigHelper, add tests [skins/Vector] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1032126 (https://phabricator.wikimedia.org/T365084) (owner: Mabualruz)
[06:36:06] (CR)
Mabualruz: [C:+1] Correct behaviour of ConfigHelper, add tests [skins/Vector] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1032126 (https://phabricator.wikimedia.org/T365084) (owner: Mabualruz)
[06:36:26] PROBLEM - MariaDB read only es4 on es2020 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.6.17-MariaDB-log, Uptime 601515s, event_scheduler: True, 244.29 QPS, connection latency: 0.026578s, query latency: 0.000689s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[06:36:26] PROBLEM - MariaDB read only es4 on es2022 is CRITICAL: CRIT: read_only: True, expected False: OK: Version 10.6.17-MariaDB-log, Uptime 691291s, event_scheduler: True, 154.27 QPS, connection latency: 0.035276s, query latency: 0.000675s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[06:36:53] I am investigating that
[06:36:58] It must be a puppet issue
[06:43:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: After schema change', diff saved to https://phabricator.wikimedia.org/P62444 and previous config saved to /var/cache/conftool/dbconfig/20240516-064317-root.json
[06:44:13] Morning o/
[06:57:21] (PS1) Marostegui: mariadb.yaml: Add es4 and es5 [puppet] - https://gerrit.wikimedia.org/r/1032290 (https://phabricator.wikimedia.org/T364447)
[06:58:04] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[06:58:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: After schema change', diff saved to https://phabricator.wikimedia.org/P62445 and previous config saved to /var/cache/conftool/dbconfig/20240516-065823-root.json
[07:00:05] Amir1 and Urbanecm: gettimeofday() says it's time for UTC morning
backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240516T0700)
[07:00:05] irc-mo_abualruz: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:44] morning I am here
[07:03:02] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:05:08] (CR) JMeybohm: [V:+2 C:+2] "Thanks!" [software/envoyproxy/ratelimiter] - https://gerrit.wikimedia.org/r/1029205 (https://phabricator.wikimedia.org/T362310) (owner: JMeybohm)
[07:07:08] (CR) TrainBranchBot: [C:+2] "Approved by mabualruz@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1032126 (https://phabricator.wikimedia.org/T365084) (owner: Mabualruz)
[07:17:42] (PS1) JMeybohm: ratelimit: Add CertProvider to hot reload TLS certs for gRPC service [docker-images/production-images] - https://gerrit.wikimedia.org/r/1032293 (https://phabricator.wikimedia.org/T362310)
[07:19:00] (CR) JMeybohm: [V:+2 C:+2] ratelimit: Add CertProvider to hot reload TLS certs for gRPC service [docker-images/production-images] - https://gerrit.wikimedia.org/r/1032293 (https://phabricator.wikimedia.org/T362310) (owner: JMeybohm)
[07:19:40] FIRING: [2x] KubernetesAPINotScrapable: k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[07:23:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[07:23:48] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for
12:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[07:23:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T360332)', diff saved to https://phabricator.wikimedia.org/P62446 and previous config saved to /var/cache/conftool/dbconfig/20240516-072355-arnaudb.json
[07:23:59] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332
[07:25:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T364299)', diff saved to https://phabricator.wikimedia.org/P62447 and previous config saved to /var/cache/conftool/dbconfig/20240516-072521-marostegui.json
[07:25:27] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[07:25:43] (Merged) jenkins-bot: Correct behaviour of ConfigHelper, add tests [skins/Vector] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1032126 (https://phabricator.wikimedia.org/T365084) (owner: Mabualruz)
[07:26:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T360332)', diff saved to https://phabricator.wikimedia.org/P62448 and previous config saved to /var/cache/conftool/dbconfig/20240516-072614-arnaudb.json
[07:26:39] (PS3) Slyngshede: P:ganeti Prometheus monitoring of ganeti noded services.
[puppet] - https://gerrit.wikimedia.org/r/1031834 (https://phabricator.wikimedia.org/T350694)
[07:26:40] !log mabualruz@deploy1002 Started scap: Backport for [[gerrit:1032126|Correct behaviour of ConfigHelper, add tests (T365084)]]
[07:26:43] T365084: Night mode exclude list doesn't appear to be working with various pages (including Special:AbuseLog or diff pages) - https://phabricator.wikimedia.org/T365084
[07:28:13] (CR) Marostegui: [C:+2] mariadb.yaml: Add es4 and es5 [puppet] - https://gerrit.wikimedia.org/r/1032290 (https://phabricator.wikimedia.org/T364447) (owner: Marostegui)
[07:29:59] !log mabualruz@deploy1002 mabualruz: Backport for [[gerrit:1032126|Correct behaviour of ConfigHelper, add tests (T365084)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:30:02] !log mabualruz@deploy1002 mabualruz: Continuing with sync
[07:30:13] (CR) Slyngshede: P:ganeti Prometheus monitoring of ganeti noded services. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1031834 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[07:35:48] RECOVERY - MariaDB read only es4 on es2020 is OK: Version 10.6.17-MariaDB-log, Uptime 605077s, read_only: True, event_scheduler: True, 164.65 QPS, connection latency: 0.029290s, query latency: 0.000642s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[07:36:50] RECOVERY - MariaDB read only es4 on es2022 is OK: Version 10.6.17-MariaDB-log, Uptime 694914s, read_only: True, event_scheduler: True, 162.58 QPS, connection latency: 0.028151s, query latency: 0.000638s https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Master_comes_back_in_read_only
[07:37:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote es1025 to es5 primary master T365094', diff saved to https://phabricator.wikimedia.org/P62449 and previous config saved to /var/cache/conftool/dbconfig/20240516-073719-marostegui.json
[07:37:25] T365094: Switchover
es5 master (es1024 -> es1025) - https://phabricator.wikimedia.org/T365094
[07:37:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Increase es1024 weight', diff saved to https://phabricator.wikimedia.org/P62450 and previous config saved to /var/cache/conftool/dbconfig/20240516-073750-marostegui.json
[07:40:13] (PS1) Marostegui: wmnet: Update es5-master [dns] - https://gerrit.wikimedia.org/r/1032384 (https://phabricator.wikimedia.org/T365094)
[07:40:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P62451 and previous config saved to /var/cache/conftool/dbconfig/20240516-074030-marostegui.json
[07:41:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P62452 and previous config saved to /var/cache/conftool/dbconfig/20240516-074121-arnaudb.json
[07:41:51] (CR) Marostegui: [C:+2] wmnet: Update es5-master [dns] - https://gerrit.wikimedia.org/r/1032384 (https://phabricator.wikimedia.org/T365094) (owner: Marostegui)
[07:44:11] !log mabualruz@deploy1002 Finished scap: Backport for [[gerrit:1032126|Correct behaviour of ConfigHelper, add tests (T365084)]] (duration: 17m 31s)
[07:44:15] T365084: Night mode exclude list doesn't appear to be working with various pages (including Special:AbuseLog or diff pages) - https://phabricator.wikimedia.org/T365084
[07:44:24] (CR) TrainBranchBot: [C:+2] "Approved by mabualruz@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1032128 (https://phabricator.wikimedia.org/T365084) (owner: Mabualruz)
[07:46:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Increase es1024 weight', diff saved to https://phabricator.wikimedia.org/P62453 and previous config saved to /var/cache/conftool/dbconfig/20240516-074625-marostegui.json
[07:48:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool
es1021 T364289', diff saved to https://phabricator.wikimedia.org/P62454 and previous config saved to /var/cache/conftool/dbconfig/20240516-074837-root.json
[07:48:41] T364289: Reimage external store hosts with Bookworm - https://phabricator.wikimedia.org/T364289
[07:48:56] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s2 T364814
[07:49:00] T364814: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T364814
[07:49:18] (PS1) Marostegui: es1021: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1032385 (https://phabricator.wikimedia.org/T364289)
[07:49:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s2 T364814
[07:49:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2207 with weight 0 T364814', diff saved to https://phabricator.wikimedia.org/P62455 and previous config saved to /var/cache/conftool/dbconfig/20240516-074927-arnaudb.json
[07:49:43] (CR) Marostegui: [C:+2] es1021: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1032385 (https://phabricator.wikimedia.org/T364289) (owner: Marostegui)
[07:50:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Remove db2207 from API/vslow/dump T364814', diff saved to https://phabricator.wikimedia.org/P62456 and previous config saved to /var/cache/conftool/dbconfig/20240516-075024-arnaudb.json
[07:51:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1021.eqiad.wmnet with OS bookworm
[07:55:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P62457 and previous config saved to /var/cache/conftool/dbconfig/20240516-075537-marostegui.json
[07:56:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P62458
and previous config saved to /var/cache/conftool/dbconfig/20240516-075628-arnaudb.json
[07:58:47] (PS1) Marostegui: Revert "es1021: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1032129
[08:00:04] hashar and andre: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240516T0800)
[08:00:12] o/
[08:03:09] o/
[08:04:18] (PS1) Ayounsi: Add export-format state-data json compact [homer/public] - https://gerrit.wikimedia.org/r/1032386 (https://phabricator.wikimedia.org/T362523)
[08:05:29] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1021.eqiad.wmnet with reason: host reimage
[08:06:53] (PS1) TrainBranchBot: group2 wikis to 1.43.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1032387 (https://phabricator.wikimedia.org/T361399)
[08:06:55] (CR) TrainBranchBot: [C:+2] group2 wikis to 1.43.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1032387 (https://phabricator.wikimedia.org/T361399) (owner: TrainBranchBot)
[08:07:44] jouncebot: now
[08:07:44] For the next 1 hour(s) and 52 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240516T0800)
[08:07:48] (Merged) jenkins-bot: Correct behaviour of ConfigHelper, add tests [skins/Vector] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1032128 (https://phabricator.wikimedia.org/T365084) (owner: Mabualruz)
[08:07:51] (Merged) jenkins-bot: group2 wikis to 1.43.0-wmf.5 [mediawiki-config] - https://gerrit.wikimedia.org/r/1032387 (https://phabricator.wikimedia.org/T361399) (owner: TrainBranchBot)
[08:08:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1021.eqiad.wmnet with reason: host reimage
[08:09:01] (CR) Brouberol: "Eevans: a good starting point is
https://wikitech.wikimedia.org/wiki/Kubernetes/Deployment_Charts#Enabling_egress_to_services_external_to_" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030175 (owner: 10Eevans) [08:09:43] mo_abualruz: sorry, it looks like nobody was around for the backport window :/ [08:10:01] andre and I are currently promoting all wikis to 1.43.0-wmf.5 [08:10:10] and I guess once it is done I will do your patch [08:10:39] I self-deployed one of them; the other is now blocked by the train. I can do it in the other window; it is only for the remaining 1.43.0-wmf.4 deployments [08:10:45] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T364299)', diff saved to https://phabricator.wikimedia.org/P62460 and previous config saved to /var/cache/conftool/dbconfig/20240516-081044-marostegui.json [08:10:46] ohhh [08:10:47] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [08:10:47] great :) [08:10:48] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [08:11:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance [08:11:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2165 (T364299)', diff saved to https://phabricator.wikimedia.org/P62461 and previous config saved to /var/cache/conftool/dbconfig/20240516-081107-marostegui.json [08:11:08] (03CR) 10Brouberol: [C:03+1] "Nice, thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1031593 (https://phabricator.wikimedia.org/T287491) (owner: 10Btullis) [08:11:13] mo_abualruz: well, as soon as the train has completed there will no longer be any 1.43.0-wmf.4 wikis left :) [08:11:36] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T360332)', diff saved to https://phabricator.wikimedia.org/P62462 and previous config saved to /var/cache/conftool/dbconfig/20240516-081136-arnaudb.json [08:11:40] T360332: Make the cupe_actor column nullable on WMF wikis - https://phabricator.wikimedia.org/T360332 [08:11:46] great, then I can stop this one and the work will be done here [08:12:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2126 depool', diff saved to https://phabricator.wikimedia.org/P62463 and previous config saved to /var/cache/conftool/dbconfig/20240516-081207-arnaudb.json [08:12:37] and somehow scap did not `!log` anything here, which is confusing [08:13:26] (03PS2) 10Majavah: P:openstack: neutron: add required control plane config for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1031880 (https://phabricator.wikimedia.org/T326373) [08:13:26] (03PS3) 10Majavah: site: Move cloudnet2006-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1029498 (https://phabricator.wikimedia.org/T358761) [08:13:26] (03PS1) 10Majavah: P:openstack: neutron: add ovs config to eqiad1 profiles [puppet] - 10https://gerrit.wikimedia.org/r/1032388 (https://phabricator.wikimedia.org/T326373) [08:13:29] (03PS1) 10Majavah: O:wmcs::openstack: add eqiad1 net_ovs role [puppet] - 10https://gerrit.wikimedia.org/r/1032389 (https://phabricator.wikimedia.org/T364459) [08:13:33] (03PS1) 10Majavah: site: Move cloudnet1005 to insetup_noferm to prep for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1032390 (https://phabricator.wikimedia.org/T364459) [08:13:36] (03PS1) 10Majavah: site: Move cloudnet1005 to insetup_noferm to OVS agent [puppet] - 10https://gerrit.wikimedia.org/r/1032391 
(https://phabricator.wikimedia.org/T364459) [08:14:16] (03CR) 10CI reject: [V:04-1] O:wmcs::openstack: add eqiad1 net_ovs role [puppet] - 10https://gerrit.wikimedia.org/r/1032389 (https://phabricator.wikimedia.org/T364459) (owner: 10Majavah) [08:15:39] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2466/co" [puppet] - 10https://gerrit.wikimedia.org/r/1032388 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah) [08:18:09] (03CR) 10Filippo Giunchedi: postfix: prometheus ops config for mx-out boxes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1019116 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [08:18:34] (03PS2) 10Majavah: P:openstack: neutron: add ovs config to eqiad1 profiles [puppet] - 10https://gerrit.wikimedia.org/r/1032388 (https://phabricator.wikimedia.org/T326373) [08:18:34] (03PS2) 10Majavah: O:wmcs::openstack: add eqiad1 net_ovs role [puppet] - 10https://gerrit.wikimedia.org/r/1032389 (https://phabricator.wikimedia.org/T364459) [08:18:34] (03PS2) 10Majavah: site: Move cloudnet1005 to insetup_noferm to prep for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1032390 (https://phabricator.wikimedia.org/T364459) [08:18:34] (03PS2) 10Majavah: site: Move cloudnet1005 to insetup_noferm to OVS agent [puppet] - 10https://gerrit.wikimedia.org/r/1032391 (https://phabricator.wikimedia.org/T364459) [08:18:35] (03PS4) 10Majavah: site: Move cloudnet2006-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1029498 (https://phabricator.wikimedia.org/T358761) [08:18:37] (03PS1) 10Majavah: openstack: neutron: ovs_agent: Restart on config file change [puppet] - 10https://gerrit.wikimedia.org/r/1032392 (https://phabricator.wikimedia.org/T326373) [08:19:36] (03CR) 10CI reject: [V:04-1] openstack: neutron: ovs_agent: Restart on config file change [puppet] - 10https://gerrit.wikimedia.org/r/1032392 
(https://phabricator.wikimedia.org/T326373) (owner: 10Majavah) [08:21:16] (03CR) 10Filippo Giunchedi: "Yes the zk package doesn't seem overly maintained, which might explain the drift/rot wrt upstream and logging configurations." [puppet] - 10https://gerrit.wikimedia.org/r/1031465 (owner: 10Filippo Giunchedi) [08:21:31] PROBLEM - carbon-frontend-relay metric drops on graphite1005 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=21 https://grafana.wikimedia.org/d/000000337/graphite-codfw?orgId=1&viewPanel=21 [08:22:31] RECOVERY - carbon-frontend-relay metric drops on graphite1005 is OK: OK: Less than 80.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=21 https://grafana.wikimedia.org/d/000000337/graphite-codfw?orgId=1&viewPanel=21 [08:23:10] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: group2 wikis to 1.43.0-wmf.5 refs T361399 [08:23:16] T361399: 1.43.0-wmf.5 deployment blockers - https://phabricator.wikimedia.org/T361399 [08:24:16] (03CR) 10Effie Mouzeli: [C:03+2] rdf-streaming-updater: Remove duplicate definition of k8s and zk [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031810 (https://phabricator.wikimedia.org/T287491) (owner: 10JMeybohm) [08:24:52] (03PS2) 10Majavah: openstack: neutron: ovs_agent: Restart on config file change [puppet] - 10https://gerrit.wikimedia.org/r/1032392 (https://phabricator.wikimedia.org/T326373) [08:24:52] (03PS3) 10Majavah: P:openstack: neutron: add ovs config to eqiad1 profiles [puppet] - 10https://gerrit.wikimedia.org/r/1032388 (https://phabricator.wikimedia.org/T326373) [08:24:52] (03PS3) 10Majavah: O:wmcs::openstack: add eqiad1 net_ovs role [puppet] - 10https://gerrit.wikimedia.org/r/1032389 
(https://phabricator.wikimedia.org/T364459) [08:24:52] (03PS3) 10Majavah: site: Move cloudnet1005 to insetup_noferm to prep for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1032390 (https://phabricator.wikimedia.org/T364459) [08:24:53] (03PS3) 10Majavah: site: Move cloudnet1005 to insetup_noferm to OVS agent [puppet] - 10https://gerrit.wikimedia.org/r/1032391 (https://phabricator.wikimedia.org/T364459) [08:24:56] (03PS5) 10Majavah: site: Move cloudnet2006-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1029498 (https://phabricator.wikimedia.org/T358761) [08:25:10] (03CR) 10Effie Mouzeli: [C:03+1] Remove kubernetesMasters definition from dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031593 (https://phabricator.wikimedia.org/T287491) (owner: 10Btullis) [08:25:14] (03Merged) 10jenkins-bot: rdf-streaming-updater: Remove duplicate definition of k8s and zk [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031810 (https://phabricator.wikimedia.org/T287491) (owner: 10JMeybohm) [08:30:04] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2467/co" [puppet] - 10https://gerrit.wikimedia.org/r/1032388 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah) [08:30:49] (03PS4) 10Majavah: site: Move cloudnet1005 to OVS agent [puppet] - 10https://gerrit.wikimedia.org/r/1032391 (https://phabricator.wikimedia.org/T364459) [08:30:50] (03PS6) 10Majavah: site: Move cloudnet2006-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1029498 (https://phabricator.wikimedia.org/T358761) [08:31:02] (03CR) 10Btullis: [C:03+2] Remove kubernetesMasters definition from dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031593 (https://phabricator.wikimedia.org/T287491) (owner: 10Btullis) [08:32:23] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 4): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2468/co" [puppet] - 10https://gerrit.wikimedia.org/r/1032391 (https://phabricator.wikimedia.org/T364459) (owner: 10Majavah) [08:32:40] (03PS2) 10Stevemunene: Move datahub and datahub-staging helfile deployments to dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028761 (https://phabricator.wikimedia.org/T363300) [08:33:09] !log jiji@deploy1002 helmfile [staging] START helmfile.d/services/rdf-streaming-updater: apply [08:33:16] !log jiji@deploy1002 helmfile [staging] DONE helmfile.d/services/rdf-streaming-updater: apply [08:33:26] !log jiji@deploy1002 helmfile [eqiad] START helmfile.d/services/rdf-streaming-updater: apply [08:33:32] !log jiji@deploy1002 helmfile [eqiad] DONE helmfile.d/services/rdf-streaming-updater: apply [08:33:35] (03PS5) 10Filippo Giunchedi: zookeeper: fix logging on Bookworm [puppet] - 10https://gerrit.wikimedia.org/r/1031465 [08:33:45] !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/rdf-streaming-updater: apply [08:33:53] !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/rdf-streaming-updater: apply [08:39:17] (03PS1) 10Stevemunene: Enable ingress for the datahub server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/1032393 (https://phabricator.wikimedia.org/T363299) [08:39:18] (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2207 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1030561 (https://phabricator.wikimedia.org/T364814) (owner: 10Gerrit maintenance bot) [08:41:06] (03PS1) 10JMeybohm: prometheus/ops: Refactor etcd scraping [puppet] - 10https://gerrit.wikimedia.org/r/1032394 (https://phabricator.wikimedia.org/T363307) [08:41:23] (03CR) 10Marostegui: [C:03+2] Revert "es1021: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1032129 (owner: 10Marostegui) [08:41:27] (03CR) 10CI reject: [V:04-1] prometheus/ops: Refactor etcd scraping [puppet] - 
10https://gerrit.wikimedia.org/r/1032394 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [08:41:28] (03PS2) 10Stevemunene: Enable ingress for the datahub server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/1032393 (https://phabricator.wikimedia.org/T363299) [08:41:37] !log Starting s2 codfw failover from db2204 to db2207 - T364814 [08:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:41:46] T364814: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T364814 [08:42:03] (03PS2) 10JMeybohm: prometheus/ops: Refactor etcd scraping [puppet] - 10https://gerrit.wikimedia.org/r/1032394 (https://phabricator.wikimedia.org/T363307) [08:42:03] (03PS3) 10Stevemunene: Enable ingress for the datahub server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/1032393 (https://phabricator.wikimedia.org/T363299) [08:42:32] (03CR) 10Effie Mouzeli: [C:03+2] Remove kubernetesMasters definition from all wikikube values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031811 (https://phabricator.wikimedia.org/T287491) (owner: 10JMeybohm) [08:42:44] (03Abandoned) 10Filippo Giunchedi: profile: fix kafka::broker typo [puppet] - 10https://gerrit.wikimedia.org/r/1031463 (owner: 10Filippo Giunchedi) [08:42:58] (03CR) 10CI reject: [V:04-1] Enable ingress for the datahub server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/1032393 (https://phabricator.wikimedia.org/T363299) (owner: 10Stevemunene) [08:43:36] (03CR) 10Btullis: [C:03+1] "Thanks Filippo. This looks good to me." 
[puppet] - 10https://gerrit.wikimedia.org/r/1031465 (owner: 10Filippo Giunchedi) [08:44:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2207 to s2 primary T364814', diff saved to https://phabricator.wikimedia.org/P62465 and previous config saved to /var/cache/conftool/dbconfig/20240516-084420-root.json [08:44:35] PROBLEM - carbon-frontend-relay metric drops on graphite1005 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=21 https://grafana.wikimedia.org/d/000000337/graphite-codfw?orgId=1&viewPanel=21 [08:45:16] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2469/co" [puppet] - 10https://gerrit.wikimedia.org/r/1032394 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [08:45:33] (03Merged) 10jenkins-bot: Remove kubernetesMasters definition from all wikikube values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031811 (https://phabricator.wikimedia.org/T287491) (owner: 10JMeybohm) [08:47:27] (03CR) 10Stevemunene: Move datahub and datahub-staging helfile deployments to dse-k8s (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028761 (https://phabricator.wikimedia.org/T363300) (owner: 10Stevemunene) [08:50:49] (03PS3) 10JMeybohm: prometheus/ops: Refactor etcd scraping [puppet] - 10https://gerrit.wikimedia.org/r/1032394 (https://phabricator.wikimedia.org/T363307) [08:51:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2204 with weight 500 T364814', diff saved to https://phabricator.wikimedia.org/P62466 and previous config saved to /var/cache/conftool/dbconfig/20240516-085123-arnaudb.json [08:51:27] T364814: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T364814 [08:52:10] (03CR) 10JMeybohm: 
"@ltoscano@wikimedia.org just FYI" [puppet] - 10https://gerrit.wikimedia.org/r/1032394 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [08:52:27] (03PS4) 10Stevemunene: Enable ingress for the datahub server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/1032393 (https://phabricator.wikimedia.org/T363299) [08:53:35] RECOVERY - carbon-frontend-relay metric drops on graphite1005 is OK: OK: Less than 80.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=21 https://grafana.wikimedia.org/d/000000337/graphite-codfw?orgId=1&viewPanel=21 [08:53:45] (03CR) 10JMeybohm: "This will enable scraping of etcd nodes for aux, dse and ml-staging cluster which have not been scraped at all as of now." [puppet] - 10https://gerrit.wikimedia.org/r/1032394 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [08:54:10] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2470/co" [puppet] - 10https://gerrit.wikimedia.org/r/1032394 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [08:54:33] I'm taking a look at the graphite alerts [08:58:31] !log Starting MediaModeration scanning script on `medium.dblist` - https://wikitech.wikimedia.org/wiki/MediaModeration [08:58:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:59:09] !log Scanning `enwiki` with MediaModeration script - https://wikitech.wikimedia.org/wiki/MediaModeration [08:59:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:07] mmhh looks like we're creating a statsd metric per page, obviously that's not going to work [09:01:14] MediaWiki.rest_api_latency [09:02:15] investigating further [09:02:35] PROBLEM - carbon-frontend-relay metric drops on graphite1005 is 
CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=21 https://grafana.wikimedia.org/d/000000337/graphite-codfw?orgId=1&viewPanel=21 [09:03:21] !log Stopping MediaModeration scanning script on `enwiki` [09:03:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:32] !log Stopping MediaModeration scanning script on `medium.dblist` [09:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:03:35] RECOVERY - carbon-frontend-relay metric drops on graphite1005 is OK: OK: Less than 80.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=21 https://grafana.wikimedia.org/d/000000337/graphite-codfw?orgId=1&viewPanel=21 [09:04:26] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/1031880 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah) [09:05:22] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1032392 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah) [09:05:29] (03PS1) 10DDesouza: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032397 (https://phabricator.wikimedia.org/T349774) [09:06:11] FIRING: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:06:34] (03CR) 10Arturo Borrero Gonzalez: [C:04-1] P:openstack: neutron: add ovs config to eqiad1 profiles (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1032388 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah) [09:07:33] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Group test removal', diff saved to https://phabricator.wikimedia.org/P62468 and previous config saved to /var/cache/conftool/dbconfig/20240516-090732-arnaudb.json [09:07:54] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Group test readd', diff saved to https://phabricator.wikimedia.org/P62469 and previous config saved to /var/cache/conftool/dbconfig/20240516-090753-arnaudb.json [09:08:21] (03CR) 10Arturo Borrero Gonzalez: O:wmcs::openstack: add eqiad1 net_ovs role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1032389 (https://phabricator.wikimedia.org/T364459) (owner: 10Majavah) [09:09:18] (03CR) 10DDesouza: [C:03+2] miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032397 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [09:09:23] (03CR) 10Arturo Borrero Gonzalez: "LGTM. But please don't merge until the operation window." [puppet] - 10https://gerrit.wikimedia.org/r/1032390 (https://phabricator.wikimedia.org/T364459) (owner: 10Majavah) [09:09:52] (03CR) 10Arturo Borrero Gonzalez: "LGTM. 
Please only merge during the operation window." [puppet] - 10https://gerrit.wikimedia.org/r/1032391 (https://phabricator.wikimedia.org/T364459) (owner: 10Majavah) [09:10:13] (03Merged) 10jenkins-bot: miscweb(wikiworkshop): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032397 (https://phabricator.wikimedia.org/T349774) (owner: 10DDesouza) [09:11:11] RESOLVED: ProbeDown: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_collab_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:11:57] (03PS1) 10Mabualruz: Add Watchlist to exclude list from dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032398 (https://phabricator.wikimedia.org/T365084) [09:11:57] (03CR) 10Arturo Borrero Gonzalez: site: Move cloudnet2006-dev to OVS (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1029498 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [09:14:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2204 to vslow/dump T364814', diff saved to https://phabricator.wikimedia.org/P62470 and previous config saved to /var/cache/conftool/dbconfig/20240516-091400-arnaudb.json [09:14:04] T364814: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T364814 [09:15:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'vslow/dump T364814 fix', diff saved to https://phabricator.wikimedia.org/P62471 and previous config saved to /var/cache/conftool/dbconfig/20240516-091515-arnaudb.json [09:15:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P62472 and previous config saved to /var/cache/conftool/dbconfig/20240516-091522-root.json [09:16:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'vslow/dump T364814 fix', diff saved to 
https://phabricator.wikimedia.org/P62473 and previous config saved to /var/cache/conftool/dbconfig/20240516-091613-arnaudb.json [09:17:01] (03CR) 10Volans: "This cookbook is becoming quite large, it will really benefit from being migrated to the class-based API [1] instead of the old and legacy" [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse) [09:17:25] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [09:17:28] (03PS1) 10Stevemunene: Change datahub service to use dse ingress [puppet] - 10https://gerrit.wikimedia.org/r/1032399 (https://phabricator.wikimedia.org/T363450) [09:17:47] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply [09:17:49] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [09:18:24] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [09:18:25] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [09:18:51] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [09:19:47] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:20:27] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:21:14] (03PS1) 10Vgutierrez: prometheus::lvs: Fetch MSS using getsockopt() [puppet] - 10https://gerrit.wikimedia.org/r/1032400 (https://phabricator.wikimedia.org/T365101) [09:21:32] (03PS1) 10Filippo Giunchedi: graphite: blackhole MediaWiki.rest_api_latency [puppet] - 10https://gerrit.wikimedia.org/r/1032401 (https://phabricator.wikimedia.org/T365111) [09:21:46] (03CR) 10CI reject: [V:04-1] prometheus::lvs: Fetch MSS using getsockopt() [puppet] - 10https://gerrit.wikimedia.org/r/1032400 (https://phabricator.wikimedia.org/T365101) 
(owner: 10Vgutierrez) [09:22:19] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.070 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:22:39] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.507 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:22:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 25%: post fix repool', diff saved to https://phabricator.wikimedia.org/P62474 and previous config saved to /var/cache/conftool/dbconfig/20240516-092257-arnaudb.json [09:25:18] (03PS2) 10Filippo Giunchedi: graphite: blackhole MediaWiki.rest_api metrics [puppet] - 10https://gerrit.wikimedia.org/r/1032401 (https://phabricator.wikimedia.org/T365111) [09:28:20] (03CR) 10Filippo Giunchedi: [C:03+2] graphite: blackhole MediaWiki.rest_api metrics [puppet] - 10https://gerrit.wikimedia.org/r/1032401 (https://phabricator.wikimedia.org/T365111) (owner: 10Filippo Giunchedi) [09:28:21] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [09:28:24] (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] graphite: blackhole MediaWiki.rest_api metrics [puppet] - 10https://gerrit.wikimedia.org/r/1032401 (https://phabricator.wikimedia.org/T365111) (owner: 10Filippo Giunchedi) [09:28:25] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [09:30:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P62475 and previous config saved to /var/cache/conftool/dbconfig/20240516-093028-root.json [09:33:02] FIRING: [2x] SystemdUnitFailed: docker-reporter-base-images.service on build2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed 
[09:33:49] (03PS1) 10Marostegui: es1021: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1032403 (https://phabricator.wikimedia.org/T364289) [09:34:18] (03CR) 10Marostegui: [C:03+2] es1021: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1032403 (https://phabricator.wikimedia.org/T364289) (owner: 10Marostegui) [09:37:56] (03PS3) 10Majavah: P:openstack: neutron: add required control plane config for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1031880 (https://phabricator.wikimedia.org/T326373) [09:37:56] (03PS3) 10Majavah: openstack: neutron: ovs_agent: Restart on config file change [puppet] - 10https://gerrit.wikimedia.org/r/1032392 (https://phabricator.wikimedia.org/T326373) [09:37:56] (03PS4) 10Majavah: P:openstack: neutron: add ovs config to eqiad1 profiles [puppet] - 10https://gerrit.wikimedia.org/r/1032388 (https://phabricator.wikimedia.org/T326373) [09:37:56] (03PS4) 10Majavah: O:wmcs::openstack: add eqiad1 net_ovs role [puppet] - 10https://gerrit.wikimedia.org/r/1032389 (https://phabricator.wikimedia.org/T364459) [09:37:57] (03PS4) 10Majavah: site: Move cloudnet1005 to insetup_noferm to prep for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1032390 (https://phabricator.wikimedia.org/T364459) [09:37:58] (03PS5) 10Majavah: site: Move cloudnet1005 to OVS agent [puppet] - 10https://gerrit.wikimedia.org/r/1032391 (https://phabricator.wikimedia.org/T364459) [09:38:02] (03PS7) 10Majavah: site: Move cloudnet2006-dev to OVS [puppet] - 10https://gerrit.wikimedia.org/r/1029498 (https://phabricator.wikimedia.org/T358761) [09:38:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 50%: post fix repool', diff saved to https://phabricator.wikimedia.org/P62476 and previous config saved to /var/cache/conftool/dbconfig/20240516-093803-arnaudb.json [09:38:06] (03PS1) 10Majavah: hieradata: codfw1dev: use facter to pick base interface [puppet] - 10https://gerrit.wikimedia.org/r/1032404 [09:38:10] (03PS1) 
10Majavah: hieradata: stop overriding l3_agent_bridges for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1032405 (https://phabricator.wikimedia.org/T358761) [09:39:16] (03CR) 10Marostegui: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1031033 (https://phabricator.wikimedia.org/T362786) (owner: 10Scott French) [09:40:06] (03CR) 10Marostegui: [C:03+2] conftool-data: bootstrap parser-cache sections and instances [puppet] - 10https://gerrit.wikimedia.org/r/1031033 (https://phabricator.wikimedia.org/T362786) (owner: 10Scott French) [09:40:40] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2471/console" [puppet] - 10https://gerrit.wikimedia.org/r/1032404 (owner: 10Majavah) [09:41:03] (03CR) 10Volans: [C:03+1] "Code looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1032400 (https://phabricator.wikimedia.org/T365101) (owner: 10Vgutierrez) [09:41:15] (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: codfw1dev: use facter to pick base interface [puppet] - 10https://gerrit.wikimedia.org/r/1032404 (owner: 10Majavah) [09:43:14] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2472/console" [puppet] - 10https://gerrit.wikimedia.org/r/1032405 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [09:43:36] (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: stop overriding l3_agent_bridges for OVS [puppet] - 10https://gerrit.wikimedia.org/r/1032405 (https://phabricator.wikimedia.org/T358761) (owner: 10Majavah) [09:44:01] !log clean up MediaWiki.rest_api_latency and MediaWiki.rest_api_errors - T365111 [09:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:05] T365111: Per-page graphite metrics created for MediaWiki.rest_api_latency / rest_api_errors - https://phabricator.wikimedia.org/T365111 [09:45:35] !log 
marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P62478 and previous config saved to /var/cache/conftool/dbconfig/20240516-094534-root.json [09:45:50] (03CR) 10Majavah: P:openstack: neutron: add ovs config to eqiad1 profiles (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1032388 (https://phabricator.wikimedia.org/T326373) (owner: 10Majavah) [09:46:22] (03CR) 10Majavah: O:wmcs::openstack: add eqiad1 net_ovs role (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1032389 (https://phabricator.wikimedia.org/T364459) (owner: 10Majavah) [09:46:39] (03PS1) 10Klausman: team-dcops/mgmt: Change runbook link to one with BMC info [alerts] - 10https://gerrit.wikimedia.org/r/1032406 [09:46:40] (03CR) 10Filippo Giunchedi: [C:03+1] "Untested but LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1032394 (https://phabricator.wikimedia.org/T363307) (owner: 10JMeybohm) [09:47:07] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [09:47:10] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [09:47:13] (03CR) 10Majavah: [C:04-2] "DNM until just before announced maintenance window" [puppet] - 10https://gerrit.wikimedia.org/r/1032390 (https://phabricator.wikimedia.org/T364459) (owner: 10Majavah) [09:47:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T352010)', diff saved to https://phabricator.wikimedia.org/P62479 and previous config saved to /var/cache/conftool/dbconfig/20240516-094717-ladsgroup.json [09:47:22] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [09:47:29] (03CR) 10Majavah: [C:04-2] "DNM until scheduled maintenance window" [puppet] - 10https://gerrit.wikimedia.org/r/1032391 (https://phabricator.wikimedia.org/T364459) (owner: 10Majavah) 
[09:48:00] (CR) Majavah: site: Move cloudnet2006-dev to OVS (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1029498 (https://phabricator.wikimedia.org/T358761) (owner: Majavah)
[09:48:08] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2473/co" [puppet] - https://gerrit.wikimedia.org/r/1032388 (https://phabricator.wikimedia.org/T326373) (owner: Majavah)
[09:48:55] (PS2) Vgutierrez: prometheus::lvs: Fetch MSS using getsockopt() [puppet] - https://gerrit.wikimedia.org/r/1032400 (https://phabricator.wikimedia.org/T365101)
[09:48:56] (CR) Majavah: [V:+1 C:-2] "PCC SUCCESS (CORE_DIFF 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2474/co" [puppet] - https://gerrit.wikimedia.org/r/1032391 (https://phabricator.wikimedia.org/T364459) (owner: Majavah)
[09:50:06] (CR) Vgutierrez: "thx for the review volans" [puppet] - https://gerrit.wikimedia.org/r/1032400 (https://phabricator.wikimedia.org/T365101) (owner: Vgutierrez)
[09:51:37] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for sg912 - https://phabricator.wikimedia.org/T365118 (SGupta-WMF) NEW
[09:52:10] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for sg912 - https://phabricator.wikimedia.org/T365118#9803732 (SGupta-WMF) a:Eevans
[09:52:37] (PS1) Marostegui: dbconfig.schema: Add pc [puppet] - https://gerrit.wikimedia.org/r/1032407 (https://phabricator.wikimedia.org/T362786)
[09:52:47] SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for sg912 - https://phabricator.wikimedia.org/T365118#9803734 (SGupta-WMF) @WDoranWMF Please approve.
[09:52:55] (CR) CI reject: [V:-1] dbconfig.schema: Add pc [puppet] - https://gerrit.wikimedia.org/r/1032407 (https://phabricator.wikimedia.org/T362786) (owner: Marostegui)
[09:54:59] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 75%: post fix repool', diff saved to https://phabricator.wikimedia.org/P62480 and previous config saved to /var/cache/conftool/dbconfig/20240516-095459-arnaudb.json
[09:56:15] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[09:56:16] !log ladsgroup@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 1 day, 0:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[09:57:46] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[09:58:10] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[09:58:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T352010)', diff saved to https://phabricator.wikimedia.org/P62481 and previous config saved to /var/cache/conftool/dbconfig/20240516-095817-ladsgroup.json
[09:58:21] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[09:59:28] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P62482 and previous config saved to /var/cache/conftool/dbconfig/20240516-095927-ladsgroup.json
[10:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240516T1000)
[10:00:04] claime: A patch you scheduled for MediaWiki infrastructure (UTC mid-day) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[10:00:14] (CR) Clément Goubert: [C:+2] httpbb: Add tests for new redirects [puppet] - https://gerrit.wikimedia.org/r/1031874 (https://phabricator.wikimedia.org/T25216) (owner: Clément Goubert)
[10:00:29] (PS2) Effie Mouzeli: (WIP) memcached: make the service run under the memcache user [puppet] - https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950)
[10:00:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P62483 and previous config saved to /var/cache/conftool/dbconfig/20240516-100040-root.json
[10:02:17] !log cumin 'A:all-mw' "disable-puppet 'New redirects T25216 T204830 T31186 - cgoubert'"
[10:02:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:02:25] T25216: Move the Nourmande Wikipedia from nrm to nrf - https://phabricator.wikimedia.org/T25216
[10:02:25] T204830: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830
[10:02:25] T31186: Rename Võro Wikipedia, fiu-vro -> vro - https://phabricator.wikimedia.org/T31186
[10:03:02] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:03:27] PROBLEM - Host mr1-magru.oob is DOWN: PING CRITICAL - Packet loss = 100%
[10:04:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add pc1011 to pc1 T362786', diff saved to https://phabricator.wikimedia.org/P62484 and previous config saved to /var/cache/conftool/dbconfig/20240516-100418-marostegui.json
[10:04:22] T362786: Enable dbctl for parsercache - https://phabricator.wikimedia.org/T362786
[10:05:38] (PS1) Fabfur: cache:haproxy: %HP variable in log-format to log also invalid uri [puppet] - https://gerrit.wikimedia.org/r/1032410 (https://phabricator.wikimedia.org/T365117)
[10:05:42] (Abandoned) Marostegui: dbconfig.schema: Add pc [puppet] - https://gerrit.wikimedia.org/r/1032407 (https://phabricator.wikimedia.org/T362786) (owner: Marostegui)
[10:06:13] (CR) Clément Goubert: [C:+2] Add 'nrf' as alias for 'nrm' [puppet] - https://gerrit.wikimedia.org/r/527909 (https://phabricator.wikimedia.org/T25216) (owner: Fomafix)
[10:06:24] (CR) Clément Goubert: [C:+2] Add redirects from 'sgs' to 'bat-smg' [puppet] - https://gerrit.wikimedia.org/r/481540 (https://phabricator.wikimedia.org/T204830) (owner: Fomafix)
[10:06:31] PROBLEM - Host mr1-magru.oob IPv6 is DOWN: CRITICAL - Host Unreachable (2804:ad4:ff12:19::84)
[10:06:37] (PS10) Fomafix: Add redirects from 'sgs' to 'bat-smg' [puppet] - https://gerrit.wikimedia.org/r/481540 (https://phabricator.wikimedia.org/T204830)
[10:07:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add pc2011 to pc1 T362786', diff saved to https://phabricator.wikimedia.org/P62485 and previous config saved to /var/cache/conftool/dbconfig/20240516-100744-marostegui.json
[10:07:54] (CR) Clément Goubert: "recheck" [puppet] - https://gerrit.wikimedia.org/r/481540 (https://phabricator.wikimedia.org/T204830) (owner: Fomafix)
[10:08:29] RECOVERY - Host mr1-magru.oob is UP: PING OK - Packet loss = 0%, RTA = 118.98 ms
[10:09:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add pc2012 and pc1012 to pc2 T362786', diff saved to https://phabricator.wikimedia.org/P62486 and previous config saved to /var/cache/conftool/dbconfig/20240516-100858-marostegui.json
[10:09:17] (CR) Brouberol: Move datahub and datahub-staging helfile deployments to dse-k8s (3 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1028761 (https://phabricator.wikimedia.org/T363300) (owner: Stevemunene)
[10:10:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add pc2013 and pc1013 to pc2 T362786', diff saved to https://phabricator.wikimedia.org/P62487 and previous config saved to /var/cache/conftool/dbconfig/20240516-101009-marostegui.json
[10:10:16] T362786: Enable dbctl for parsercache - https://phabricator.wikimedia.org/T362786
[10:10:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2126 (re)pooling @ 100%: post fix repool', diff saved to https://phabricator.wikimedia.org/P62488 and previous config saved to /var/cache/conftool/dbconfig/20240516-101018-arnaudb.json
[10:10:59] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:11:03] (CR) Clément Goubert: [C:+2] Add 'vro' as alias for 'fiu-vro' [puppet] - https://gerrit.wikimedia.org/r/527915 (https://phabricator.wikimedia.org/T31186) (owner: Fomafix)
[10:11:16] (PS6) Fomafix: Add 'vro' as alias for 'fiu-vro' [puppet] - https://gerrit.wikimedia.org/r/527915 (https://phabricator.wikimedia.org/T31186)
[10:11:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add pc2014 and pc1014 to pc4 T362786', diff saved to https://phabricator.wikimedia.org/P62489 and previous config saved to /var/cache/conftool/dbconfig/20240516-101122-marostegui.json
[10:11:25] (CR) Clément Goubert: "recheck" [puppet] - https://gerrit.wikimedia.org/r/527915 (https://phabricator.wikimedia.org/T31186) (owner: Fomafix)
[10:11:33] RECOVERY - Host mr1-magru.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 132.74 ms
[10:13:52] FIRING: SystemdUnitFailed: httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:15:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add pc2016 and pc1016 to pc4 T362786', diff saved to https://phabricator.wikimedia.org/P62490 and previous config saved to /var/cache/conftool/dbconfig/20240516-101543-marostegui.json
[10:15:49] T362786: Enable dbctl for parsercache - https://phabricator.wikimedia.org/T362786
[10:15:49] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P62491 and previous config saved to /var/cache/conftool/dbconfig/20240516-101548-ladsgroup.json
[10:15:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1021 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P62492 and previous config saved to /var/cache/conftool/dbconfig/20240516-101553-root.json
[10:17:32] (CR) Fabfur: [V:+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1032410 (https://phabricator.wikimedia.org/T365117) (owner: Fabfur)
[10:18:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Add pc2015 and pc1015 to pc4 as depooled spares T362786', diff saved to https://phabricator.wikimedia.org/P62493 and previous config saved to /var/cache/conftool/dbconfig/20240516-101829-marostegui.json
[10:19:05] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[10:19:50] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[10:21:12] !log New redirects ok on mwdebug - T25216 T204830 T31186
[10:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:18] T25216: Move the Nourmande Wikipedia from nrm to nrf - https://phabricator.wikimedia.org/T25216
[10:21:18] T204830: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830
[10:21:18] T31186: Rename Võro Wikipedia, fiu-vro -> vro - https://phabricator.wikimedia.org/T31186
[10:22:16] !log
cgoubert@deploy1002 Started scap: Deploy new redirects to mw-on-k8s - T25216 T204830 T31186
[10:28:25] (PS2) Fabfur: cache:haproxy: use %HP in log-format to log absolute-form reqs [puppet] - https://gerrit.wikimedia.org/r/1032410 (https://phabricator.wikimedia.org/T365117)
[10:29:59] !log cgoubert@deploy1002 Finished scap: Deploy new redirects to mw-on-k8s - T25216 T204830 T31186 (duration: 08m 06s)
[10:30:08] T25216: Move the Nourmande Wikipedia from nrm to nrf - https://phabricator.wikimedia.org/T25216
[10:30:09] T204830: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830
[10:30:09] T31186: Rename Võro Wikipedia, fiu-vro -> vro - https://phabricator.wikimedia.org/T31186
[10:30:18] (CR) FNegri: [C:+2] wikireplicas: Drop gu_salt from maintain-views [puppet] - https://gerrit.wikimedia.org/r/1029709 (https://phabricator.wikimedia.org/T364435) (owner: Zabe)
[10:30:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Test pc4 master switch', diff saved to https://phabricator.wikimedia.org/P62494 and previous config saved to /var/cache/conftool/dbconfig/20240516-103039-marostegui.json
[10:30:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P62495 and previous config saved to /var/cache/conftool/dbconfig/20240516-103055-ladsgroup.json
[10:31:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Test pc4 master switch', diff saved to https://phabricator.wikimedia.org/P62496 and previous config saved to /var/cache/conftool/dbconfig/20240516-103148-marostegui.json
[10:31:53] !log cumin 'A:all-mw' "enable-puppet 'New redirects T25216 T204830 T31186 - cgoubert'"
[10:32:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:44] !log cumin 'A:all-mw' -b30 "run-puppet-agent -q" - T25216 T204830 T31186
[10:32:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:10] (PS1) Jelto: gitlab: bump exporter version to v1.0.4 [puppet] - https://gerrit.wikimedia.org/r/1032414 (https://phabricator.wikimedia.org/T354656)
[10:34:21] (CR) Mabualruz: [C:+1] Disable font size configuration on talk pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1032088 (https://phabricator.wikimedia.org/T364887) (owner: Jdlrobson)
[10:36:30] (CR) Jelto: [C:+2] gitlab: bump exporter version to v1.0.4 [puppet] - https://gerrit.wikimedia.org/r/1032414 (https://phabricator.wikimedia.org/T354656) (owner: Jelto)
[10:37:57] !log fnegri@cumin1002 START - Cookbook sre.wikireplicas.update-views
[10:40:59] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[10:41:06] (PS1) Clément Goubert: Revert "httpbb: Add tests for new redirects" [puppet] - https://gerrit.wikimedia.org/r/1032145
[10:43:02] RESOLVED: SystemdUnitFailed: httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:43:53] !log New redirects for T25216 T204830 T31186 operational
[10:43:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:44:00] T25216: Move the Nourmande Wikipedia from nrm to nrf - https://phabricator.wikimedia.org/T25216
[10:44:00] T204830: Temporarily redirect sgs.wikipedia.org to bat-smg.wikipedia.org until bat-smg->sgs move can be done - https://phabricator.wikimedia.org/T204830
[10:44:01] T31186: Rename Võro Wikipedia, fiu-vro -> vro - https://phabricator.wikimedia.org/T31186
[10:46:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'db1202 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P62497 and previous config saved to /var/cache/conftool/dbconfig/20240516-104601-ladsgroup.json
[10:47:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:48:04] !log fnegri@cumin1002 END (PASS) - Cookbook sre.wikireplicas.update-views (exit_code=0)
[10:48:29] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:52:39] (CR) Majavah: [C:+2] P:openstack: neutron: add required control plane config for OVS [puppet] - https://gerrit.wikimedia.org/r/1031880 (https://phabricator.wikimedia.org/T326373) (owner: Majavah)
[10:56:11] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:58:01] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:58:04] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[10:58:27] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51925 bytes in 6.168 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:58:39] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.254 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[10:59:14] (CR) Majavah: [C:+2] openstack: neutron: ovs_agent: Restart on config file change [puppet] - https://gerrit.wikimedia.org/r/1032392 (https://phabricator.wikimedia.org/T326373) (owner: Majavah)
[11:00:53] (PS5) Majavah: P:openstack: neutron: add ovs config to eqiad1 profiles [puppet] - https://gerrit.wikimedia.org/r/1032388 (https://phabricator.wikimedia.org/T326373)
[11:00:53] (PS5) Majavah: O:wmcs::openstack: add eqiad1 net_ovs role [puppet] - https://gerrit.wikimedia.org/r/1032389 (https://phabricator.wikimedia.org/T364459)
[11:00:53] (PS5) Majavah: site: Move cloudnet1005 to insetup_noferm to prep for OVS [puppet] - https://gerrit.wikimedia.org/r/1032390 (https://phabricator.wikimedia.org/T364459)
[11:00:53] (PS6) Majavah: site: Move cloudnet1005 to OVS agent [puppet] - https://gerrit.wikimedia.org/r/1032391 (https://phabricator.wikimedia.org/T364459)
[11:00:54] (PS8) Majavah: site: Move cloudnet2006-dev to OVS [puppet] - https://gerrit.wikimedia.org/r/1029498 (https://phabricator.wikimedia.org/T358761)
[11:02:35] (CR) Majavah: [V:+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2478/console" [puppet] -
https://gerrit.wikimedia.org/r/1032388 (https://phabricator.wikimedia.org/T326373) (owner: Majavah)
[11:03:02] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:04:35] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1032388 (https://phabricator.wikimedia.org/T326373) (owner: Majavah)
[11:04:51] (CR) Majavah: [V:+1 C:+2] P:openstack: neutron: add ovs config to eqiad1 profiles [puppet] - https://gerrit.wikimedia.org/r/1032388 (https://phabricator.wikimedia.org/T326373) (owner: Majavah)
[11:04:57] (CR) Arturo Borrero Gonzalez: [C:+1] "LGTM." [puppet] - https://gerrit.wikimedia.org/r/1032389 (https://phabricator.wikimedia.org/T364459) (owner: Majavah)
[11:05:37] (PS6) Majavah: O:wmcs::openstack: add eqiad1 net_ovs role [puppet] - https://gerrit.wikimedia.org/r/1032389 (https://phabricator.wikimedia.org/T364459)
[11:05:37] (PS6) Majavah: site: Move cloudnet1005 to insetup_noferm to prep for OVS [puppet] - https://gerrit.wikimedia.org/r/1032390 (https://phabricator.wikimedia.org/T364459)
[11:05:37] (PS7) Majavah: site: Move cloudnet1005 to OVS agent [puppet] - https://gerrit.wikimedia.org/r/1032391 (https://phabricator.wikimedia.org/T364459)
[11:05:37] (PS9) Majavah: site: Move cloudnet2006-dev to OVS [puppet] - https://gerrit.wikimedia.org/r/1029498 (https://phabricator.wikimedia.org/T358761)
[11:08:49] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:09:31] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:09:33] (CR) Majavah: [C:+2] O:wmcs::openstack: add eqiad1 net_ovs role [puppet] - https://gerrit.wikimedia.org/r/1032389 (https://phabricator.wikimedia.org/T364459) (owner: Majavah)
[11:09:49] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8617 bytes in 8.621 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:10:23] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51924 bytes in 0.176 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:19:55] FIRING: [2x] KubernetesAPINotScrapable: k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[11:22:51] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:23:35] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:24:52] (PS3) Effie Mouzeli: (WIP) memcached: make the service run under the memcache user [puppet] - https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950)
[11:28:52] (PS4) Effie Mouzeli: memcached: add memcache user option [puppet] - https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950)
[11:31:04] (CR) Effie Mouzeli: "PCC: https://puppet-compiler.wmflabs.org/output/1026609/2479/" [puppet] - https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) (owner: Effie Mouzeli)
[11:31:13] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:34:46] (PS5) Effie Mouzeli: memcached: add memcache user option [puppet] - https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950)
[11:42:11] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:45:51] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8617 bytes in 2.038 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:46:39] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[11:50:14] (CR) JMeybohm: [V:+1 C:+2] prometheus/ops: Refactor etcd scraping [puppet] - https://gerrit.wikimedia.org/r/1032394 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm)
[11:50:59] RECOVERY - snapshot of s7 in eqiad on backupmon1001 is OK: Last snapshot for s7 at eqiad (db1171) taken on 2024-05-16 11:00:48 (873 GiB, +0.2 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[11:59:56] (PS1) JMeybohm: prometheus/ops: Refactor etcd scraping, fix hostnames [puppet] - https://gerrit.wikimedia.org/r/1032453 (https://phabricator.wikimedia.org/T363307)
[12:00:04] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240516T1200)
[12:00:15] (CR) CI reject: [V:-1] prometheus/ops: Refactor etcd scraping, fix hostnames [puppet] - https://gerrit.wikimedia.org/r/1032453 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm)
[12:00:27] (PS2) JMeybohm: prometheus/ops: Refactor etcd scraping, fix hostnames [puppet] - https://gerrit.wikimedia.org/r/1032453 (https://phabricator.wikimedia.org/T363307)
[12:00:48] (CR) CI reject: [V:-1] prometheus/ops: Refactor etcd scraping, fix hostnames [puppet] - https://gerrit.wikimedia.org/r/1032453 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm)
[12:01:51] (PS3) JMeybohm: prometheus/ops: Refactor etcd scraping, fix hostnames [puppet] - https://gerrit.wikimedia.org/r/1032453 (https://phabricator.wikimedia.org/T363307)
[12:03:02] FIRING: [2x] JobUnavailable: Reduced availability for job etcd in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:03:23] SRE, Data-Engineering, Dumps-Generation, Data-Platform-SRE (2024.05.06 - 2024.05.26), Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9804192 (BTullis) @xcollazo added [[https://gerrit.wikimedia.org/r/c/operations/puppet/+/102...
[12:04:13] (CR) Filippo Giunchedi: [C:+1] prometheus/ops: Refactor etcd scraping, fix hostnames [puppet] - https://gerrit.wikimedia.org/r/1032453 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm)
[12:05:06] (CR) JMeybohm: [C:+2] prometheus/ops: Refactor etcd scraping, fix hostnames [puppet] - https://gerrit.wikimedia.org/r/1032453 (https://phabricator.wikimedia.org/T363307) (owner: JMeybohm)
[12:13:02] FIRING: [2x] JobUnavailable: Reduced availability for job etcd in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[12:16:05] (PS3) Slyngshede: LDAP Eventlog [software/bitu] - https://gerrit.wikimedia.org/r/1026919 (https://phabricator.wikimedia.org/T163478)
[12:16:09] SRE, Data-Engineering, Dumps-Generation, Data-Platform-SRE (2024.05.06 - 2024.05.26), Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9804237 (BTullis) Here is a one-liner to list the next scheduled runs of all of the timers f...
[12:19:06] (CR) CI reject: [V:-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1032459 (owner: L10n-bot)
[12:19:57] (PS4) Slyngshede: LDAP Eventlog [software/bitu] - https://gerrit.wikimedia.org/r/1026919 (https://phabricator.wikimedia.org/T163478)
[12:32:23] (CR) Filippo Giunchedi: [C:+1] "I was finally able to test this and it works for me! On the second puppet run but still no intervention needed" [puppet] - https://gerrit.wikimedia.org/r/1002387 (https://phabricator.wikimedia.org/T352640) (owner: Filippo Giunchedi)
[12:32:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T352010)', diff saved to https://phabricator.wikimedia.org/P62498 and previous config saved to /var/cache/conftool/dbconfig/20240516-123235-ladsgroup.json
[12:32:39] (CR) Filippo Giunchedi: [V:+1 C:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2481/co" [puppet] - https://gerrit.wikimedia.org/r/1002387 (https://phabricator.wikimedia.org/T352640) (owner: Filippo Giunchedi)
[12:36:02] (CR) Filippo Giunchedi: [C:+1] "I think this is ready to go, these hosts currently run postgresql::server" [puppet] - https://gerrit.wikimedia.org/r/1002387 (https://phabricator.wikimedia.org/T352640) (owner: Filippo Giunchedi)
[12:36:03] (PS3) Vgutierrez: prometheus::lvs: Fetch MSS using getsockopt() [puppet] - https://gerrit.wikimedia.org/r/1032400 (https://phabricator.wikimedia.org/T365101)
[12:44:03] (PS4) Vgutierrez: prometheus::lvs: Fetch MSS using getsockopt() [puppet] - https://gerrit.wikimedia.org/r/1032400 (https://phabricator.wikimedia.org/T365101)
[12:45:35] (CR) BBlack: [C:+1] prometheus::lvs: Fetch MSS using getsockopt() [puppet] - https://gerrit.wikimedia.org/r/1032400 (https://phabricator.wikimedia.org/T365101) (owner: Vgutierrez)
[12:47:26] (CR) Vgutierrez: [C:+2] prometheus::lvs: Fetch MSS using getsockopt() [puppet] - https://gerrit.wikimedia.org/r/1032400 (https://phabricator.wikimedia.org/T365101) (owner: Vgutierrez)
[12:47:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P62499 and previous config saved to /var/cache/conftool/dbconfig/20240516-124743-ladsgroup.json
[12:49:03] (PS5) Fomafix: Add 'rup' as alias for 'roa-rup' [puppet] - https://gerrit.wikimedia.org/r/527917 (https://phabricator.wikimedia.org/T17988)
[12:49:38] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:49:40] RESOLVED: [2x] KubernetesAPINotScrapable: k8s-aux@eqiad is failing to scrape the k8s api - https://phabricator.wikimedia.org/T343529 - TODO - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPINotScrapable
[12:50:56] (PS3) Jsn.sherman: CommonSettings-labs: Correct wgAutoModeratorLiftWingBaseUrl [mediawiki-config] - https://gerrit.wikimedia.org/r/1031999 (https://phabricator.wikimedia.org/T364034)
[12:51:45] SRE, LDAP-Access-Requests: Grant Access to nda for Ricki Jay - https://phabricator.wikimedia.org/T365138 (WMDE-leszek) NEW
[12:55:50] (PS3) Stevemunene: Move datahub and datahub-staging helfile deployments to dse-k8s [deployment-charts] - https://gerrit.wikimedia.org/r/1028761 (https://phabricator.wikimedia.org/T363300)
[12:59:09] (PS1) Ayounsi: Junos: use "json compact" format [cookbooks] - https://gerrit.wikimedia.org/r/1032477 (https://phabricator.wikimedia.org/T362523)
[13:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor My software never has bugs. It just develops random features. Rise for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240516T1300).
[13:00:04] hnowlan and JSherman: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:26] \o here
[13:00:42] o/
[13:00:48] I can’t deploy yet but will be available later in the window
[13:00:48] o/
[13:00:52] * thcipriani lurks
[13:01:01] (PS2) Ayounsi: Junos: use "json compact" format [cookbooks] - https://gerrit.wikimedia.org/r/1032477 (https://phabricator.wikimedia.org/T362523)
[13:01:03] also, hype for 1031028 \o/
[13:01:31] (CR) Stevemunene: Move datahub and datahub-staging helfile deployments to dse-k8s (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/1028761 (https://phabricator.wikimedia.org/T363300) (owner: Stevemunene)
[13:01:54] Lucas_WMDE: JSherman and I can deploy
[13:02:25] sounds good 👍
[13:02:53] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:02:53] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P62500 and previous config saved to /var/cache/conftool/dbconfig/20240516-130252-ladsgroup.json
[13:03:09] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:03:59] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8617 bytes in 1.292 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:04:45] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51923 bytes in 0.083 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[13:04:46] my meeting fell through anyway, so I can also deploy now ^^
[13:04:52] (CR) CI reject: [V:-1] Junos: use "json compact" format [cookbooks] - https://gerrit.wikimedia.org/r/1032477 (https://phabricator.wikimedia.org/T362523) (owner: Ayounsi)
[13:05:04] we're looking at error logs right now, hnowlan: can you rebase your patch while we look at that?
[13:06:04] JSherman: will do
[13:06:11] (PS3) Ayounsi: Junos: use "json compact" format [cookbooks] - https://gerrit.wikimedia.org/r/1032477 (https://phabricator.wikimedia.org/T362523)
[13:06:30] (PS2) Hnowlan: Enable async jobqueue-powered URL uploads on commons [mediawiki-config] - https://gerrit.wikimedia.org/r/1031028 (https://phabricator.wikimedia.org/T295007)
[13:09:38] (PS1) Lucas Werkmeister (WMDE): Make EntitySchemaValue::getArrayValue() match EntityIdValue [extensions/EntitySchema] (wmf/1.43.0-wmf.5) - https://gerrit.wikimedia.org/r/1032429 (https://phabricator.wikimedia.org/T362955)
[13:09:52] (PS1) Lucas Werkmeister (WMDE): Make EntitySchemaValue::getArrayValue() match EntityIdValue [extensions/EntitySchema] (wmf/1.43.0-wmf.4) - https://gerrit.wikimedia.org/r/1032430 (https://phabricator.wikimedia.org/T362955)
[13:10:07] ^ I’ll add those backports to the calendar in a moment
[13:10:12] hey Lucas_WMDE could i please add a config patch to the backport window?
[13:10:13] also… that’s a *lot* of errors in logspam-watch ._.
[13:11:09] Lucas_WMDE: we just filed a task for this, it looks like some maintenance script may be upset about some database fiddling that's happening?
[13:11:10] (03PS1) 10Marostegui: es1024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1032479 (https://phabricator.wikimedia.org/T364289) [13:11:12] (afaict) [13:11:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool es1024 T364289', diff saved to https://phabricator.wikimedia.org/P62501 and previous config saved to /var/cache/conftool/dbconfig/20240516-131111-root.json [13:11:16] T364289: Reimage external store hosts with Bookworm - https://phabricator.wikimedia.org/T364289 [13:11:52] (03CR) 10Marostegui: [C:03+2] es1024: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1032479 (https://phabricator.wikimedia.org/T364289) (owner: 10Marostegui) [13:12:19] Jdlrobson: it’s looking pretty full already tbh :/ [13:12:24] marostegui: FYI I think https://phabricator.wikimedia.org/T365140 could be related to the maintenance you're doing [13:12:33] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host es1024.eqiad.wmnet with OS bookworm [13:12:59] hnowlan: about to run scap [13:13:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031028 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan) [13:13:40] thanks [13:13:53] (03Merged) 10jenkins-bot: Enable async jobqueue-powered URL uploads on commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031028 (https://phabricator.wikimedia.org/T295007) (owner: 10Hnowlan) [13:14:11] !log jsn@deploy1002 Started scap: Backport for [[gerrit:1031028|Enable async jobqueue-powered URL uploads on commons (T295007)]] [13:14:15] T295007: Upload by URL should use the job queue, possibly chunked with range requests - https://phabricator.wikimedia.org/T295007 [13:14:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T364290 db2176', diff saved to https://phabricator.wikimedia.org/P62502 and previous config saved to /var/cache/conftool/dbconfig/20240516-131429-arnaudb.json 
[13:14:33] T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290 [13:14:40] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "kicking off gate-and-submit ahead of deployment" [extensions/EntitySchema] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032429 (https://phabricator.wikimedia.org/T362955) (owner: 10Lucas Werkmeister (WMDE)) [13:14:44] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "kicking off gate-and-submit ahead of deployment" [extensions/EntitySchema] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1032430 (https://phabricator.wikimedia.org/T362955) (owner: 10Lucas Werkmeister (WMDE)) [13:15:49] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2176.codfw.wmnet [13:15:56] !log arnaudb@cumin1002 END (ERROR) - Cookbook sre.mysql.upgrade (exit_code=97) for db2176.codfw.wmnet [13:16:40] (03PS1) 10Elukey: blubber: update to use buildkit [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1032481 [13:16:40] (03PS1) 10Elukey: Add proxy_host setting to the S3 cache. [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1032482 (https://phabricator.wikimedia.org/T344324) [13:16:52] !log jsn@deploy1002 jsn and hnowlan: Backport for [[gerrit:1031028|Enable async jobqueue-powered URL uploads on commons (T295007)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:17:00] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2176.codfw.wmnet with OS bookworm [13:17:03] hnowlan: please test [13:17:46] testing [13:17:47] (03PS1) 10Vgutierrez: hiera: Enable IPIP on high-traffic2@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1032483 (https://phabricator.wikimedia.org/T357257) [13:17:48] JSherman, thcipriani: I feel like https://phabricator.wikimedia.org/T365140 should be UBN, what do you think? 
[13:17:49] (03PS1) 10Vgutierrez: hiera: Enable IPIP on upload@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/1032484 (https://phabricator.wikimedia.org/T357257) [13:17:56] half a million log messages per hour is a lot [13:18:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T352010)', diff saved to https://phabricator.wikimedia.org/P62503 and previous config saved to /var/cache/conftool/dbconfig/20240516-131800-ladsgroup.json [13:18:05] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:18:07] Lucas_WMDE: +1 seems reasonable [13:18:24] ok, priority applied [13:18:31] * Lucas_WMDE waits for people to magically materialize now ;) [13:18:33] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 37 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:18:48] (03PS2) 10Elukey: blubber: update to use buildkit [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1032481 [13:18:48] (03PS2) 10Elukey: Add proxy_host setting to the S3 cache. [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1032482 (https://phabricator.wikimedia.org/T344324) [13:19:01] JSherman: looks good [13:19:08] !log jsn@deploy1002 jsn and hnowlan: Continuing with sync [13:20:59] (03CR) 10Slyngshede: "Due to some query limitations in LDAP, the easiest way to do this seems to be to just return one month at a time." 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1026919 (https://phabricator.wikimedia.org/T163478) (owner: 10Slyngshede) [13:21:01] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 3 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/1032483 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:21:29] (03CR) 10Elukey: "Hi folks! I haven't fully tested this solution, I hoped to do it in k8s staging and take it from there. Lemme know if it makes sense or no" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1032482 (https://phabricator.wikimedia.org/T344324) (owner: 10Elukey) [13:23:33] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 28 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:23:59] (03PS1) 10Eevans: cassandra: add faux creds for data_gateway role [labs/private] - 10https://gerrit.wikimedia.org/r/1032485 (https://phabricator.wikimedia.org/T364921) [13:24:27] (03PS2) 10CDobbins: purged: add Puppet overrides to use cfssl for certs in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1032106 (https://phabricator.wikimedia.org/T360506) [13:24:43] (03PS2) 10Eevans: cassandra: add data_gateway Cassandra role [puppet] - 10https://gerrit.wikimedia.org/r/1032034 (https://phabricator.wikimedia.org/T364921) [13:25:22] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1032484 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:25:44] (03CR) 10Eevans: [V:03+2 C:03+2] cassandra: add faux creds for data_gateway role [labs/private] - 
10https://gerrit.wikimedia.org/r/1032485 (https://phabricator.wikimedia.org/T364921) (owner: 10Eevans) [13:26:52] (03PS1) 10TChin: datasets-config: Change mesh port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032486 (https://phabricator.wikimedia.org/T357434) [13:27:02] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1024.eqiad.wmnet with reason: host reimage [13:27:56] (03PS1) 10TChin: datasets-config: Rename next pods to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032487 (https://phabricator.wikimedia.org/T357434) [13:28:20] (03PS3) 10Eevans: cassandra: add data_gateway Cassandra role [puppet] - 10https://gerrit.wikimedia.org/r/1032034 (https://phabricator.wikimedia.org/T364921) [13:28:28] (03PS1) 10Vgutierrez: depool upload@eqsin before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1032488 (https://phabricator.wikimedia.org/T357257) [13:31:00] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1024.eqiad.wmnet with reason: host reimage [13:32:29] !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1031028|Enable async jobqueue-powered URL uploads on commons (T295007)]] (duration: 18m 18s) [13:32:34] T295007: Upload by URL should use the job queue, possibly chunked with range requests - https://phabricator.wikimedia.org/T295007 [13:32:58] hnowlan: you should be good to go [13:33:24] my backports are about to merge btw [13:33:39] (03PS2) 10Jdlrobson: Add Watchlist to exclude list from dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032398 (https://phabricator.wikimedia.org/T365084) (owner: 10Mabualruz) [13:34:00] (03PS3) 10Jdlrobson: Fix exclude list for dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032398 (https://phabricator.wikimedia.org/T365084) (owner: 10Mabualruz) [13:34:09] JSherman: brilliant, thank you! 
[13:34:32] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2176.codfw.wmnet with reason: host reimage [13:34:40] +2ing myself [13:34:45] (03CR) 10Jsn.sherman: [C:03+2] CommonSettings-labs: Correct wgAutoModeratorLiftWingBaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031999 (https://phabricator.wikimedia.org/T364034) (owner: 10Jsn.sherman) [13:35:25] (03Merged) 10jenkins-bot: CommonSettings-labs: Correct wgAutoModeratorLiftWingBaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031999 (https://phabricator.wikimedia.org/T364034) (owner: 10Jsn.sherman) [13:36:04] (03CR) 10Brouberol: [C:03+1] "Yes please!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032487 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [13:36:15] (03CR) 10Ssingh: "Looks good! Let's run PCC on this?" [puppet] - 10https://gerrit.wikimedia.org/r/1032106 (https://phabricator.wikimedia.org/T360506) (owner: 10CDobbins) [13:36:40] fast forwarded myself [13:36:53] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for sg912 - https://phabricator.wikimedia.org/T365118#9804622 (10Eevans) [13:37:27] (03PS2) 10TChin: datasets-config: Change mesh port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032486 (https://phabricator.wikimedia.org/T357434) [13:37:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2176.codfw.wmnet with reason: host reimage [13:37:43] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1032034 (https://phabricator.wikimedia.org/T364921) (owner: 10Eevans) [13:38:08] (03CR) 10Brouberol: [C:03+1] "Looks good!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032486 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [13:38:24] Lucas_WMDE: would you like me to pull in your changes? [13:38:31] or handoff? 
[13:38:36] either works for me ^^ [13:38:47] (03Merged) 10jenkins-bot: Make EntitySchemaValue::getArrayValue() match EntityIdValue [extensions/EntitySchema] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032429 (https://phabricator.wikimedia.org/T362955) (owner: 10Lucas Werkmeister (WMDE)) [13:38:48] okay, I'll go ahead and do it for the practice [13:38:53] sounds good, thanks! [13:38:54] (03Merged) 10jenkins-bot: Make EntitySchemaValue::getArrayValue() match EntityIdValue [extensions/EntitySchema] (wmf/1.43.0-wmf.4) - 10https://gerrit.wikimedia.org/r/1032430 (https://phabricator.wikimedia.org/T362955) (owner: 10Lucas Werkmeister (WMDE)) [13:40:27] Lucas_WMDE: syncing at the same time, assuming that's fine since different versions and one is no longer live [13:40:55] yeah [13:41:26] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for sg912 - https://phabricator.wikimedia.org/T365118#9804655 (10Eevans) [13:42:00] (03CR) 10Jdlrobson: [C:03+1] Fix exclude list for dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032398 (https://phabricator.wikimedia.org/T365084) (owner: 10Mabualruz) [13:43:53] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for sg912 - https://phabricator.wikimedia.org/T365118#9804665 (10Eevans) @KOfori as group approver for cassandra-staging-devs...do you? :) [13:44:20] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031599 [13:44:57] (03PS1) 10TChin: datasets-config-next: Change readiness_probe [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032491 (https://phabricator.wikimedia.org/T357434) [13:45:38] Jdlrobson: did you merge a skin update for wmf.4? looks like we pull that down, wmf.4 is no longer live, so we'll go ahead with sync. But note if there's a rollback it will be live now, is that ok? 
[13:46:04] (03PS69) 10AOkoth: prometheus: puppetise sql_exporter [puppet] - 10https://gerrit.wikimedia.org/r/945872 (https://phabricator.wikimedia.org/T310822) [13:46:04] (03PS1) 10AOkoth: vrts: upgrade via python script [puppet] - 10https://gerrit.wikimedia.org/r/1032492 [13:47:32] (03PS1) 10Marostegui: Revert "es1024: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1032432 [13:47:53] !log jsn@deploy1002 Started scap: Backport for [[gerrit:1032429|Make EntitySchemaValue::getArrayValue() match EntityIdValue (T362955 T362001)]], [[gerrit:1032430|Make EntitySchemaValue::getArrayValue() match EntityIdValue (T362955 T362001)]] [13:47:59] T362955: Add support for searching EntitySchema values by ID - https://phabricator.wikimedia.org/T362955 [13:47:59] T362001: [ES-M2]: Updating EntitySchema to make use of the new mechanism - https://phabricator.wikimedia.org/T362001 [13:48:26] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host es1024.eqiad.wmnet with OS bookworm [13:48:54] (03CR) 10Marostegui: [C:03+2] Revert "es1024: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1032432 (owner: 10Marostegui) [13:48:56] (03PS1) 10Ladsgroup: Stop writing to the old columns of pagelinks in s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032493 (https://phabricator.wikimedia.org/T352010) [13:49:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1024 (re)pooling @ 10%: After reimage', diff saved to https://phabricator.wikimedia.org/P62505 and previous config saved to /var/cache/conftool/dbconfig/20240516-134918-root.json [13:49:20] (03PS4) 10Jdrewniak: Fix exclude list for dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032398 (https://phabricator.wikimedia.org/T365084) (owner: 10Mabualruz) [13:50:23] !log jsn@deploy1002 jsn and lucaswerkmeister-wmde: Backport for [[gerrit:1032429|Make EntitySchemaValue::getArrayValue() match EntityIdValue (T362955 T362001)]], [[gerrit:1032430|Make 
EntitySchemaValue::getArrayValue() match EntityIdValue (T362955 T362001)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:50:54] Lucas_WMDE: please test [13:51:01] I don’t think I can test it, there’s an unrelated issue with the API I wanted to use [13:51:07] IMHO it’s low-risk enough to just sync it [13:51:11] okay, proceeding [13:51:14] !log jsn@deploy1002 jsn and lucaswerkmeister-wmde: Continuing with sync [13:51:19] thanks! [13:52:44] (03PS1) 10Effie Mouzeli: memcached: run as user memcache on mc-gp2003 [puppet] - 10https://gerrit.wikimedia.org/r/1032495 (https://phabricator.wikimedia.org/T273950) [13:53:04] (03CR) 10CI reject: [V:04-1] memcached: run as user memcache on mc-gp2003 [puppet] - 10https://gerrit.wikimedia.org/r/1032495 (https://phabricator.wikimedia.org/T273950) (owner: 10Effie Mouzeli) [13:54:09] (03PS2) 10Effie Mouzeli: memcached: run as user memcache on mc-gp2003 [puppet] - 10https://gerrit.wikimedia.org/r/1032495 (https://phabricator.wikimedia.org/T273950) [13:54:30] jouncebot: next [13:54:30] In 2 hour(s) and 5 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240516T1600) [13:54:36] I might run a maintenance script after this window then [13:57:37] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 37 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:57:41] (03CR) 10Vgutierrez: [C:03+2] team-traffic: Add runbook link to LVSRealserverMSS alert [alerts] - 10https://gerrit.wikimedia.org/r/1030057 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:58:10] 07sre-alert-triage, 10Data-Platform-SRE (2024.05.06 - 2024.05.26): Alert in need of triage: PybalBackendDown (instance elastic2090:0) - https://phabricator.wikimedia.org/T364528#9804778 (10bking) When I brought this 
host online a few weeks back, I accidentally added it to the psi pool. I've since [[ https://ge... [13:59:10] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2176.codfw.wmnet with OS bookworm [14:00:25] FIRING: SystemdUnitFailed: ceph-0e6a4c4c-138b-11ef-b973-bc97e1bb7c18@mon.moss-be1002.service on moss-be1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:01:39] (03PS1) 10Hnowlan: trafficserver: migrate 5% of traffic to commons [puppet] - 10https://gerrit.wikimedia.org/r/1032497 (https://phabricator.wikimedia.org/T362323) [14:02:37] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 33 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:03:02] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:03:33] !log depool, restart swift-proxy, repool ms-fe1010 as ~12% connection failures reported by envoy since late 14th May T360913 [14:03:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:43] T360913: Swift proxy server misbehaviour (no longer calling `accept`?) 
- https://phabricator.wikimedia.org/T360913 [14:04:05] !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1032429|Make EntitySchemaValue::getArrayValue() match EntityIdValue (T362955 T362001)]], [[gerrit:1032430|Make EntitySchemaValue::getArrayValue() match EntityIdValue (T362955 T362001)]] (duration: 16m 11s) [14:04:09] (03CR) 10Alexandros Kosiaris: [C:03+1] trafficserver: migrate 5% of traffic to commons [puppet] - 10https://gerrit.wikimedia.org/r/1032497 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [14:04:15] T362955: Add support for searching EntitySchema values by ID - https://phabricator.wikimedia.org/T362955 [14:04:18] T362001: [ES-M2]: Updating EntitySchema to make use of the new mechanism - https://phabricator.wikimedia.org/T362001 [14:04:24] \o/ [14:04:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1024 (re)pooling @ 25%: After reimage', diff saved to https://phabricator.wikimedia.org/P62506 and previous config saved to /var/cache/conftool/dbconfig/20240516-140426-root.json [14:04:31] thanks for deploying JSherman! [14:04:32] (03CR) 10Effie Mouzeli: [C:03+1] trafficserver: migrate 5% of traffic to commons [puppet] - 10https://gerrit.wikimedia.org/r/1032497 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [14:04:44] (03PS2) 10Btullis: Move dumps::generation::worker::dumper_misc_crons_only role [puppet] - 10https://gerrit.wikimedia.org/r/1029220 (https://phabricator.wikimedia.org/T325228) [14:04:45] Lucas_WMDE: you should be good to go! [14:04:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 1%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62507 and previous config saved to /var/cache/conftool/dbconfig/20240516-140451-arnaudb.json [14:05:19] Lucas_WMDE: no problem! 
[14:05:25] FIRING: [2x] SystemdUnitFailed: ceph-0e6a4c4c-138b-11ef-b973-bc97e1bb7c18@mon.moss-be1002.service on moss-be1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:06:16] (03CR) 10Jelto: "looks mostly good but I think there will be some unwanted Monitoring::Service for the new hosts with the current code. See inline comment" [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [14:06:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T364290 db2174', diff saved to https://phabricator.wikimedia.org/P62508 and previous config saved to /var/cache/conftool/dbconfig/20240516-140620-arnaudb.json [14:06:24] T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290 [14:07:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2174.codfw.wmnet with reason: reimage [14:07:23] jouncebot: now [14:07:23] No deployments scheduled for the next 1 hour(s) and 52 minute(s) [14:07:30] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2174.codfw.wmnet with reason: reimage [14:08:17] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2174.codfw.wmnet with OS bookworm [14:08:58] (03PS3) 10Btullis: Move dumps::generation::worker::dumper_misc_crons_only role [puppet] - 10https://gerrit.wikimedia.org/r/1029220 (https://phabricator.wikimedia.org/T325228) [14:09:19] !log START lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki enwiki --current --all --start '["76318767"]' 2>&1 | tee -a ~/T315510-enwiki-5; date [14:09:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:21] there is some migrateLinksTable maintenance script that caused roughly 2 
millions of logs for zhwiki :) [14:10:48] hashar: already taken care of [14:10:58] Lucas_WMDE: tell me once you're done [14:11:03] yeah it has vanished :) [14:11:14] (if you're deploying) [14:11:22] I’m not deploying, just running a maintenance script [14:11:25] (which might take days) [14:11:34] as far as I’m concerned you’re good to go [14:11:42] awesome [14:11:46] (03CR) 10Ladsgroup: [C:03+2] Stop writing to the old columns of pagelinks in s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032493 (https://phabricator.wikimedia.org/T352010) (owner: 10Ladsgroup) [14:11:55] I might want to try out something on mwdebug at some point but not at the moment [14:12:30] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032493 (https://phabricator.wikimedia.org/T352010) (owner: 10Ladsgroup) [14:12:46] (03Merged) 10jenkins-bot: Stop writing to the old columns of pagelinks in s6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032493 (https://phabricator.wikimedia.org/T352010) (owner: 10Ladsgroup) [14:12:48] I'll be done soon [14:13:05] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1032493|Stop writing to the old columns of pagelinks in s6 (T352010)]] [14:13:13] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:13:25] (03CR) 10Hnowlan: [C:03+2] trafficserver: migrate 5% of traffic to commons [puppet] - 10https://gerrit.wikimedia.org/r/1032497 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [14:13:58] (03PS2) 10Hnowlan: trafficserver: migrate 5% of commons traffic to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1032497 (https://phabricator.wikimedia.org/T362323) [14:15:39] !log ladsgroup@deploy1002 ladsgroup: Backport for [[gerrit:1032493|Stop writing to the old columns of pagelinks in s6 (T352010)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:15:50] !log 
ladsgroup@deploy1002 ladsgroup: Continuing with sync [14:18:52] (03CR) 10Hnowlan: [V:03+2 C:03+2] trafficserver: migrate 5% of commons traffic to k8s [puppet] - 10https://gerrit.wikimedia.org/r/1032497 (https://phabricator.wikimedia.org/T362323) (owner: 10Hnowlan) [14:19:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1024 (re)pooling @ 50%: After reimage', diff saved to https://phabricator.wikimedia.org/P62509 and previous config saved to /var/cache/conftool/dbconfig/20240516-141932-root.json [14:19:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 2%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62510 and previous config saved to /var/cache/conftool/dbconfig/20240516-141957-arnaudb.json [14:21:04] (03PS6) 10Effie Mouzeli: memcached: add memcache user option [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) [14:23:21] (03PS14) 10EoghanGaffney: lists: Add lists role to list2001 [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) [14:23:41] (03CR) 10CI reject: [V:04-1] lists: Add lists role to list2001 [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [14:25:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2174.codfw.wmnet with reason: host reimage [14:26:06] (03CR) 10Eevans: [C:03+1] cassandra: add data_gateway Cassandra role [puppet] - 10https://gerrit.wikimedia.org/r/1032034 (https://phabricator.wikimedia.org/T364921) (owner: 10Eevans) [14:28:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2174.codfw.wmnet with reason: host reimage [14:28:37] !log migrated 5% of commons traffic to k8s [14:28:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:48] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1032493|Stop writing to the 
old columns of pagelinks in s6 (T352010)]] (duration: 15m 42s) [14:28:51] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:29:51] (03PS3) 10Effie Mouzeli: memcached: run as user memcache on mc-gp2003 [puppet] - 10https://gerrit.wikimedia.org/r/1032495 (https://phabricator.wikimedia.org/T273950) [14:30:25] FIRING: [2x] SystemdUnitFailed: ceph-0e6a4c4c-138b-11ef-b973-bc97e1bb7c18@mon.moss-be1002.service on moss-be1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:30:54] (03CR) 10EoghanGaffney: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [14:34:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1024 (re)pooling @ 75%: After reimage', diff saved to https://phabricator.wikimedia.org/P62511 and previous config saved to /var/cache/conftool/dbconfig/20240516-143439-root.json [14:35:04] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 5%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62512 and previous config saved to /var/cache/conftool/dbconfig/20240516-143503-arnaudb.json [14:35:25] RESOLVED: [2x] SystemdUnitFailed: ceph-0e6a4c4c-138b-11ef-b973-bc97e1bb7c18@mon.moss-be1002.service on moss-be1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:36:16] (03PS7) 10Effie Mouzeli: memcached: add memcache user option [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) [14:37:37] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for xcollazo - https://phabricator.wikimedia.org/T364588#9805033 (10xcollazo) Confirming I can access. 
Thanks @Eevans ! [14:38:02] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:22] (03CR) 10CI reject: [V:04-1] memcached: add memcache user option [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) (owner: 10Effie Mouzeli) [14:39:37] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 38 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:40:40] (03PS1) 10Jsn.sherman: CommonSettings-labs: Correct wgAutoModeratorLiftWingBaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032502 (https://phabricator.wikimedia.org/T364034) [14:41:22] (03CR) 10Hashar: "Adjustments look good but I have to test it :]" [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [14:41:27] jouncebot: now [14:41:27] No deployments scheduled for the next 1 hour(s) and 18 minute(s) [14:42:43] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install krb1002 - https://phabricator.wikimedia.org/T365165 (10RobH) 03NEW [14:43:02] !log dzahn@cumin2002 START - Cookbook sre.hosts.downtime for 3:00:00 on contint2002.wikimedia.org with reason: T334517 [14:43:06] T334517: upgrade contint servers to bullseye - https://phabricator.wikimedia.org/T334517 [14:43:13] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install krb1002 - https://phabricator.wikimedia.org/T365165#9805094 (10RobH) [14:43:14] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2489/co" [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [14:43:18] !log dzahn@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on contint2002.wikimedia.org with reason: T334517 [14:43:46] I was wondering if I could followup to my labs config change after Amir is done? I was not able to test until after the backport window ended. https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1032502 [14:44:37] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 28 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [14:45:01] JSherman: labs-only backports are quick so it should be ok [14:45:36] dancy: thank you [14:47:23] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.06 - 2024.05.26), 13Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9805121 (10xcollazo) > Xabriel, what do you think? Is this workable to try to get the host rol... 
[14:47:50] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2174.codfw.wmnet with OS bookworm [14:48:39] (03PS8) 10Effie Mouzeli: memcached: add memcache user option [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) [14:48:55] (03PS4) 10Effie Mouzeli: memcached: run as user memcache on mc-gp2003 [puppet] - 10https://gerrit.wikimedia.org/r/1032495 (https://phabricator.wikimedia.org/T273950) [14:49:02] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations: Q#:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167 (10RobH) 03NEW [14:49:43] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations: Q#:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9805145 (10RobH) [14:49:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'es1024 (re)pooling @ 100%: After reimage', diff saved to https://phabricator.wikimedia.org/P62513 and previous config saved to /var/cache/conftool/dbconfig/20240516-144945-root.json [14:49:46] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install sretest2001 - https://phabricator.wikimedia.org/T365167#9805146 (10RobH) [14:50:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62514 and previous config saved to /var/cache/conftool/dbconfig/20240516-145009-arnaudb.json [14:53:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 1%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62515 and previous config saved to /var/cache/conftool/dbconfig/20240516-145330-arnaudb.json [14:54:31] (03CR) 10Hashar: Allow users to recheck tests in checkers (031 comment) [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [14:55:52] (03PS22) 10Paladox: Allow users to recheck tests in checkers 
[software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [14:56:09] (03CR) 10Paladox: Allow users to recheck tests in checkers (031 comment) [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [14:56:27] (03CR) 10CI reject: [V:04-1] Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) (owner: 10Paladox) [14:56:54] (03PS23) 10Paladox: Allow users to recheck tests in checkers [software/gerrit] (deploy/wmf/stable-3.8) - 10https://gerrit.wikimedia.org/r/1026135 (https://phabricator.wikimedia.org/T363918) [14:57:08] Amir1: I see the backport you were running finished a while back; am I good to run a labs-only change? [14:57:32] JSherman: sure, you can just rebase it, no need to backport [14:57:57] thanks! [14:57:59] (03CR) 10Jsn.sherman: [C:03+2] CommonSettings-labs: Correct wgAutoModeratorLiftWingBaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032502 (https://phabricator.wikimedia.org/T364034) (owner: 10Jsn.sherman) [14:58:04] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [14:58:39] (03Merged) 10jenkins-bot: CommonSettings-labs: Correct wgAutoModeratorLiftWingBaseUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032502 (https://phabricator.wikimedia.org/T364034) (owner: 10Jsn.sherman) [14:59:18] done [14:59:20] 06SRE, 06Infrastructure-Foundations, 10netops: Switch BGP (EVPN) topology between rows/spines at core sites - https://phabricator.wikimedia.org/T365169 (10cmooney) 03NEW p:05Triage→03Low [15:00:09] (03PS5) 10JHathaway: postfix: 
prometheus ops config for mx-out boxes [puppet] - 10https://gerrit.wikimedia.org/r/1019116 (https://phabricator.wikimedia.org/T325395) [15:00:43] (03CR) 10JHathaway: postfix: prometheus ops config for mx-out boxes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1019116 (https://phabricator.wikimedia.org/T325395) (owner: 10JHathaway) [15:01:31] (03PS9) 10Effie Mouzeli: memcached: add memcache user option [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) [15:03:02] FIRING: [2x] JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:03:15] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host contint2002.wikimedia.org with OS bullseye [15:03:25] (03PS1) 10Cathal Mooney: Set AS number for BGP EVPN devices globally at site level [homer/public] - 10https://gerrit.wikimedia.org/r/1032505 (https://phabricator.wikimedia.org/T365169) [15:04:02] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Codfw row C/D switch installation & configuration - https://phabricator.wikimedia.org/T364095#9805245 (10cmooney) [15:04:06] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Switch BGP (EVPN) topology between rows/spines at core sites - https://phabricator.wikimedia.org/T365169#9805246 (10cmooney) [15:04:51] (03CR) 10Vgutierrez: [C:03+1] cache:haproxy: use %HP in log-format to log absolute-form reqs [puppet] - 10https://gerrit.wikimedia.org/r/1032410 (https://phabricator.wikimedia.org/T365117) (owner: 10Fabfur) [15:05:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62516 and previous config saved to /var/cache/conftool/dbconfig/20240516-150515-arnaudb.json [15:05:35] (03CR) 
10Ssingh: [C:03+1] "$deityspeed" [puppet] - 10https://gerrit.wikimedia.org/r/1032410 (https://phabricator.wikimedia.org/T365117) (owner: 10Fabfur) [15:06:57] (03CR) 10Fabfur: [C:03+2] cache:haproxy: use %HP in log-format to log absolute-form reqs [puppet] - 10https://gerrit.wikimedia.org/r/1032410 (https://phabricator.wikimedia.org/T365117) (owner: 10Fabfur) [15:08:31] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Switch BGP (EVPN) topology between rows/spines at core sites - https://phabricator.wikimedia.org/T365169#9805265 (10cmooney) [15:08:37] (03PS1) 10Elukey: profile::amd_gpu: refactor configurations for k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/1032506 (https://phabricator.wikimedia.org/T363191) [15:08:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 2%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62517 and previous config saved to /var/cache/conftool/dbconfig/20240516-150837-arnaudb.json [15:08:54] (03PS1) 10Ahmon Dancy: docker-gc: Use 1.3.0 image [puppet] - 10https://gerrit.wikimedia.org/r/1032507 (https://phabricator.wikimedia.org/T350478) [15:10:41] (03CR) 10JHathaway: [C:03+1] "looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1020191 (https://phabricator.wikimedia.org/T345740) (owner: 10Effie Mouzeli) [15:12:35] (03PS1) 10Fabfur: Revert "cache:haproxy: use %HP in log-format to log absolute-form reqs" [puppet] - 10https://gerrit.wikimedia.org/r/1032433 [15:12:44] (03PS3) 10Dzahn: ci: set puppet7 at role level [puppet] - 10https://gerrit.wikimedia.org/r/1032023 (https://phabricator.wikimedia.org/T334517) [15:13:05] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 3 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/1032506 (https://phabricator.wikimedia.org/T363191) (owner: 10Elukey) [15:13:14] (03CR) 10Dzahn: [C:03+2] ci: set puppet7 at role level [puppet] - 
10https://gerrit.wikimedia.org/r/1032023 (https://phabricator.wikimedia.org/T334517) (owner: 10Dzahn) [15:13:27] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Switch BGP (EVPN) topology between rows/spines at core sites - https://phabricator.wikimedia.org/T365169#9805297 (10cmooney) [15:13:39] (03PS2) 10Elukey: profile::amd_gpu: refactor configurations for k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/1032506 (https://phabricator.wikimedia.org/T363191) [15:14:13] (03CR) 10JHathaway: [C:03+1] "great!" [puppet] - 10https://gerrit.wikimedia.org/r/1002387 (https://phabricator.wikimedia.org/T352640) (owner: 10Filippo Giunchedi) [15:14:35] (03CR) 10Dzahn: [C:03+1] docker-gc: Use 1.3.0 image [puppet] - 10https://gerrit.wikimedia.org/r/1032507 (https://phabricator.wikimedia.org/T350478) (owner: 10Ahmon Dancy) [15:15:52] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Switch BGP (EVPN) topology between rows/spines at core sites - https://phabricator.wikimedia.org/T365169#9805309 (10cmooney) p:05Low→03Medium [15:15:59] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 4 NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/1032506 (https://phabricator.wikimedia.org/T363191) (owner: 10Elukey) [15:16:04] (03CR) 10Fabfur: [C:03+2] Revert "cache:haproxy: use %HP in log-format to log absolute-form reqs" [puppet] - 10https://gerrit.wikimedia.org/r/1032433 (owner: 10Fabfur) [15:16:37] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 40 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:17:09] (03PS3) 10Elukey: profile::amd_gpu: refactor configurations for k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/1032506 
(https://phabricator.wikimedia.org/T363191) [15:19:26] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 6 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1032506 (https://phabricator.wikimedia.org/T363191) (owner: 10Elukey) [15:20:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62518 and previous config saved to /var/cache/conftool/dbconfig/20240516-152021-arnaudb.json [15:21:37] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 27 probes of 801 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:22:25] (03PS4) 10Elukey: profile::amd_gpu: refactor configurations for k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/1032506 (https://phabricator.wikimedia.org/T363191) [15:23:02] (03CR) 10JHathaway: memcached: add memcache user option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) (owner: 10Effie Mouzeli) [15:23:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 5%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62519 and previous config saved to /var/cache/conftool/dbconfig/20240516-152343-arnaudb.json [15:24:35] (03CR) 10Elukey: [V:03+1] "PCC SUCCESS (DIFF 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2495/console" [puppet] - 10https://gerrit.wikimedia.org/r/1032506 (https://phabricator.wikimedia.org/T363191) (owner: 10Elukey) [15:24:50] !log dzahn@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host contint2002.wikimedia.org with OS bullseye [15:25:02] (03CR) 10Elukey: profile::amd_gpu: refactor 
configurations for k8s nodes [puppet] - 10https://gerrit.wikimedia.org/r/1032506 (https://phabricator.wikimedia.org/T363191) (owner: 10Elukey) [15:25:35] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host contint2002.wikimedia.org with OS bullseye [15:31:17] (03PS10) 10Effie Mouzeli: memcached: add memcache user option [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) [15:31:59] (03CR) 10Eevans: [C:03+2] "This is awesome, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1030175 (owner: 10Eevans) [15:32:56] (03CR) 10Effie Mouzeli: memcached: add memcache user option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) (owner: 10Effie Mouzeli) [15:33:20] (03PS1) 10Ilias Sarantopoulos: ml-services: increase viwiki-reverted replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032517 (https://phabricator.wikimedia.org/T362503) [15:33:24] (03PS11) 10Effie Mouzeli: memcached: add memcache user option [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) [15:33:34] (03CR) 10CI reject: [V:04-1] ml-services: increase viwiki-reverted replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032517 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [15:34:04] (03PS2) 10Jdlrobson: [beta] Disable font size configuration on talk pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032088 (https://phabricator.wikimedia.org/T364887) [15:35:24] (03PS2) 10Ilias Sarantopoulos: ml-services: increase viwiki-reverted replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032517 (https://phabricator.wikimedia.org/T362503) [15:35:24] (03PS5) 10Effie Mouzeli: memcached: run as user memcache on mc-gp2003 [puppet] - 10https://gerrit.wikimedia.org/r/1032495 (https://phabricator.wikimedia.org/T273950) [15:35:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 
'db2176 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62520 and previous config saved to /var/cache/conftool/dbconfig/20240516-153527-arnaudb.json [15:35:35] (03CR) 10CI reject: [V:04-1] ml-services: increase viwiki-reverted replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032517 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [15:36:43] (03PS1) 10Scott French: push-notifications: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032519 (https://phabricator.wikimedia.org/T362978) [15:36:52] (03CR) 10CI reject: [V:04-1] push-notifications: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032519 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [15:37:21] (03CR) 10JHathaway: [C:03+1] memcached: add memcache user option (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1026609 (https://phabricator.wikimedia.org/T273950) (owner: 10Effie Mouzeli) [15:37:44] (03PS2) 10Cathal Mooney: Set AS number for BGP EVPN devices globally at site level [homer/public] - 10https://gerrit.wikimedia.org/r/1032505 (https://phabricator.wikimedia.org/T365169) [15:38:47] (03CR) 10Jdrewniak: [C:03+2] [beta] Disable font size configuration on talk pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032088 (https://phabricator.wikimedia.org/T364887) (owner: 10Jdlrobson) [15:38:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62521 and previous config saved to /var/cache/conftool/dbconfig/20240516-153850-arnaudb.json [15:39:41] (03Merged) 10jenkins-bot: [beta] Disable font size configuration on talk pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032088 (https://phabricator.wikimedia.org/T364887) (owner: 10Jdlrobson) [15:40:24] (03PS1) 10Cathal Mooney: Announce Wikidough Anycast 
ranges to internet from magru [homer/public] - 10https://gerrit.wikimedia.org/r/1032520 (https://phabricator.wikimedia.org/T362421) [15:42:55] (03CR) 10Ssingh: [C:03+1] Announce Wikidough Anycast ranges to internet from magru [homer/public] - 10https://gerrit.wikimedia.org/r/1032520 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [15:43:01] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031601 [15:43:13] (03Abandoned) 10Ilias Sarantopoulos: ml-services: increase viwiki-reverted replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032517 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [15:43:14] (03CR) 10CI reject: [V:04-1] mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031601 (owner: 10PipelineBot) [15:44:19] (03CR) 10Ssingh: [C:03+1] "For clarity: we are doing this before we start announcing ns2 from magru and as a test to make sure everything is fine with the setup." 
[homer/public] - 10https://gerrit.wikimedia.org/r/1032520 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [15:45:08] !log systemctl restart mariadb@s4.service on clouddb1015 (using too much RAM) T365164 [15:45:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:45:19] T365164: [wikireplicas] clouddb* free memory decreases over time - https://phabricator.wikimedia.org/T365164 [15:47:26] (03CR) 10TChin: [C:03+2] datasets-config: Rename next pods to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032487 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [15:47:34] (03CR) 10TChin: [C:03+2] datasets-config: Change mesh port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032486 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [15:47:34] (03CR) 10CI reject: [V:04-1] datasets-config: Rename next pods to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032487 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [15:50:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2176 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62522 and previous config saved to /var/cache/conftool/dbconfig/20240516-155034-arnaudb.json [15:53:09] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2498/co" [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [15:53:17] (03CR) 10Ssingh: "not urgent, just so that we do it" [homer/public] - 10https://gerrit.wikimedia.org/r/1032522 (owner: 10Ssingh) [15:53:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62523 and previous config saved to /var/cache/conftool/dbconfig/20240516-155356-arnaudb.json [15:56:28] (03PS1) 
10JMeybohm: zotero: Update various modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032523 (https://phabricator.wikimedia.org/T362978) [15:56:40] (03CR) 10CI reject: [V:04-1] zotero: Update various modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032523 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [15:56:50] (03PS1) 10RLazarus: tegola-vector-tiles: Dependency updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032524 (https://phabricator.wikimedia.org/T362978) [15:56:59] (03CR) 10CI reject: [V:04-1] tegola-vector-tiles: Dependency updates [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032524 (https://phabricator.wikimedia.org/T362978) (owner: 10RLazarus) [15:58:58] (03CR) 10Ryan Kemper: [C:03+2] CirrusBackendErrorRateTooHigh: soften threshold [alerts] - 10https://gerrit.wikimedia.org/r/1031543 (owner: 10Ryan Kemper) [15:59:15] (03CR) 10Mabualruz: [C:03+1] Fix exclude list for dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032398 (https://phabricator.wikimedia.org/T365084) (owner: 10Mabualruz) [16:00:05] jhathaway and rzl: #bothumor My software never has bugs. It just develops random features. Rise for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240516T1600). [16:00:05] No Gerrit patches in the queue for this window AFAICS. [16:00:15] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.06 - 2024.05.26), 13Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9805537 (10BTullis) I have disabled the timers on snapshot1008 with the following. ` btullis@s... 
[16:00:43] (03Merged) 10jenkins-bot: CirrusBackendErrorRateTooHigh: soften threshold [alerts] - 10https://gerrit.wikimedia.org/r/1031543 (owner: 10Ryan Kemper) [16:01:09] (03Restored) 10Ilias Sarantopoulos: ml-services: increase viwiki-reverted replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032517 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [16:01:15] (03PS1) 10Clément Goubert: miscweb: Update various modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032525 (https://phabricator.wikimedia.org/T362978) [16:03:20] (03PS3) 10Ilias Sarantopoulos: ml-services: increase revscoring-reverted replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032517 (https://phabricator.wikimedia.org/T362503) [16:03:29] (03CR) 10CI reject: [V:04-1] ml-services: increase revscoring-reverted replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032517 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [16:05:45] (03PS1) 10Dzahn: gerrit: remove NRPE process monitoring [puppet] - 10https://gerrit.wikimedia.org/r/1032526 [16:06:32] 06SRE, 06Data-Engineering, 10Dumps-Generation, 10Data-Platform-SRE (2024.05.06 - 2024.05.26), 13Patch-For-Review: Migrate Dumps Snapshot hosts from Buster to Bullseye - https://phabricator.wikimedia.org/T325228#9805593 (10BTullis) I stopped the timers with: ` btullis@snapshot1008:~$ for t in $(cat timers... 
[16:06:51] (03PS2) 10JMeybohm: zotero: Ensure containers have a securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032523 (https://phabricator.wikimedia.org/T362978) [16:07:01] (03CR) 10CI reject: [V:04-1] zotero: Ensure containers have a securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032523 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [16:07:16] (03CR) 10Cathal Mooney: [C:03+2] Announce Wikidough Anycast ranges to internet from magru [homer/public] - 10https://gerrit.wikimedia.org/r/1032520 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [16:07:47] (03Merged) 10jenkins-bot: Announce Wikidough Anycast ranges to internet from magru [homer/public] - 10https://gerrit.wikimedia.org/r/1032520 (https://phabricator.wikimedia.org/T362421) (owner: 10Cathal Mooney) [16:08:27] (03PS4) 10Ilias Sarantopoulos: ml-services: increase revscoring-reverted replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032517 (https://phabricator.wikimedia.org/T362503) [16:08:37] (03CR) 10CI reject: [V:04-1] ml-services: increase revscoring-reverted replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032517 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [16:09:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62525 and previous config saved to /var/cache/conftool/dbconfig/20240516-160902-arnaudb.json [16:12:35] (03PS1) 10Elukey: ml-services: update autoscaling settings for editquality reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032531 (https://phabricator.wikimedia.org/T362503) [16:12:45] (03CR) 10CI reject: [V:04-1] ml-services: update autoscaling settings for editquality reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032531 (https://phabricator.wikimedia.org/T362503) (owner: 10Elukey) [16:12:45] !log announcing wikidough 
anycast ranges to Internet (transit) in magru T362421 [16:12:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:50] T362421: magru network setup - https://phabricator.wikimedia.org/T362421 [16:13:46] (03CR) 10TChin: [C:03+2] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032486 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [16:13:52] hashar: o/ around? [16:13:57] (03CR) 10CI reject: [V:04-1] datasets-config: Change mesh port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032486 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [16:14:36] or dancy :) [16:14:40] I'm around. [16:14:43] (03CR) 10TChin: [C:03+2] "seems like there's some transient linting problems? I dunno" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032486 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [16:14:52] (03CR) 10CI reject: [V:04-1] datasets-config: Change mesh port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032486 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [16:15:01] If you're having CI troubles with helm-lint, that's my fault.. Investigating [16:15:10] ah yes I was about to ask! Thanks :) [16:15:32] * tchin Thanks for the heads up! [16:16:02] Sorry about that folks. [16:17:11] np! It happens, thanks for working on it! [16:17:26] (03Abandoned) 10Elukey: ml-services: update autoscaling settings for editquality reverted [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032531 (https://phabricator.wikimedia.org/T362503) (owner: 10Elukey) [16:21:13] elukey: Ready for re-test [16:22:26] (03CR) 10Ahmon Dancy: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032486 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [16:22:52] nope.. still not there yet.
[16:24:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62526 and previous config saved to /var/cache/conftool/dbconfig/20240516-162408-arnaudb.json [16:30:01] (03CR) 10Ahmon Dancy: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032486 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [16:30:12] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['contint2002'] [16:31:18] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['contint2002'] [16:31:30] (03CR) 10Ahmon Dancy: [C:03+2] datasets-config: Change mesh port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032486 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [16:31:54] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['contint2002'] [16:32:13] (03PS2) 10Scott French: push-notifications: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032519 (https://phabricator.wikimedia.org/T362978) [16:32:23] (03Merged) 10jenkins-bot: datasets-config: Change mesh port [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032486 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [16:32:32] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['contint2002'] [16:33:48] 06SRE, 06Infrastructure-Foundations, 10netops: magru network setup - https://phabricator.wikimedia.org/T362421#9805776 (10ssingh) Thanks to @cmooney for rolling the above out. For further context, we (Traffic and netops) decided to try out the anycast range in magru for the Wikidough service before doing it... [16:35:06] elukey and tchin: Things are working again. Lemme know if you have further troubles. [16:35:27] Thanks! 
[16:35:33] 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 10Sustainability (Incident Followup): Implement (or refactor) a script to move slaves when the master is not available - https://phabricator.wikimedia.org/T196366#9805812 (10Aklapper) a:05Ladsgroup→03None @Ladsgroup: Removing task assignee as this open... [16:35:44] (03CR) 10TChin: [C:03+2] "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032487 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [16:36:36] (03Merged) 10jenkins-bot: datasets-config: Rename next pods to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032487 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [16:37:06] 06SRE, 10Wikimedia-Mailing-lists: wikimediacz-l does not hold all posts for moderation - https://phabricator.wikimedia.org/T298729#9805814 (10Aklapper) a:05Ladsgroup→03None @Ladsgroup: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assig... 
[16:37:11] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['contint2002'] [16:37:20] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['contint2002'] [16:37:30] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['contint2002'] [16:37:40] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['contint2002'] [16:38:24] !log robh@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['contint2002'] [16:39:12] !log robh@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['contint2002'] [16:39:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2174 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62528 and previous config saved to /var/cache/conftool/dbconfig/20240516-163915-arnaudb.json [16:40:08] !log dzahn@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host contint2002.wikimedia.org with OS bullseye [16:40:55] 06SRE, 10Wikimedia-Mailing-lists: wikimediacz-l does not hold all posts for moderation - https://phabricator.wikimedia.org/T298729#9805877 (10Dzahn) a:03Urbanecm Well, the last question was to Urbanecm anyways, so re-assigning I guess. [16:41:44] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. 
[16:41:53] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host contint2002.wikimedia.org with OS bullseye [16:45:42] (03PS1) 10Cwhite: logstash: translate k8s audit logs to ECS [puppet] - 10https://gerrit.wikimedia.org/r/1031602 (https://phabricator.wikimedia.org/T290020) [16:49:38] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:51:45] 06SRE, 06Infrastructure-Foundations, 10netops: magru network setup - https://phabricator.wikimedia.org/T362421#9806067 (10cmooney) And fwiw announcement looks good, all 3 of our transits are learning it ok, and I see it on other carriers from those sources as well. We also see live requests on the doh servers. [16:51:56] 06SRE, 06Infrastructure-Foundations, 10Mail: Evaluate whether and how to route abuse@ emails to Legal - https://phabricator.wikimedia.org/T302549#9806065 (10Aklapper) a:05RLazarus→03None @RLazarus: Removing task assignee as this open task has been assigned for more than two years - see the email sent to... [16:54:08] 06SRE, 07Wikimedia-Incident: upstream connect error or disconnect/reset before headers. reset reason: overflow - https://phabricator.wikimedia.org/T301505#9806068 (10Aklapper) a:05RLazarus→03None @RLazarus: Removing task assignee as this open task has been assigned for more than two years - see the email s... 
[16:54:21] (03PS1) 10BryanDavis: developer-portal: Bump container to 2024-04-25-122307-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032535 [16:54:57] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Switch BGP (EVPN) topology between rows/spines at core sites - https://phabricator.wikimedia.org/T365169#9806104 (10cmooney) [16:55:50] (03CR) 10BryanDavis: [C:03+2] developer-portal: Bump container to 2024-04-25-122307-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032535 (owner: 10BryanDavis) [16:56:29] 10SRE-swift-storage: Monitoring (?+alerting) for Swift capacity - https://phabricator.wikimedia.org/T294019#9806121 (10Aklapper) a:05MatthewVernon→03None @MatthewVernon: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assignees on 2024-04-15... [16:56:31] 10SRE-swift-storage: Storage request for datasets published by research team - https://phabricator.wikimedia.org/T294380#9806119 (10Aklapper) a:05MatthewVernon→03None @MatthewVernon: Removing task assignee as this open task has been assigned for more than two years - see the email sent to all task assignees... [16:56:36] (03Merged) 10jenkins-bot: developer-portal: Bump container to 2024-04-25-122307-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032535 (owner: 10BryanDavis) [16:57:00] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop test cluster: Roll restart of jvm daemons for openjdk upgrade. [16:57:35] !log dzahn@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host contint2002.wikimedia.org with OS bullseye [16:57:45] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade. 
[16:58:58] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host contint2002.wikimedia.org with OS buster [16:59:58] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [17:00:05] bd808: gettimeofday() says it's time for Cloud Services/Technical Documentation weekly deploy (Toolhub, Developer portal, Striker). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240516T1700) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240516T1700) [17:00:11] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance [17:00:13] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:00:28] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [17:00:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T352010)', diff saved to https://phabricator.wikimedia.org/P62529 and previous config saved to /var/cache/conftool/dbconfig/20240516-170035-ladsgroup.json [17:00:51] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [17:00:58] !log bd808@deploy1002 helmfile [staging] START helmfile.d/services/developer-portal: apply [17:01:21] !log bd808@deploy1002 helmfile [staging] DONE helmfile.d/services/developer-portal: apply [17:01:43] (03PS2) 10Cwhite: logstash: reformat k8s audit logs to ECS [puppet] - 10https://gerrit.wikimedia.org/r/1031602 (https://phabricator.wikimedia.org/T290020) [17:01:50] !log bd808@deploy1002 helmfile [codfw] START helmfile.d/services/developer-portal: apply [17:02:14] !log bd808@deploy1002 helmfile [codfw] 
DONE helmfile.d/services/developer-portal: apply [17:02:27] !log bd808@deploy1002 helmfile [eqiad] START helmfile.d/services/developer-portal: apply [17:02:54] !log bd808@deploy1002 helmfile [eqiad] DONE helmfile.d/services/developer-portal: apply [17:03:08] (03CR) 10RLazarus: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032524 (https://phabricator.wikimedia.org/T362978) (owner: 10RLazarus) [17:05:40] (03PS2) 10RLazarus: tegola-vector-tiles: Add securityContext and update dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032524 (https://phabricator.wikimedia.org/T362978) [17:08:12] 14SRE-Sprint-Week-Sustainability-March2023, 06Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Optimise WMF WAN Network Configuration - https://phabricator.wikimedia.org/T297355#9806267 (10Aklapper) a:05cmooney→03None @cmooney: Removing task assignee as this open task has bee... [17:09:07] 06SRE, 06SRE-OnFire, 10observability: Automated uploads of minimal & comprehensible timeseries metrics for statuspage display - https://phabricator.wikimedia.org/T285569#9806277 (10Aklapper) a:05CDanis→03None @CDanis: Removing task assignee as this open task has been assigned for more than two years - se... [17:09:26] 06SRE, 06Traffic-Icebox, 07HTTPS, 13Patch-Needs-Improvement: Provide acme-chief/TLS SNI list support in compile_redirects() - https://phabricator.wikimedia.org/T225096#9806284 (10Aklapper) a:05Vgutierrez→03None @Vgutierrez: Removing task assignee as this open task has been assigned for more than two ye... [17:11:09] 06SRE, 06Infrastructure-Foundations, 10Puppet CI, 13Patch-Needs-Improvement: Hieradata yaml style checking - https://phabricator.wikimedia.org/T236954#9806306 (10Aklapper) a:05jhathaway→03None @jhathaway: Removing task assignee as this open task has been assigned for more than two years - see the email... [17:13:11] (03CR) 10Scott French: [C:03+1] "Got it, thanks! 
Either way sounds good to me, though doing it all in one change (as long as the prerequisites are ready, e.g. private) has" [puppet] - 10https://gerrit.wikimedia.org/r/1032034 (https://phabricator.wikimedia.org/T364921) (owner: 10Eevans) [17:16:27] (03PS1) 10Ryan Kemper: hadoop: remove outdated ref to backup cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/1032537 [17:17:55] (03PS2) 10Ryan Kemper: hadoop: remove outdated ref to backup cluster [cookbooks] - 10https://gerrit.wikimedia.org/r/1032537 [17:26:30] (03CR) 10Scott French: "rebuild" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032523 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [17:26:39] (03CR) 10Scott French: "rebuild" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032525 (https://phabricator.wikimedia.org/T362978) (owner: 10Clément Goubert) [17:27:00] (03CR) 10Scott French: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032523 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [17:27:06] (03CR) 10Scott French: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032525 (https://phabricator.wikimedia.org/T362978) (owner: 10Clément Goubert) [17:28:43] (03PS1) 10DDesouza: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032538 (https://phabricator.wikimedia.org/T219903) [17:29:41] (03CR) 10DDesouza: [C:03+2] miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032538 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [17:32:12] (03Merged) 10jenkins-bot: miscweb(research-landing-page): bump version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032538 (https://phabricator.wikimedia.org/T219903) (owner: 10DDesouza) [17:33:23] !log dani@deploy1002 helmfile [staging] START helmfile.d/services/miscweb: apply [17:33:42] !log dani@deploy1002 helmfile [staging] DONE helmfile.d/services/miscweb: apply 
[17:33:44] !log dani@deploy1002 helmfile [eqiad] START helmfile.d/services/miscweb: apply [17:34:09] !log dani@deploy1002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply [17:34:10] (03CR) 10Scott French: [C:03+1] "LGTM. Thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032523 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [17:34:10] !log dani@deploy1002 helmfile [codfw] START helmfile.d/services/miscweb: apply [17:34:32] !log dani@deploy1002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply [17:41:57] (03CR) 10Ilias Sarantopoulos: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032517 (https://phabricator.wikimedia.org/T362503) (owner: 10Ilias Sarantopoulos) [17:45:59] !log vriley@cumin1002 START - Cookbook sre.hosts.reimage for host kafka-main1006.eqiad.wmnet with OS bullseye [17:46:10] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9806449 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by vriley@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye [17:52:47] !log brennen@deploy1002 Started deploy [phabricator/deployment@7d858df]: test scap deployment with keyholder key misconfigured for T313624 [17:52:50] T313624: scap should have a better error message when it can't find keyholder key - https://phabricator.wikimedia.org/T313624 [17:53:25] !log brennen@deploy1002 Finished deploy [phabricator/deployment@7d858df]: test scap deployment with keyholder key misconfigured for T313624 (duration: 00m 38s) [18:03:02] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [18:04:55] !log dzahn@cumin2002 END (FAIL) - Cookbook 
sre.hosts.reimage (exit_code=99) for host contint2002.wikimedia.org with OS buster [18:13:02] !log cmooney@cumin1002 START - Cookbook sre.hosts.dhcp for host contint2002.wikimedia.org [18:13:34] (03CR) 10Scott French: "Thanks! Wow, that's a lot of releases :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032525 (https://phabricator.wikimedia.org/T362978) (owner: 10Clément Goubert) [18:15:28] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host contint2002.wikimedia.org [18:17:09] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host contint2002.wikimedia.org with OS bullseye [18:22:48] (03CR) 10DCausse: "yes this cookbook is becoming monstruous... 😞" [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) (owner: 10DCausse) [18:23:17] (03PS7) 10DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) [18:23:17] (03PS1) 10DCausse: wdqs: extract categories reload to its own cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1032544 [18:24:39] (03PS1) 10Jsn.sherman: CommonSettings-labs: Correct AutoModeratorLiftWing settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032545 [18:26:33] (03CR) 10Scott French: [C:03+1] "Sounds good. 
Yeah, it seems challenging to anticipate what an eventual systemd unit offered by the package might look like (and how to mak" [puppet] - 10https://gerrit.wikimedia.org/r/1031465 (owner: 10Filippo Giunchedi) [18:30:20] (03PS3) 10Scott French: push-notifications: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032519 (https://phabricator.wikimedia.org/T362978) [18:32:49] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-main1006.eqiad.wmnet with OS bullseye [18:33:07] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops, 13Patch-For-Review: Q3:rack/setup/install dbprov100[56] - https://phabricator.wikimedia.org/T355353#9806592 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by vriley@cumin1002 for host kafka-main1006.eqiad.wmnet with OS bullseye ex... [18:46:13] !log dzahn@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host contint2002.wikimedia.org with OS bullseye [18:58:04] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [18:58:05] !log dzahn@cumin2002 START - Cookbook sre.hosts.reimage for host contint2002.wikimedia.org with OS buster [19:00:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T364299)', diff saved to https://phabricator.wikimedia.org/P62535 and previous config saved to /var/cache/conftool/dbconfig/20240516-190024-marostegui.json [19:00:38] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [19:03:02] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - 
https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:05:11] 14SRE-Sprint-Week-Sustainability-March2023, 06Infrastructure-Foundations, 10netops, 10Sustainability (Incident Followup): Optimise WMF WAN Network Configuration - https://phabricator.wikimedia.org/T297355#9806679 (10cmooney) p:05Medium→03Low Thanks. It is very much something we wish to do but unfortun... [19:15:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P62536 and previous config saved to /var/cache/conftool/dbconfig/20240516-191532-marostegui.json [19:20:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T352010)', diff saved to https://phabricator.wikimedia.org/P62537 and previous config saved to /var/cache/conftool/dbconfig/20240516-192027-ladsgroup.json [19:20:32] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:30:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P62538 and previous config saved to /var/cache/conftool/dbconfig/20240516-193040-marostegui.json [19:35:37] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P62539 and previous config saved to /var/cache/conftool/dbconfig/20240516-193535-ladsgroup.json [19:43:54] 06SRE, 06DC-Ops, 06Infrastructure-Foundations, 06SRE Observability: SCS CPU monitoring issue - https://phabricator.wikimedia.org/T285229#9806799 (10andrea.denisse) Hello, I sent a patch for this on commit [[ https://gerrit.wikimedia.org/r/plugins/gitiles/operations/software/librenms/+/refs/heads/upstream-2... [19:44:36] (03PS1) 10C. 
Scott Ananian: [JsonCodec, ParserCache] Improve debugging of serializability failures [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032435 (https://phabricator.wikimedia.org/T365036) [19:45:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T364299)', diff saved to https://phabricator.wikimedia.org/P62540 and previous config saved to /var/cache/conftool/dbconfig/20240516-194548-marostegui.json [19:45:52] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [19:45:55] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [19:46:05] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance [19:46:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2166 (T364299)', diff saved to https://phabricator.wikimedia.org/P62541 and previous config saved to /var/cache/conftool/dbconfig/20240516-194613-marostegui.json [19:47:36] (03CR) 10Isabelle Hurbain-Palatin: [C:03+1] [JsonCodec, ParserCache] Improve debugging of serializability failures [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032435 (https://phabricator.wikimedia.org/T365036) (owner: 10C. 
Scott Ananian) [19:47:49] here for the backports [19:50:45] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P62542 and previous config saved to /var/cache/conftool/dbconfig/20240516-195044-ladsgroup.json [19:50:57] (03CR) 10Eevans: [C:03+2] cassandra: add data_gateway Cassandra role [puppet] - 10https://gerrit.wikimedia.org/r/1032034 (https://phabricator.wikimedia.org/T364921) (owner: 10Eevans) [19:54:22] 06SRE, 10Cloud Services Proposals, 06Infrastructure-Foundations, 10netops: Separate WMCS control and management plane traffic - https://phabricator.wikimedia.org/T314847#9806824 (10cmooney) 05Open→03Resolved This has been implemented and the new vlan setup is recorded [[ https://wikitech.wikimedia.... [19:55:20] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host contint2002.wikimedia.org with OS bullseye [19:58:05] Also here for backports [19:58:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T352010)', diff saved to https://phabricator.wikimedia.org/P62543 and previous config saved to /var/cache/conftool/dbconfig/20240516-195817-ladsgroup.json [19:58:22] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [19:59:15] I too am here for the backports [19:59:34] such a party [19:59:41] backport party! [19:59:50] * taavi feels left out since he does not have anything to backport [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: May I have your attention please! UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240516T2000) [20:00:05] jdrewniak, edsanders, JSherman, and cscott: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:44] o/ [20:00:53] I am happy to self deploy when it's my turn as I have a labs-only change [20:02:06] PROBLEM - Hadoop DataNode on an-worker1172 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [20:03:04] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.hadoop.roll-restart-workers (exit_code=99) restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade. [20:03:20] ^ looking [20:03:25] I would usually self-deploy but I'm in the middle of a meeting right now so if someone else could, that'd be appreciated. [20:04:49] I am happy to pitch in if none of the scheduled deployers is available. This would be my first solo backport. [20:05:00] Ok in the interest of time I'm gonna self-deploy mine :P [20:05:02] taavi: ^ [20:05:55] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T352010)', diff saved to https://phabricator.wikimedia.org/P62544 and previous config saved to /var/cache/conftool/dbconfig/20240516-200552-ladsgroup.json [20:05:57] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [20:06:00] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:06:10] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance [20:06:10] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032398 (https://phabricator.wikimedia.org/T365084) (owner: 10Mabualruz) [20:06:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T352010)', diff saved to https://phabricator.wikimedia.org/P62545 and previous config saved to 
/var/cache/conftool/dbconfig/20240516-200618-ladsgroup.json [20:06:57] (03Merged) 10jenkins-bot: Fix exclude list for dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032398 (https://phabricator.wikimedia.org/T365084) (owner: 10Mabualruz) [20:07:06] RECOVERY - Hadoop DataNode on an-worker1172 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.hdfs.server.datanode.DataNode https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23HDFS_Datanode_process [20:08:26] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:1032398|Fix exclude list for dark mode (T365084)]] [20:08:30] T365084: Night mode exclude list doesn't appear to be working with various pages (including Special:AbuseLog or diff pages) - https://phabricator.wikimedia.org/T365084 [20:08:44] !log [Hadoop] Restarted `hadoop-hdfs-datanode` on `an-worker1172` [20:08:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:23] (03PS5) 10Scott French: configure parsercache servers via dbconfig in etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031583 (https://phabricator.wikimedia.org/T362786) [20:11:04] !log jdrewniak@deploy1002 jdrewniak and mabualruz: Backport for [[gerrit:1032398|Fix exclude list for dark mode (T365084)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:11:48] (03PS1) 10Eevans: cassandra_dev: add data_gateway Cassandra role to cluster [puppet] - 10https://gerrit.wikimedia.org/r/1032570 (https://phabricator.wikimedia.org/T364921) [20:11:49] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on contint2002.wikimedia.org with reason: host reimage [20:12:22] !log jdrewniak@deploy1002 jdrewniak and mabualruz: Continuing with sync [20:12:56] (03PS1) 10Esanders: Update VE core submodule to master (27296e0e3) [extensions/VisualEditor] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032571 (https://phabricator.wikimedia.org/T230323) 
[20:13:20] (03CR) 10Eevans: [C:03+2] cassandra_dev: add data_gateway Cassandra role to cluster [puppet] - 10https://gerrit.wikimedia.org/r/1032570 (https://phabricator.wikimedia.org/T364921) (owner: 10Eevans) [20:13:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P62546 and previous config saved to /var/cache/conftool/dbconfig/20240516-201326-ladsgroup.json [20:14:43] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on contint2002.wikimedia.org with reason: host reimage [20:24:29] just fyi `1 apaches had sync errors` `snapshot1008.eqiad.wmnet` [20:25:38] (03PS2) 10TChin: datasets-config: Remove service-runner config and update default config [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032491 (https://phabricator.wikimedia.org/T357434) [20:28:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227', diff saved to https://phabricator.wikimedia.org/P62547 and previous config saved to /var/cache/conftool/dbconfig/20240516-202834-ladsgroup.json [20:29:49] JSherman: that would be helpful [20:30:55] edsanders: ack; jan_drewniak: is your backport still running? [20:31:03] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:1032398|Fix exclude list for dark mode (T365084)]] (duration: 22m 36s) [20:31:12] there's the answer [20:31:14] T365084: Night mode exclude list doesn't appear to be working with various pages (including Special:AbuseLog or diff pages) - https://phabricator.wikimedia.org/T365084 [20:31:19] JSherman: same here [20:31:40] cscott: ack [20:32:23] hi JSherman, yes it finished, it had 1 error though, `snapshot1008.eqiad.wmnet port 22: Connection timed out` so don't be surprised if you encounter that. [20:32:43] jan_drewniak: ack [20:32:46] not sure how big of an issue that is... 
[20:33:07] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host contint2002.wikimedia.org with OS bullseye [20:33:19] !log contint2002 - as usual have to manually "a2dismod mpm_event" on a machine using apache that has just been installed to fix the race condition with apache modules [20:33:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:38] !log eevans@cumin1002 START - Cookbook sre.hosts.remove-downtime for aqs1013.eqiad.wmnet [20:33:39] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.remove-downtime (exit_code=0) for aqs1013.eqiad.wmnet [20:33:49] the top errors I'm seeing in logspam are unrelated (looking) timeouts [20:34:15] edsanders: I'm getting yours started [20:34:25] thanks [20:35:28] oh yeah, even toolforge seems to be having connectivity issues [20:36:09] we'll see what scap does [20:39:32] hmm, scap is throwing me a warning/sanity check message `Change '1032516', project 'mediawiki/extensions/VisualEditor', branch 'master' not found in any deployed wikiversion. 
Deployed wikiversions: ['1.43.0-wmf.5']` I'm going to check the docs to make sure this is nothing scary [20:42:26] (03CR) 10Dzahn: [V:03+1 C:03+2] admin: add Dennis Mburugu to ldap_only users (wmf) [puppet] - 10https://gerrit.wikimedia.org/r/1032047 (https://phabricator.wikimedia.org/T364320) (owner: 10Dzahn) [20:42:31] It's probably due to the weirdness of the git submodule which isn't specifically branched [20:42:32] okay, I figured it out [20:42:43] And updates to it commonly don't happen in backports etc [20:42:44] it looks like I should use change 1032571 for the backport [20:42:56] https://gerrit.wikimedia.org/r/c/mediawiki/extensions/VisualEditor/+/1032571 [20:43:21] at least that's available when I list out the available backports [20:43:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1227 (T352010)', diff saved to https://phabricator.wikimedia.org/P62548 and previous config saved to /var/cache/conftool/dbconfig/20240516-204342-ladsgroup.json [20:43:45] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [20:43:50] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:43:58] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [20:46:03] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [extensions/VisualEditor] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032571 (https://phabricator.wikimedia.org/T230323) (owner: 10Esanders) [20:49:38] FIRING: [2x] ProbeDown: Service cloudidm2001-dev:443 has failed probes (http_cloudtestidm_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/IDM/Runbook - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:51:17] (03CR) 
10Kimberly Sarabia: "Thank you!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) (owner: 10Kimberly Sarabia) [20:51:55] jouncebot: next [20:51:55] In 9 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240517T0600) [20:53:05] edsanders: zuul says the eta for gate-and-submit for this is 14 minutes, meaning we're going to run over. can you stick around for testing? [20:53:12] sure [20:54:14] (03PS1) 10Cory Massaro: Update orchestrator image. [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032579 [20:58:19] (03PS2) 10DCausse: wdqs: extract categories reload to its own cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1032544 [20:58:19] (03PS8) 10DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) [21:01:35] edsanders: fyi, when you want to backport code (instead of config), you should use the change id of the cherry-pick on the release branch rather than the original commit in your master branch. 
Most deployers are probably so experienced that they don't even have to think about it, but I had to go look it up in the docs https://wikitech.wikimedia.org/wiki/Backport_windows#How_to_submit_a_patch_for_backport [21:02:27] JSherman: thanks, yeah I only created the cherry pick because I was worried about CI [21:05:19] I'm glad you did, maybe backporting against master would have been fine, but this let me take the paved path (which is the only one I'm currently capable of taking) [21:05:49] (03Merged) 10jenkins-bot: Update VE core submodule to master (27296e0e3) [extensions/VisualEditor] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032571 (https://phabricator.wikimedia.org/T230323) (owner: 10Esanders) [21:05:54] !log LDAP - added uid dmuthuri to group wmf T364320 [21:05:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:05:57] T364320: LDAP access to the wmf group for Dennis Mburugu - https://phabricator.wikimedia.org/T364320 [21:06:07] !log jsn@deploy1002 Started scap: Backport for [[gerrit:1032571|Update VE core submodule to master (27296e0e3) (T230323 T365052)]] [21:06:13] T230323: Use MutationObservers to detect structural changes - https://phabricator.wikimedia.org/T230323 [21:06:14] T365052: Changing a table Content Cell into a Header Cell removes all cell content - https://phabricator.wikimedia.org/T365052 [21:06:34] JSherman: i've got a backport of code as well; I *think* I did it correctly? [21:07:35] cscott: I'll have a look while this is syncing [21:08:31] 06SRE, 10LDAP-Access-Requests, 13Patch-For-Review: LDAP access to the wmf group for Dennis Mburugu - https://phabricator.wikimedia.org/T364320#9807103 (10Dzahn) 05In progress→03Resolved a:03Dzahn @DMburugu You have been added to the wmf group as requested. Things should work now. 
[21:08:49] !log jsn@deploy1002 jsn and esanders: Backport for [[gerrit:1032571|Update VE core submodule to master (27296e0e3) (T230323 T365052)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:09:08] edsanders: please test [21:09:13] !log LDAP - added uid rickijay to group nda (T365138) [21:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:17] T365138: Grant Access to nda for Ricki Jay - https://phabricator.wikimedia.org/T365138 [21:09:24] testing [21:10:45] 06SRE, 10LDAP-Access-Requests: Grant Access to nda for Ricki Jay - https://phabricator.wikimedia.org/T365138#9807118 (10Dzahn) 05Open→03Resolved a:03Dzahn Confirmed Ricki Jay is in the "NDA/MOU" spreadsheet which confirms they signed NDA and is in the wmde group. Added to nda group. done. [21:10:47] JSherman: works! [21:10:49] thanks [21:10:49] (03PS9) 10DCausse: wdqs.data-reload: support HDFS as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) [21:10:56] cscott: I'm just sanity checking: this was tested / is testable on beta? [21:11:36] edsanders: glad to help! [21:11:48] !log jsn@deploy1002 jsn and esanders: Continuing with sync [21:12:18] JSherman: that's an excellent question, let me see if I can bang on beta here for a bit. [21:12:19] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9807138 (10Dzahn) @KOfori Here is another one for cassandra-staging-devs group approver [21:21:31] Yeah, I can trigger the JSON failure on beta and my patch appears to add some additional debug info as desired: https://beta-logs.wmcloud.org/app/discover#/doc/5f0c9be0-0b6f-11ec-9cde-3f4490e09a26/logstash-mediawiki-1-7.0.0-1-2024.05.16?id=R5FFg48BNnUNJvYPQE8W [21:22:07] but the logstash configuration on beta seems different from the production logstash?  And/or I don't understand how logging works. 
[21:23:59] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204 (10cmooney) 03NEW p:05Triage→03High [21:24:10] cscott: that seems good enough to me; Will you be able to trigger the failure on the debug host or does the change have to roll out all the way? [21:25:15] fyi I also encountered an error on 1 host why syncing apaches [21:25:15] `21:23:03 ['/usr/bin/scap', 'pull', '--no-php-restart', '--no-update-l10n', '--exclude-wikiversions.php', 'mw2300.codfw.wmnet', 'mw1366.eqiad.wmnet', 'mw2259.codfw.wmnet', 'deploy1002.eqiad.wmnet', 'deploy2002.codfw.wmnet', 'mw1420.eqiad.wmnet', 'mw1407.eqiad.wmnet', 'mw1398.eqiad.wmnet', 'mw1404.eqiad.wmnet', 'mw2289.codfw.wmnet'] (ran as mwdeploy@snapshot1008.eqiad.wmnet) returned [255]: ssh: connect to host [21:25:15] snapshot1008.eqiad.wmnet port 22: Connection timed out` [21:25:38] JSherman: I'm not entirely sure -- I can't remember if the rest APIs respect the X-Debug header or not.  But I can try. [21:26:52] I'm going to go ahead and +2 your cherry pick to get some ci out of the way. Let's try to test on debug, but not block on it [21:27:04] great thanks [21:27:08] (03CR) 10Jsn.sherman: [C:03+2] [JsonCodec, ParserCache] Improve debugging of serializability failures [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032435 (https://phabricator.wikimedia.org/T365036) (owner: 10C. Scott Ananian) [21:27:45] (03CR) 10Jsn.sherman: [JsonCodec, ParserCache] Improve debugging of serializability failures [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032435 (https://phabricator.wikimedia.org/T365036) (owner: 10C. 
Scott Ananian) [21:28:01] yeah, that was the wrong thing to do [21:28:14] I reset the vote [21:30:19] (03CR) 10Jsn.sherman: [C:04-1] "-1 just for safety to prevent merge for now" [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032435 (https://phabricator.wikimedia.org/T365036) (owner: 10C. Scott Ananian) [21:31:15] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9807207 (10cmooney) [21:31:18] !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1032571|Update VE core submodule to master (27296e0e3) (T230323 T365052)]] (duration: 25m 10s) [21:31:23] T230323: Use MutationObservers to detect structural changes - https://phabricator.wikimedia.org/T230323 [21:31:24] T365052: Changing a table Content Cell into a Header Cell removes all cell content - https://phabricator.wikimedia.org/T365052 [21:31:39] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9807214 (10cmooney) [21:33:21] hmm, edsanders: backport errored out; it looks like we had timeouts during sync at the end [21:34:03] JSherman: I see a decom task for snapshot2008 so it's probably fine https://phabricator.wikimedia.org/T364455 [21:34:31] For that specific error [21:34:50] would that impact snapshot1008 too? [21:34:57] 1008 [21:35:00] I meant that [21:35:17] ah, whew [21:35:26] thank you! 
[21:35:57] In that case, I'm going to proceed with my labs only config change [21:36:35] (03CR) 10Jdlrobson: Introduce sample overrides to web_ui_actions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) (owner: 10Kimberly Sarabia) [21:37:33] (03CR) 10Jsn.sherman: [C:03+2] CommonSettings-labs: Correct AutoModeratorLiftWing settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032545 (owner: 10Jsn.sherman) [21:37:41] (03PS3) 10Jdlrobson: Disable wgParserEnableLegacyMediaDOM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031610 (https://phabricator.wikimedia.org/T363597) [21:38:14] (03Merged) 10jenkins-bot: CommonSettings-labs: Correct AutoModeratorLiftWing settings [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032545 (owner: 10Jsn.sherman) [21:38:49] okay, rebased [21:39:09] JSherman: I left a message in -sre about snapshot1008 so someone can depool it and it stop throwing scary messages [21:39:16] cscott: I'm resetting my -1 on your cherry pick [21:39:26] RhinosF1: many thanks! [21:39:40] (03CR) 10Jsn.sherman: [C:03+1] [JsonCodec, ParserCache] Improve debugging of serializability failures [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032435 (https://phabricator.wikimedia.org/T365036) (owner: 10C. Scott Ananian) [21:39:47] JSherman: what's the plan now? [21:41:31] I'll start the backport for now. will you be able to tell the difference between the debug header not being respected and the patch having a problem? [21:42:14] yeah i ought to be able to. [21:42:25] ok, I'll let you know when it's testing time [21:43:11] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jsn@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032435 (https://phabricator.wikimedia.org/T365036) (owner: 10C. 
Scott Ananian) [21:43:12] the patch changes the exception message in a way such that I should be able to tell if it's running; if I can trigger the log but it has the 'old' exception message that's almost certainly the debug header not being respected. [21:43:38] makes sense! [21:45:24] 10ops-codfw, 06SRE, 06Infrastructure-Foundations, 10netops: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9807226 (10cmooney) [21:45:31] gate and submit is so slow that I managed to go through all of my bumbling before that initial job completed. I'm hoping it will just move forward since there are no - votes [21:49:34] (03Merged) 10jenkins-bot: [JsonCodec, ParserCache] Improve debugging of serializability failures [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1032435 (https://phabricator.wikimedia.org/T365036) (owner: 10C. Scott Ananian) [21:49:54] !log jsn@deploy1002 Started scap: Backport for [[gerrit:1032435|[JsonCodec, ParserCache] Improve debugging of serializability failures (T365036)]] [21:49:58] T365036: JSON serialization failures on media files - https://phabricator.wikimedia.org/T365036 [21:49:59] JSherman: it did merge so we're in luck [21:50:07] accidental life hack [21:51:46] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2499/console" [puppet] - 10https://gerrit.wikimedia.org/r/1032032 (owner: 10Dzahn) [21:52:35] JSherman: so we're waiting for it to sync to the debug servers? [21:52:46] !log jsn@deploy1002 cscott and jsn: Backport for [[gerrit:1032435|[JsonCodec, ParserCache] Improve debugging of serializability failures (T365036)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:52:58] cscott: just landed, ready to test [21:53:21] are we on eqiad or codfw? [21:54:19] no clue! 
[21:54:55] My understanding is that if you are using the browser extension, any host should work [21:55:41] ok, testing, give me a few minutes [21:55:44] things are pretty abstracted by scap [21:55:52] eqiad [21:56:13] Reedy: thanks! [21:57:22] oh duh, as in which deployment server did I shell into? [21:58:23] (03Abandoned) 10Kimberly Sarabia: Remove sampling rate in config for MP events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1017378 (https://phabricator.wikimedia.org/T361962) (owner: 10Kimberly Sarabia) [22:02:01] !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@cb359e4]: add dags to collect daily webrequest and satisfaction search metrics [22:02:27] !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@cb359e4]: add dags to collect daily webrequest and satisfaction search metrics (duration: 00m 25s) [22:03:02] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-openldap-exporter.service on seaborgium:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:04:20] (03PS4) 10Kimberly Sarabia: Introduce sample overrides to web_ui_actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) [22:04:57] (03CR) 10Kimberly Sarabia: Introduce sample overrides to web_ui_actions (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) (owner: 10Kimberly Sarabia) [22:08:06] sorry this is taking so long, i'm still poking at this [22:08:18] wow do I have plenty to learn about our infrastructure; I've only ever used the toolbar for debugging and never bothered changing the host [22:08:21] (03CR) 10Jdlrobson: [C:03+1] Introduce sample overrides to web_ui_actions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1024813 (https://phabricator.wikimedia.org/T361962) (owner: 
10Kimberly Sarabia) [22:09:44] cscott: considering your backport didn't start until 40 minutes after the window ended, I don't think any apologies are warranted. I appreciate the caution on your part. [22:14:03] (03PS7) 10Zabe: Use encrypted Argon2 Hashes to store user passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029183 (https://phabricator.wikimedia.org/T150647) [22:14:03] (03PS1) 10Zabe: Deploy configuration for wrapping B type passwords [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032586 (https://phabricator.wikimedia.org/T112359) [22:15:02] (03PS2) 10Zabe: Deploy configuration for wrapping B type passwords with encrypted Argon2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032586 (https://phabricator.wikimedia.org/T112359) [22:16:40] (03PS3) 10Zabe: Deploy configuration for wrapping B type passwords with encrypted Argon2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1032586 (https://phabricator.wikimedia.org/T112359) [22:19:33] JSherman: i can't seem to get the rest response routed to the test server, at least I think that's what's going on.   I'm going to try to just filter logstash to the test servers to see if maybe some queries have come in by chance and ended up on the right server. [22:20:17] JSherman: i'm pretty sure i haven't broken anything at least, ie the patch is safe.  nothing bad is in the logs. (which is the problem, since I'm trying to get more information about the bad stuff!) 
[22:21:10] cscott: looking at the code, that makes sense to me; these paths are only taken when something has gone wrong, and you've set a default for your new parameter [22:21:33] I'll proceed, just plan on checking prod once we're synced [22:22:34] 06SRE, 06Infrastructure-Foundations, 06Traffic: Slowly ramping up traffic to the Brazil data center (magru) and related geo-maps - https://phabricator.wikimedia.org/T359054#9807307 (10CDanis) Adding the 3rd transit link in magru **greatly** improved the latency for many users in Argentina. The transit link... [22:24:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T352010)', diff saved to https://phabricator.wikimedia.org/P62549 and previous config saved to /var/cache/conftool/dbconfig/20240516-222414-ladsgroup.json [22:24:20] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [22:24:32] JSherman: ok, thanks. [22:24:50] !log jsn@deploy1002 Sync cancelled. [22:25:10] !log jsn@deploy1002 Started scap: Backport for [[gerrit:1032435|[JsonCodec, ParserCache] Improve debugging of serializability failures (T365036)]] [22:25:14] T365036: JSON serialization failures on media files - https://phabricator.wikimedia.org/T365036 [22:25:32] ugh. 
I accidentally cancelled; running again [22:26:18] I was looking at the logs in copy mode and did an extra carriage return when I exited [22:26:28] (in tmux) [22:27:32] scsott: my turn to apologize for the wait [22:27:41] !log jsn@deploy1002 jsn and cscott: Backport for [[gerrit:1032435|[JsonCodec, ParserCache] Improve debugging of serializability failures (T365036)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [22:27:48] !log jsn@deploy1002 jsn and cscott: Continuing with sync [22:28:04] (03PS1) 10Scott French: wmnet: add data-gateway CNAME record for k8s ingress [dns] - 10https://gerrit.wikimedia.org/r/1032590 (https://phabricator.wikimedia.org/T364921) [22:28:25] good, it looks like it kept state and got right back to the production sync step [22:28:29] (03PS1) 10Scott French: kubernetes: add data-gateway usernames for deployment server [puppet] - 10https://gerrit.wikimedia.org/r/1032591 (https://phabricator.wikimedia.org/T364921) [22:28:40] (03PS1) 10Scott French: service: add data-gateway service (k8s ingress) [puppet] - 10https://gerrit.wikimedia.org/r/1032592 (https://phabricator.wikimedia.org/T364921) [22:28:46] (03PS1) 10Scott French: service: move data-gateway service to production [puppet] - 10https://gerrit.wikimedia.org/r/1032593 (https://phabricator.wikimedia.org/T364921) [22:29:10] (03PS1) 10Scott French: admin_ng: add namespace for data-gateway [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032594 (https://phabricator.wikimedia.org/T364921) [22:29:12] (03PS1) 10Scott French: services: add data-gateway service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032595 (https://phabricator.wikimedia.org/T364921) [22:30:03] (03CR) 10EoghanGaffney: [C:03+1] lists: move definition of primary and standby host to common hiera (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1032032 (owner: 10Dzahn) [22:33:42] we're at the halfway mark for sync-prod-k8s [22:35:18] JSherman: and i'm seeing logs w/ the new 
code in production, so that's great from my side. [22:35:51] excellent 🎉 [22:39:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P62550 and previous config saved to /var/cache/conftool/dbconfig/20240516-223922-ladsgroup.json [22:41:08] okay, it looks like we're going to get all those same errors for `snapshot1008.eqiad.wmnet`. It is slowing things down since it's adding a timeout on several steps, but isn't a problem otherwise. [22:44:20] 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9807339 (10Scott_French) 05Open→03In progress [22:47:08] !log jsn@deploy1002 Finished scap: Backport for [[gerrit:1032435|[JsonCodec, ParserCache] Improve debugging of serializability failures (T365036)]] (duration: 21m 57s) [22:47:12] T365036: JSON serialization failures on media files - https://phabricator.wikimedia.org/T365036 [22:48:32] cscott: okay, your patch is backported! [22:51:21] RECOVERY - MediaWiki CirrusSearch update rate - codfw on alert1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://wikitech.wikimedia.org/wiki/Search%23No_updates https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [22:51:45] I'm going to keep connected in the background for a while in case anybody needs to mention me about these backports [22:52:03] but things do not look angry at the moment [22:54:31] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P62551 and previous config saved to /var/cache/conftool/dbconfig/20240516-225430-ladsgroup.json [22:56:20] JSherman: thank you so much! 
[22:58:04] FIRING: [2x] HelmReleaseBadStatus: Helm release datasets-config/production on k8s-dse@eqiad in state pending-install - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [22:58:13] jouncebot: nowandnext [22:58:13] No deployments scheduled for the next 7 hour(s) and 1 minute(s) [22:58:13] In 7 hour(s) and 1 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240517T0600) [22:58:41] (03PS1) 10Scott French: envoy: add data-gateway service listener [puppet] - 10https://gerrit.wikimedia.org/r/1032599 (https://phabricator.wikimedia.org/T364921) [23:03:02] FIRING: JobUnavailable: Reduced availability for job mysql-test in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:04:56] !log ebernhardson@deploy1002 Started deploy [airflow-dags/search@312e2be]: Correct new range partition sensor granularity [23:05:17] !log ebernhardson@deploy1002 Finished deploy [airflow-dags/search@312e2be]: Correct new range partition sensor granularity (duration: 00m 21s) [23:09:31] Hmm is there a known issue with things loading slowly? 
It says ""wgBackendResponseTime":121" for a page I looked at [23:09:39] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T352010)', diff saved to https://phabricator.wikimedia.org/P62552 and previous config saved to /var/cache/conftool/dbconfig/20240516-230939-ladsgroup.json [23:09:41] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [23:09:43] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [23:09:44] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [23:09:46] 06SRE, 10Cassandra, 06serviceops, 10Data Products (Data Products Sprint 13), and 2 others: Commons Impact Metrics: Data Gateway endpoints - https://phabricator.wikimedia.org/T364921#9807379 (10Scott_French) Many thanks for getting the image builds running and setting up the data_gateway role, @Eevans. Wit... [23:09:52] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T352010)', diff saved to https://phabricator.wikimedia.org/P62553 and previous config saved to /var/cache/conftool/dbconfig/20240516-230951-ladsgroup.json [23:17:35] !log zabe@deploy1002 Synchronized private/PrivateSettings.php: Add secret for encrypting user password hashes - T150647 (duration: 16m 42s) [23:17:39] T150647: Deploy EncryptedPassword to Wikimedia Sites - https://phabricator.wikimedia.org/T150647 [23:18:14] Bsadowski1: what page? [23:19:01] some File: pages with useParsoid=1 are skipping the cache at the moment, but you'd probably only see that if you're an early opt-in to parsoid read views [23:19:39] Oh hmm let me see if I am using Parsoid on this wiki I am using [23:23:29] (03CR) 10Cwhite: "This includes a filter to set the timestamp of the event to requestReceivedTimestamp." 
[puppet] - 10https://gerrit.wikimedia.org/r/1031602 (https://phabricator.wikimedia.org/T290020) (owner: 10Cwhite) [23:32:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:33:02] gerrit seems sick ^ [23:34:00] the web ui is being really weird and sluggish along with that alert [23:34:35] bd808: Possibly related: https://phabricator.wikimedia.org/T365148 [23:37:39] I would randomly guess the problem today is related to the problem yesterday (apache overloaded). [23:43:00] apache busy workers is pegged, but I suspect gerrit itself. Seems safe enough to give apache a restart though. I'd say let's try that first. [23:43:21] !log restart apache on gerrit1003 [23:43:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:43:26] cwhite: SGTM!
[23:46:14] One point to bd808 - gerrit ui seems more responsive now post apache restart [23:47:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:52:47] (03PS1) 10Andrea Denisse: smart: Refine data collection to differentiate RAID and non-RAID disks [puppet] - 10https://gerrit.wikimedia.org/r/1032608 (https://phabricator.wikimedia.org/T267664) [23:53:08] (03CR) 10CI reject: [V:04-1] smart: Refine data collection to differentiate RAID and non-RAID disks [puppet] - 10https://gerrit.wikimedia.org/r/1032608 (https://phabricator.wikimedia.org/T267664) (owner: 10Andrea Denisse) [23:57:23] (03PS2) 10Andrea Denisse: smart: Refine data collection to differentiate RAID and non-RAID disks [puppet] - 10https://gerrit.wikimedia.org/r/1032608 (https://phabricator.wikimedia.org/T267664)