[00:38:55] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1006113
[00:38:58] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1006113 (owner: TrainBranchBot)
[00:44:37] RECOVERY - Disk space on centrallog1002 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops
[00:59:26] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1006113 (owner: TrainBranchBot)
[01:08:25] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:21:23] (PS4) Sohom Datta: Remove the Collection extension from wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1006164 (https://phabricator.wikimedia.org/T358437)
[02:38:44] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:13:44] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:48:44] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state -
https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:49:15] (CR) Samwilson: [C: +1] Remove the Collection extension from wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1006164 (https://phabricator.wikimedia.org/T358437) (owner: Sohom Datta)
[03:50:05] (CR) Samwilson: Remove the Collection extension from wikisource (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1006164 (https://phabricator.wikimedia.org/T358437) (owner: Sohom Datta)
[03:59:38] (PS1) Tim Starling: Switch block schema to read-old/write-both mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1006179 (https://phabricator.wikimedia.org/T355034)
[03:59:40] (PS1) Tim Starling: Switch block schema to read-new/write-both mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1006180 (https://phabricator.wikimedia.org/T355034)
[03:59:42] (PS1) Tim Starling: Switch block schema to read-new/write-new mode [mediawiki-config] - https://gerrit.wikimedia.org/r/1006181 (https://phabricator.wikimedia.org/T355034)
[04:19:46] (PS1) KartikMistry: cxserver: Remove dictionary support [deployment-charts] - https://gerrit.wikimedia.org/r/1006182
[04:42:07] PROBLEM - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [250.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50
[04:46:09] RECOVERY - CirrusSearch comp_suggest codfw 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [100.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring
https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=codfw&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=50
[05:08:25] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:08:43] (PS5) Sohom Datta: Remove the Collection extension from wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1006164 (https://phabricator.wikimedia.org/T358437)
[05:09:14] (CR) Sohom Datta: Remove the Collection extension from wikisource (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/1006164 (https://phabricator.wikimedia.org/T358437) (owner: Sohom Datta)
[05:38:00] (CR) Samwilson: [C: +1] Remove the Collection extension from wikisource [mediawiki-config] - https://gerrit.wikimedia.org/r/1006164 (https://phabricator.wikimedia.org/T358437) (owner: Sohom Datta)
[06:26:38] (PS1) Tim Starling: beta: Re-enable partial action blocks [mediawiki-config] - https://gerrit.wikimedia.org/r/1006185 (https://phabricator.wikimedia.org/T353495)
[07:36:49] (CR) Slyngshede: [V: +1 C: +2] C:puppetmaster::monitoring Disable Icinga merge check.
[puppet] - https://gerrit.wikimedia.org/r/1005743 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[07:43:08] (CR) Slyngshede: [C: +2] Add gitreview configuration [software/bitu] - https://gerrit.wikimedia.org/r/997809 (https://phabricator.wikimedia.org/T355180) (owner: Slyngshede)
[07:48:09] (Merged) jenkins-bot: Add gitreview configuration [software/bitu] - https://gerrit.wikimedia.org/r/997809 (https://phabricator.wikimedia.org/T355180) (owner: Slyngshede)
[07:48:39] (CR) Slyngshede: [V: +1 C: +2] P:idp Force Tomcat to use the default Java installation. [puppet] - https://gerrit.wikimedia.org/r/1005094 (https://phabricator.wikimedia.org/T357749) (owner: Slyngshede)
[07:48:44] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:00:04] Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240226T0800).
[08:00:05] No Gerrit patches in the queue for this window AFAICS.
[08:46:31] Etherpad will be down in 15 minutes for around one hour - T316421
[08:46:31] T316421: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421
[08:51:21] !log deploy "facebookexternalhit" varnish 403 - T358455
[08:51:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:51:28] T358455: Primary outbound port utilisation over 80% alert muted - https://phabricator.wikimedia.org/T358455
[08:58:45] !log IDP switchover to idp2002
[08:58:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:00:32] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on etherpad1003.eqiad.wmnet with reason: Upgrade etherpad and switch to bookworm
[09:00:46] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on etherpad1003.eqiad.wmnet with reason: Upgrade etherpad and switch to bookworm
[09:08:41] (SystemdUnitFailed) firing: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:10:35] !log jayme@cumin1002 START - Cookbook sre.hosts.reboot-single for host mw2442.codfw.wmnet
[09:10:50] !log jayme@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host mw2442.codfw.wmnet
[09:12:09] !log jayme@cumin1002 START - Cookbook sre.hosts.reboot-single for host mw2442.codfw.wmnet
[09:12:18] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:12:56] that's my reboot
[09:14:46] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Connect - kubernetes-codfw, AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[09:23:51] !log unmute the
outbound port utilisation over 80% alert T358455
[09:23:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:23:57] T358455: Primary outbound port utilisation over 80% alert muted - https://phabricator.wikimedia.org/T358455
[09:32:32] PROBLEM - Host mw2442 is DOWN: PING CRITICAL - Packet loss = 100%
[09:33:33] (KubernetesCalicoDown) firing: mw2442.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2442.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[09:43:16] Etherpad maintenance finished - T316421
[09:43:17] T316421: Upgrade etherpad.wikimedia.org to v1.9.7 - https://phabricator.wikimedia.org/T316421
[09:53:04] Looking at mw2442, ssh down, management console ok
[09:53:28] claime: j.ayme seems to be rebooting that
[09:53:38] oh yeah just saw SAL
[09:55:16] ah yeah it's the one with a RAID issue
[09:56:35] while the mgmt can be connected to, the serial console is dead after the reboot and the server unreachable, this will need DC ops to have a look
[09:56:55] I'm connected to the serial console?
[09:57:31] ugh
[09:57:39] E_NOTENOUGHCOFFEE
[09:57:41] 2442?
still dead for me
[09:57:42] wrong host
[09:57:46] ok :-)
[10:01:19] I'm about to disable puppet on all cp hosts to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1005548 selectively
[10:01:31] (I'll log it here)
[10:04:06] (PS2) Muehlenhoff: udp2log::instance: Use Stdlib::Port for the port [puppet] - https://gerrit.wikimedia.org/r/1006482
[10:04:17] !log disabled puppet on all cp hosts to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1005548 (T358105, T358107)
[10:04:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:04:30] T358105: Change HAProxy log-format to support missing information - https://phabricator.wikimedia.org/T358105
[10:04:31] T358107: Change mtail configuration to ignore new fields in HAProxy logs - https://phabricator.wikimedia.org/T358107
[10:06:14] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1006482 (owner: Muehlenhoff)
[10:07:18] !log installing perl security updates
[10:07:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:08:52] (CR) Fabfur: [V: +1 C: +2] haproxy: configure extended logging (preparatory for Benthos) [puppet] - https://gerrit.wikimedia.org/r/1005548 (https://phabricator.wikimedia.org/T358105) (owner: Fabfur)
[10:13:17] (CR) Volans: [C: +1] "Looks sane, if testing is successful go ahead."
[cookbooks] - https://gerrit.wikimedia.org/r/1005573 (https://phabricator.wikimedia.org/T306421) (owner: Cathal Mooney)
[10:16:29] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/1006001 (https://phabricator.wikimedia.org/T336275) (owner: Ayounsi)
[10:22:35] sre-alert-triage, cloud-services-team, wikitech.wikimedia.org: Alert in need of triage: Wikitech-static MW version up to date (instance wikitech-static.wikimedia.org) - https://phabricator.wikimedia.org/T357880#9575766 (taavi) a:taavi
[10:25:36] RECOVERY - Host mw2442 is UP: PING OK - Packet loss = 0%, RTA = 30.30 ms
[10:26:40] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 321, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:27:04] !log upgrading wikitech-static to mediawiki 1.41 T357880
[10:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:09] T357880: Alert in need of triage: Wikitech-static MW version up to date (instance wikitech-static.wikimedia.org) - https://phabricator.wikimedia.org/T357880
[10:28:01] ACKNOWLEDGEMENT - MD RAID on mw2442 is CRITICAL: CRITICAL: State: degraded, Active: 1, Working: 1, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T358474 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[10:28:05] SRE, ops-codfw: Degraded RAID on mw2442 - https://phabricator.wikimedia.org/T358474#9575783 (ops-monitoring-bot)
[10:29:06] !log jayme@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mw2442.codfw.wmnet
[10:30:01] (PS1) Btullis: Install an-redacteddb1001 with puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1006486 (https://phabricator.wikimedia.org/T355571)
[10:30:37] (PS1) Majavah: hieradata: redirect wikitech-static icinga alerts to WMCS [puppet] - https://gerrit.wikimedia.org/r/1006487
(https://phabricator.wikimedia.org/T357880)
[10:33:43] (CR) Muehlenhoff: Install an-redacteddb1001 with puppet 7 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1006486 (https://phabricator.wikimedia.org/T355571) (owner: Btullis)
[10:36:29] !log enabled puppet on 'A:cp-ulsfo' to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1005548 (T358105, T358107)
[10:36:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:36:36] T358105: Change HAProxy log-format to support missing information - https://phabricator.wikimedia.org/T358105
[10:36:37] T358107: Change mtail configuration to ignore new fields in HAProxy logs - https://phabricator.wikimedia.org/T358107
[10:37:03] (KubernetesCalicoDown) resolved: mw2442.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=mw2442.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[10:40:56] (CR) Btullis: Install an-redacteddb1001 with puppet 7 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1006486 (https://phabricator.wikimedia.org/T355571) (owner: Btullis)
[10:41:13] (PS2) Btullis: Install an-redacteddb1001 with puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1006486 (https://phabricator.wikimedia.org/T355571)
[10:43:58] (CR) Muehlenhoff: Install an-redacteddb1001 with puppet 7 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1006486 (https://phabricator.wikimedia.org/T355571) (owner: Btullis)
[10:47:17] (Abandoned) Btullis: Install an-redacteddb1001 with puppet 7 [puppet] - https://gerrit.wikimedia.org/r/1006486 (https://phabricator.wikimedia.org/T355571) (owner: Btullis)
[10:47:47] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-redacteddb1001.eqiad.wmnet with OS bookworm
[10:48:00] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE ( 2024.02.12
- 2024.03.03), Patch-For-Review: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9575848 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-redacteddb...
[10:48:19] (CR) Btullis: Install an-redacteddb1001 with puppet 7 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1006486 (https://phabricator.wikimedia.org/T355571) (owner: Btullis)
[10:54:39] (PS1) Fabfur: haproxy: enable extended logformat for cp4037 [puppet] - https://gerrit.wikimedia.org/r/1006489 (https://phabricator.wikimedia.org/T358105)
[10:57:11] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM" [alerts] - https://gerrit.wikimedia.org/r/1002983 (https://phabricator.wikimedia.org/T357406) (owner: Majavah)
[10:57:58] (PS2) Btullis: Migrate analytics-presto to a new hadoop coordinator [dns] - https://gerrit.wikimedia.org/r/998440 (https://phabricator.wikimedia.org/T336045)
[10:58:32] (CR) Fabfur: [V: +1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1006489 (https://phabricator.wikimedia.org/T358105) (owner: Fabfur)
[10:58:33] ops-codfw, serviceops: Degraded RAID on mw2442 - https://phabricator.wikimedia.org/T357380#9575876 (JMeybohm) The new disk was not detected by the host, even after scsi scan (maybe that's not a thing anymore? ;)) Anyhow. I rebooted the node and it did not came back up. Powercycling again with console att...
[10:58:51] SRE, ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T358417#9575877 (phaultfinder)
[11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240226T1100)
[11:01:22] (CR) Majavah: [C: +2] team-wmcs: haproxy: take backup servers in account in calculations [alerts] - https://gerrit.wikimedia.org/r/1002983 (https://phabricator.wikimedia.org/T357406) (owner: Majavah)
[11:02:55] (Merged) jenkins-bot: team-wmcs: haproxy: take backup servers in account in calculations [alerts] - https://gerrit.wikimedia.org/r/1002983 (https://phabricator.wikimedia.org/T357406) (owner: Majavah)
[11:07:10] (CR) Btullis: [V: +2 C: +2] Migrate analytics-presto to a new hadoop coordinator [dns] - https://gerrit.wikimedia.org/r/998440 (https://phabricator.wikimedia.org/T336045) (owner: Btullis)
[11:07:11] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply
[11:07:40] !log brouberol@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply
[11:11:47] (PS1) Majavah: cloudlb: wikireplicas: shutdown sessions on down servers [puppet] - https://gerrit.wikimedia.org/r/1006492 (https://phabricator.wikimedia.org/T300427)
[11:13:30] (CR) Filippo Giunchedi: [C: +1] hieradata: redirect wikitech-static icinga alerts to WMCS [puppet] - https://gerrit.wikimedia.org/r/1006487 (https://phabricator.wikimedia.org/T357880) (owner: Majavah)
[11:13:37] (PS7) Slyngshede: C:prometheus::process_exporter Add a simplistic process exporter. [puppet] - https://gerrit.wikimedia.org/r/1004672
[11:13:47] !log btullis@cumin1002 START - Cookbook sre.presto.roll-restart-workers for Presto analytics cluster: Roll restart of all Presto's jvm daemons.
[11:14:28] (CR) Filippo Giunchedi: [C: +1] udp2log::instance: Use Stdlib::Port for the port [puppet] - https://gerrit.wikimedia.org/r/1006482 (owner: Muehlenhoff)
[11:15:01] (CR) Majavah: [C: +2] hieradata: redirect wikitech-static icinga alerts to WMCS [puppet] - https://gerrit.wikimedia.org/r/1006487 (https://phabricator.wikimedia.org/T357880) (owner: Majavah)
[11:16:56] ops-codfw, serviceops: Degraded RAID on mw2442 - https://phabricator.wikimedia.org/T357380#9575944 (JMeybohm) @MatthewVernon pointed out (thanks) that this could have helped (if done before the reboot obviously): https://wikitech.wikimedia.org/wiki/Swift/How_To#Replacing_a_disk_without_touching_the_ring...
[11:18:27] !log btullis@cumin1002 END (ERROR) - Cookbook sre.presto.roll-restart-workers (exit_code=97) for Presto analytics cluster: Roll restart of all Presto's jvm daemons.
[11:18:57] !log enabled puppet on 'A:cp' to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1005548 (T358105, T358107)
[11:19:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:03] T358105: Change HAProxy log-format to support missing information - https://phabricator.wikimedia.org/T358105
[11:19:04] T358107: Change mtail configuration to ignore new fields in HAProxy logs - https://phabricator.wikimedia.org/T358107
[11:19:11] ops-codfw, serviceops: Degraded RAID on mw2442 - https://phabricator.wikimedia.org/T357380#9575965 (MatthewVernon) After the reboot, you could still have made the new virtual drive with the last of those lines: ` megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0 `
[11:20:12] ops-codfw, serviceops: Degraded RAID on mw2442 - https://phabricator.wikimedia.org/T357380#9575966 (JMeybohm) >>! In T357380#9575965, @MatthewVernon wrote: > After the reboot, you could still have made the new virtual drive with the last of those lines: > ` > megacli -CfgEachDskRaid0 WB RA Direct CachedB...
[11:20:14] !log STOP persistRevisionThreadItems on viwiki for T315510 again, tons of errors (didn’t even respond to Ctrl+C so I `sudo -u www-data kill`’ed it)
[11:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:20] T315510: Start maintenance script to backfill talk page comment database - https://phabricator.wikimedia.org/T315510
[11:24:02] (CR) Volans: "With PS3 I get:" [software/netbox-deploy] - https://gerrit.wikimedia.org/r/1004192 (owner: Hashar)
[11:24:55] (PS1) Brouberol: cache: declare superset-(next-)k8s.wikimedia.org as alternate domains [puppet] - https://gerrit.wikimedia.org/r/1006493 (https://phabricator.wikimedia.org/T358479)
[11:25:55] sre-alert-triage, cloud-services-team, wikitech.wikimedia.org, Patch-For-Review: Alert in need of triage: Wikitech-static MW version up to date (instance wikitech-static.wikimedia.org) - https://phabricator.wikimedia.org/T357880#9576000 (taavi) Open→Resolved
[11:29:39] (PS2) Brouberol: cache: declare superset-(next-)k8s.wikimedia.org as alternate domains [puppet] - https://gerrit.wikimedia.org/r/1006493 (https://phabricator.wikimedia.org/T358479)
[11:29:47] (CR) Brouberol: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1006493 (https://phabricator.wikimedia.org/T358479) (owner: Brouberol)
[11:30:47] (CR) Btullis: [C: +1] "Looks good to me, but didn't we say that we were getting rid of superset(-next)-k8s very soon?"
[puppet] - https://gerrit.wikimedia.org/r/1006493 (https://phabricator.wikimedia.org/T358479) (owner: Brouberol)
[11:37:31] (CR) Klausman: [C: +1] ml-services: move article-descriptions to prod [deployment-charts] - https://gerrit.wikimedia.org/r/1006194 (https://phabricator.wikimedia.org/T358467) (owner: Kevin Bazira)
[11:41:48] !log Restarting failed mediawiki_job_generatecaptcha
[11:41:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:42:44] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-redacteddb1001.eqiad.wmnet with OS bookworm
[11:42:50] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9576047 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS b...
[11:43:25] (SystemdUnitFailed) resolved: mediawiki_job_generatecaptcha.service on mwmaint2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:44:05] !log btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-redacteddb1001.eqiad.wmnet with OS bookworm
[11:44:11] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE ( 2024.02.12 - 2024.03.03): Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9576050 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by btullis@cumin1002 for host an-redacteddb1001.eqiad.wmnet with...
[11:48:44] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:54:48] SRE, ops-eqiad, DC-Ops, Data-Engineering, Data-Platform-SRE: Q1:rack/setup/install stat1011.eqiad.wmnet - https://phabricator.wikimedia.org/T342454#9576092 (BTullis)
[12:04:08] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-redacteddb1001.eqiad.wmnet with reason: host reimage
[12:07:08] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-redacteddb1001.eqiad.wmnet with reason: host reimage
[12:11:11] (CR) Brouberol: "We did, but I'd rather we can make sure everything works well for superset-k8s.wikimedia.org before migrating, and I've opened https://pha" [puppet] - https://gerrit.wikimedia.org/r/1006493 (https://phabricator.wikimedia.org/T358479) (owner: Brouberol)
[12:11:13] (CR) Brouberol: [C: +2] cache: declare superset-(next-)k8s.wikimedia.org as alternate domains [puppet] - https://gerrit.wikimedia.org/r/1006493 (https://phabricator.wikimedia.org/T358479) (owner: Brouberol)
[12:14:19] (PS1) Clément Goubert: Enable $wgLocalHTTPProxy on all wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/1006497 (https://phabricator.wikimedia.org/T298265)
[12:16:30] (CR) Kevin Bazira: [C: +2] ml-services: move article-descriptions to prod [deployment-charts] - https://gerrit.wikimedia.org/r/1006194 (https://phabricator.wikimedia.org/T358467) (owner: Kevin Bazira)
[12:16:57] (PS1) Samtar: InitialiseSettings: Enable Edit Recovery on arwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/1006498 (https://phabricator.wikimedia.org/T355548)
[12:17:27] (Merged) jenkins-bot: ml-services: move article-descriptions to prod [deployment-charts] -
https://gerrit.wikimedia.org/r/1006194 (https://phabricator.wikimedia.org/T358467) (owner: Kevin Bazira)
[12:20:08] (PS1) Clément Goubert: restbase: Start moving mwapi calls to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/1005756 (https://phabricator.wikimedia.org/T358213)
[12:23:12] (CR) Fabfur: [C: +1] "This looks good to me, if we want to be extra safe, we could try on some (depooled) host after applying this to kill various services and " [puppet] - https://gerrit.wikimedia.org/r/993804 (https://phabricator.wikimedia.org/T355905) (owner: BCornwall)
[12:23:39] (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/1006197
[12:28:07] (PS1) Majavah: P:toolforge: image_builder: refresh for bookworm [puppet] - https://gerrit.wikimedia.org/r/1006516 (https://phabricator.wikimedia.org/T358483)
[12:30:00] (PS1) Clément Goubert: mw-web, mw-api-ext: Raise replicas for 50% traffic [deployment-charts] - https://gerrit.wikimedia.org/r/1006519 (https://phabricator.wikimedia.org/T357507)
[12:31:32] (PS1) Clément Goubert: trafficserver: move 50% of traffic to mw on k8s [puppet] - https://gerrit.wikimedia.org/r/1006520 (https://phabricator.wikimedia.org/T357507)
[12:34:22] (PS8) Slyngshede: C:prometheus::process_exporter Add a simplistic process exporter.
[puppet] - https://gerrit.wikimedia.org/r/1004672
[12:42:08] (PS1) Brouberol: superset: disable rate limiting [deployment-charts] - https://gerrit.wikimedia.org/r/1006521 (https://phabricator.wikimedia.org/T352166)
[12:55:39] !log Restarting MediaModeration scanning maintenance script - See https://wikitech.wikimedia.org/wiki/MediaModeration
[12:55:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:38] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[13:03:12] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for sretest2004 - cmooney@cumin1002"
[13:03:48] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply
[13:04:02] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update dns entries for sretest2004 - cmooney@cumin1002"
[13:04:02] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[13:04:03] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply
[13:04:08] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply
[13:04:16] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-web: apply
[13:04:36] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:04:43] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply
[13:04:54] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply
[13:05:49] !log cmooney@cumin1002 START - Cookbook sre.dns.wipe-cache sretest2004.wikimedia.org on all recursors
[13:05:52] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0)
sretest2004.wikimedia.org on all recursors
[13:06:03] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply
[13:06:23] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2004.wikimedia.org with OS bookworm
[13:06:51] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply
[13:07:25] (SystemdUnitFailed) firing: httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:07:58] ^re-ran, should disappear
[13:08:56] PROBLEM - Host sretest2004 is DOWN: PING CRITICAL - Packet loss = 100%
[13:09:43] !log trafficserver: move 50% of traffic to mw on k8s - T357507
[13:09:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:53] T357507: Move 50% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357507
[13:10:28] RECOVERY - Host sretest2004 is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms
[13:12:25] (SystemdUnitFailed) resolved: httpbb_hourly_appserver.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:12:48] PROBLEM - Host sretest2004 is DOWN: PING CRITICAL - Packet loss = 100%
[13:14:36] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:18:52] ops-codfw, serviceops: Degraded RAID on mw2442 - https://phabricator.wikimedia.org/T357380#9576213 (MoritzMuehlenhoff)
[13:19:14] SRE, ops-codfw: Degraded RAID on mw2442 - https://phabricator.wikimedia.org/T358474#9576211 (MoritzMuehlenhoff)
[13:19:59] (PS1) Jelto:
prometheus::ops: WIP monitor active etherpad instance only [puppet] - https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421)
[13:20:01] SRE, MW-on-K8s, Traffic, serviceops, Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9576217 (Clement_Goubert)
[13:20:10] SRE, MW-on-K8s, Release-Engineering-Team, Traffic, and 2 others: Move 50% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357507#9576216 (Clement_Goubert) Open→Resolved
[13:21:25] SRE, ops-codfw, DC-Ops, Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9576220 (Jhancock.wm) @Marostegui can you add this to site.pp for me?
[13:23:56] SRE, ops-codfw, DC-Ops, Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9576229 (Jhancock.wm)
[13:24:29] SRE, ops-codfw, DC-Ops, Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350#9576230 (Marostegui) @Jhancock.wm all these hosts were added to site.pp long ago: https://gerrit.wikimedia.org/r/c/operations/puppet/+/991269
[13:25:20] (PS1) Klausman: hiera: add deploy config for art-desc on LiftWing [puppet] - https://gerrit.wikimedia.org/r/1006524
[13:27:29] (CR) Jelto: "Do you have a idea how to implement this properly? I want Prometheus to scrape the etherpad exporter on the active host (etherpad1004) but" [puppet] - https://gerrit.wikimedia.org/r/1006523 (https://phabricator.wikimedia.org/T316421) (owner: Jelto)
[13:28:17] (PS1) Muehlenhoff: openstack::base::pdns::recursor::service: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/1006525
[13:29:33] (CR) Kevin Bazira: [C: +1] "LGTM!"
[puppet] - 10https://gerrit.wikimedia.org/r/1006524 (owner: 10Klausman) [13:30:40] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10Traffic, 10serviceops: Move 50% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357507#9576241 (10Clement_Goubert) [13:31:07] 10SRE, 10MW-on-K8s, 10Release-Engineering-Team, 10Traffic, 10serviceops: Move 60% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T357508#9576243 (10Clement_Goubert) 05Stalled→03In progress [13:31:14] 10SRE, 10MW-on-K8s, 10Scap, 10serviceops, 10Release-Engineering-Team (Now this 🫠): Scap should check errors coming from mw-on-k8s canaries during deployments - https://phabricator.wikimedia.org/T357402#9576244 (10Clement_Goubert) [13:31:19] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536#9576245 (10Clement_Goubert) [13:31:39] (03CR) 10Klausman: [C: 03+2] hiera: add deploy config for art-desc on LiftWing [puppet] - 10https://gerrit.wikimedia.org/r/1006524 (owner: 10Klausman) [13:39:02] (03PS6) 10FNegri: elasticsearch: move to opensearch client [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [13:41:45] (03CR) 10Nikerabbit: [C: 04-1] "I am seeing a bunch of changes in https://integration.wikimedia.org/ci/job/operations-mw-config-php74-composer-diffConfig-docker/6165/cons" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995176 (https://phabricator.wikimedia.org/T298235) (owner: 10KartikMistry) [13:44:13] (03PS2) 10Klausman: LiftWing: add missing entry for article-desc certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006528 (https://phabricator.wikimedia.org/T354516) [13:45:29] (03CR) 10CI reject: [V: 04-1] elasticsearch: move to opensearch client [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 
(https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [13:47:40] 10SRE, 10Infrastructure-Foundations, 10netops: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9576320 (10cmooney) p:05Triage→03Medium [13:49:35] (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 55% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006526 (https://phabricator.wikimedia.org/T357508) [13:49:37] (03CR) 10Clément Goubert: "For future merge, for example on 2024-02-28" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006526 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert) [13:49:55] (03PS1) 10Clément Goubert: trafficserver: move 55% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1006527 (https://phabricator.wikimedia.org/T357508) [13:49:57] (03CR) 10Clément Goubert: "For future merge, for example on 2024-02-28" [puppet] - 10https://gerrit.wikimedia.org/r/1006527 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert) [13:50:31] (03CR) 10Kamila Součková: [C: 03+1] mw-web, mw-api-ext: Raise replicas for 55% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006526 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert) [13:51:12] (03CR) 10Kamila Součková: [C: 03+1] trafficserver: move 55% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/1006527 (https://phabricator.wikimedia.org/T357508) (owner: 10Clément Goubert) [13:53:48] (03PS3) 10Klausman: LiftWing: add missing entry for article-desc certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006528 (https://phabricator.wikimedia.org/T358467) [13:54:31] (03CR) 10Kevin Bazira: [C: 03+1] "LGTM!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1006528 (https://phabricator.wikimedia.org/T358467) (owner: 10Klausman) [13:56:11] 10ops-codfw, 10DC-Ops, 10serviceops: mw2420-mw2451 do have unncecesarry raid controllers (configured - https://phabricator.wikimedia.org/T358489#9576345 (10JMeybohm) [13:56:24] (03CR) 10Klausman: [C: 03+2] LiftWing: add missing entry for article-desc certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006528 (https://phabricator.wikimedia.org/T358467) (owner: 10Klausman) [13:57:25] 10ops-codfw, 10serviceops: Degraded RAID on mw2442 - https://phabricator.wikimedia.org/T357380#9576360 (10JMeybohm) 05Open→03Resolved a:03JMeybohm T358489 as follow-up for the strange RAID config, resolving this one. [13:58:07] 10ops-codfw, 10DC-Ops, 10serviceops: mw2420-mw2451 do have unncecesarry raid controllers (configured) - https://phabricator.wikimedia.org/T358489#9576368 (10JMeybohm) [13:59:29] (03Merged) 10jenkins-bot: LiftWing: add missing entry for article-desc certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006528 (https://phabricator.wikimedia.org/T358467) (owner: 10Klausman) [13:59:40] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:59:43] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance [13:59:49] (03PS3) 10Clément Goubert: ferm: Check ferm.service status in ferm_status.py [puppet] - 10https://gerrit.wikimedia.org/r/1005978 (https://phabricator.wikimedia.org/T354855) [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC afternoon backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240226T1400). 
[14:00:05] Sohom_Datta, Dreamy_Jazz, and claime: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:12] \o [14:00:17] o/ [14:00:18] Here, can self deploy [14:00:18] I can deploy the patch I've added [14:00:47] 10ops-codfw, 10DC-Ops, 10serviceops: mw2420-mw2451 do have unncecesarry raid controllers (configured) - https://phabricator.wikimedia.org/T358489#9576376 (10MatthewVernon) If you do decide you might want to reprovision these nodes as non-RAID, there is a [[ https://gerrit.wikimedia.org/r/plugins/gitiles/oper... [14:01:03] Dreamy_Jazz: I don’t think it’s a good idea to deploy a Beta change at the moment tbh, given T358329 and related issues [14:01:04] T358329: beta-update-databases-eqiad job times out / beta databases are having issues - https://phabricator.wikimedia.org/T358329 [14:01:10] (03PS7) 10FNegri: elasticsearch: move to opensearch client [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [14:01:20] even if the config change is probably unrelated [14:01:35] The config change is completely unrelated [14:02:02] its unlikely you'll be able to test it fwiw [14:02:23] last time I looked, some log entries weren't saving [14:02:37] I just need to load Special:Block to test it. [14:02:45] (in theory) [14:03:34] o/ (I think I missed the start of the deployments) [14:03:38] :) [14:03:48] I'll go ahead and deploy my patch while you decide if that's all right? [14:04:00] I was going to start with Sohom_Datta [14:04:09] ah, go ahead Lucas_WMDE [14:04:11] :) [14:04:56] although tbh I’m not a huge fan of the “remove stuff from all wikisources without asking and see if they scream” pattern suggested in https://phabricator.wikimedia.org/T358437#9575252 [14:05:43] actually, was this even discussed on enwikisource? 
or is it just a suggestion by Xover as an individual? [14:06:03] Dreamy_Jazz: (otoh, it's a pretty low-impact patch, even if it turns out to not be testable) [14:06:51] Yeah, my feeling was that it was low impact enough and required no writes using on-wiki interfaces to test, so as long as I could load a Special:Block page it would be fine. [14:07:01] doesn’t look like it was ever discussed on enwikisource either https://en.wikisource.org/w/index.php?search=T358437&title=Special%3ASearch&profile=advanced&fulltext=1&ns0=1&ns1=1&ns2=1&ns3=1&ns4=1&ns5=1&ns6=1&ns7=1&ns8=1&ns9=1&ns10=1&ns11=1&ns12=1&ns13=1&ns14=1&ns15=1&ns100=1&ns101=1&ns102=1&ns103=1&ns104=1&ns105=1&ns106=1&ns107=1&ns114=1&ns115=1&ns710=1&ns711=1&ns828=1&ns829=1&ns2300=1&ns2301=1&ns2302=1&ns2303=1 [14:07:02] T358437: Undeploy Collection extension (BookMaker) from English Wikisource and other possibly interested Wikisource language projects - https://phabricator.wikimedia.org/T358437 [14:07:31] I don't know if it was, but the functionality is very broken and not really used by the community [14:07:31] (03CR) 10CI reject: [V: 04-1] elasticsearch: move to opensearch client [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [14:08:02] As for loading Special:Block, it works for a wiki that I could test this change on https://en.wikipedia.beta.wmflabs.org/wiki/Special:Block [14:08:14] But happy to not deploy if others are against it. [14:08:21] I asked about it on the Wikisource Telegram and there wasn't anyone who complained [14:09:09] My understanding is that most peeps use the Wikisource extension's "Download book" functionality [14:09:18] Instead of this [14:09:43] 10ops-codfw, 10DC-Ops, 10serviceops: mw2420-mw2451 do have unncecesarry raid controllers (configured) - https://phabricator.wikimedia.org/T358489#9576400 (10RobH) Moritz asked me about this, and I have some background. 
So orders placed in January 2023 via the dell portal for standard configs also included a... [14:09:47] (03PS8) 10FNegri: elasticsearch: move to opensearch client [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [14:10:36] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:11:01] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance [14:11:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T357189)', diff saved to https://phabricator.wikimedia.org/P57922 and previous config saved to /var/cache/conftool/dbconfig/20240226-141107-arnaudb.json [14:11:09] 10ops-codfw, 10DC-Ops, 10serviceops: mw2420-mw2451 do have unncecesarry raid controllers (configured) - https://phabricator.wikimedia.org/T358489#9576401 (10RobH) I'm told there is a question on 'can we pull these raid controllers to use elsewhere' and the answer is 'no, or the host you remove it from has no... 
[14:11:13] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [14:11:33] Dreamy_Jazz: my 2c are for go ahead but be aware — it's not my deploy window though so :-) [14:11:59] (03PS9) 10KartikMistry: Enable Section Translation on newly created Wikipedias by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995176 (https://phabricator.wikimedia.org/T298235) [14:12:31] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006164 (https://phabricator.wikimedia.org/T358437) (owner: 10Sohom Datta) [14:12:39] (03CR) 10CI reject: [V: 04-1] Enable Section Translation on newly created Wikipedias by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995176 (https://phabricator.wikimedia.org/T298235) (owner: 10KartikMistry) [14:12:41] (03PS9) 10Slyngshede: C:prometheus::process_exporter Add a simplistic process exporter. [puppet] - 10https://gerrit.wikimedia.org/r/1004672 [14:13:20] (03Merged) 10jenkins-bot: Remove the Collection extension from wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006164 (https://phabricator.wikimedia.org/T358437) (owner: 10Sohom Datta) [14:14:04] * Lucas_WMDE watches scap pull a bunch of git repos [14:14:11] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:1006164|Remove the Collection extension from wikisource (T358437)]] [14:14:17] T358437: Undeploy Collection extension (BookMaker) from English Wikisource and other possibly interested Wikisource language projects - https://phabricator.wikimedia.org/T358437 [14:15:06] (03PS10) 10KartikMistry: Enable Section Translation on newly created Wikipedias by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995176 (https://phabricator.wikimedia.org/T298235) [14:15:44] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and soda: Backport for [[gerrit:1006164|Remove the Collection 
extension from wikisource (T358437)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:16:07] (03CR) 10CI reject: [V: 04-1] elasticsearch: move to opensearch client [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [14:16:52] (03PS10) 10Slyngshede: C:prometheus::process_exporter Add a simplistic process exporter. [puppet] - 10https://gerrit.wikimedia.org/r/1004672 [14:17:28] Just tested it on the debug servers, appears to work :) [14:17:50] ok [14:17:52] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and soda: Continuing with sync [14:18:11] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1466/co" [puppet] - 10https://gerrit.wikimedia.org/r/1004672 (owner: 10Slyngshede) [14:20:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T357189)', diff saved to https://phabricator.wikimedia.org/P57923 and previous config saved to /var/cache/conftool/dbconfig/20240226-142046-arnaudb.json [14:20:53] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [14:23:23] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host sretest2004.wikimedia.org with OS bookworm [14:25:01] (03CR) 10KartikMistry: "Yes. Those are closed Wikipedias. Expected. I tried setting up 'closed' to false, but enabling 'wikipedia' to true is conflicting." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/995176 (https://phabricator.wikimedia.org/T298235) (owner: 10KartikMistry) [14:25:09] (03PS11) 10KartikMistry: Enable Section Translation on newly created Wikipedias by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995176 (https://phabricator.wikimedia.org/T298235) [14:26:01] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:1006164|Remove the Collection extension from wikisource (T358437)]] (duration: 11m 49s) [14:26:06] T358437: Undeploy Collection extension (BookMaker) from English Wikisource and other possibly interested Wikisource language projects - https://phabricator.wikimedia.org/T358437 [14:26:21] Dreamy_Jazz: if you want to deploy your change, go ahead [14:26:24] Thanks. [14:26:29] I will do that now. [14:26:41] (03PS9) 10FNegri: elasticsearch: move to opensearch client [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [14:27:26] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006185 (https://phabricator.wikimedia.org/T353495) (owner: 10Tim Starling) [14:28:25] (03Merged) 10jenkins-bot: beta: Re-enable partial action blocks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006185 (https://phabricator.wikimedia.org/T353495) (owner: 10Tim Starling) [14:29:04] claime: Patch deployed. Feel free to go ahead with yours now. 
[14:29:13] thanks Dreamy_Jazz :) [14:29:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cgoubert@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006497 (https://phabricator.wikimedia.org/T298265) (owner: 10Clément Goubert) [14:30:04] grmbl, needs rebase [14:30:18] (03PS2) 10Clément Goubert: Enable $wgLocalHTTPProxy on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006497 (https://phabricator.wikimedia.org/T298265) [14:31:45] (03CR) 10Ssingh: [C: 03+1] haproxy: enable extended logformat for cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1006489 (https://phabricator.wikimedia.org/T358105) (owner: 10Fabfur) [14:31:48] (03CR) 10Clément Goubert: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006497 (https://phabricator.wikimedia.org/T298265) (owner: 10Clément Goubert) [14:32:07] (03PS12) 10KartikMistry: Enable Section Translation on newly created Wikipedias by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995176 (https://phabricator.wikimedia.org/T298235) [14:33:32] (03PS10) 10FNegri: elasticsearch: move to opensearch client [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [14:33:43] (03CR) 10TrainBranchBot: "Approved by cgoubert@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006497 (https://phabricator.wikimedia.org/T298265) (owner: 10Clément Goubert) [14:34:24] (03Merged) 10jenkins-bot: Enable $wgLocalHTTPProxy on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006497 (https://phabricator.wikimedia.org/T298265) (owner: 10Clément Goubert) [14:34:40] !log cgoubert@deploy2002 Started scap: Backport for [[gerrit:1006497|Enable $wgLocalHTTPProxy on all wikis (T298265)]] [14:34:46] T298265: Have internal MediaWiki to MediaWiki HTTP requests use an envoyproxy on appservers - https://phabricator.wikimedia.org/T298265 [14:35:53] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P57924 and previous config saved to /var/cache/conftool/dbconfig/20240226-143553-arnaudb.json [14:36:06] !log cgoubert@deploy2002 cgoubert: Backport for [[gerrit:1006497|Enable $wgLocalHTTPProxy on all wikis (T298265)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:38:00] (03CR) 10Fabfur: [V: 03+1 C: 03+2] haproxy: enable extended logformat for cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1006489 (https://phabricator.wikimedia.org/T358105) (owner: 10Fabfur) [14:38:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:14] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good. Before rollout, let's disable Puppet on "C:profile::firewall::log::ferm" and then enable on a few initial hosts." [puppet] - 10https://gerrit.wikimedia.org/r/1005978 (https://phabricator.wikimedia.org/T354855) (owner: 10Clément Goubert) [14:39:53] !log cgoubert@deploy2002 cgoubert: Continuing with sync [14:40:06] (03CR) 10FNegri: "All tests are now passing." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [14:40:11] httpbb looks good on mwdebug bare metal, proceeding [14:42:02] !log depooled and deactivated puppet on cp4037 to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1006489 (T358105) [14:42:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:08] T358105: Change HAProxy log-format to support missing information - https://phabricator.wikimedia.org/T358105 [14:43:02] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1006525 (owner: 10Muehlenhoff) [14:43:38] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review, 10cloud-services-team (FY2023/2024-Q3-Q4): spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9576504 (10fnegri) a:03fnegri I have updated the patch by @dcaro (https... [14:43:46] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review, 10cloud-services-team (FY2023/2024-Q3-Q4): spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9576507 (10fnegri) [14:46:09] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [14:46:28] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [14:46:53] !log klausman@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [14:47:37] !log klausman@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [14:47:53] !log klausman@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[14:48:05] !log cgoubert@deploy2002 Finished scap: Backport for [[gerrit:1006497|Enable $wgLocalHTTPProxy on all wikis (T298265)]] (duration: 13m 24s) [14:48:10] T298265: Have internal MediaWiki to MediaWiki HTTP requests use an envoyproxy on appservers - https://phabricator.wikimedia.org/T298265 [14:48:11] !log klausman@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [14:48:12] (03PS4) 10Ayounsi: Add SameSite=Lax attribute to NetworkProbeLimit cookie [puppet] - 10https://gerrit.wikimedia.org/r/989457 (https://phabricator.wikimedia.org/T342624) [14:48:21] looking good :) [14:48:29] !log klausman@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [14:48:48] !log klausman@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [14:48:59] (SystemdUnitFailed) firing: (2) isc-dhcp-server.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:51:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P57925 and previous config saved to /var/cache/conftool/dbconfig/20240226-145059-arnaudb.json [14:51:24] !log fabfur@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp4037.ulsfo.wmnet,service=(cdn|ats-be) [14:51:36] !log UTC afternoon backport+config window done [14:51:43] !log repooled and reactivate puppet on cp4037 to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1006489 (T358105) [14:51:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:58] T358105: Change HAProxy log-format to support missing information - https://phabricator.wikimedia.org/T358105 [14:52:10] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' 
command on namespace 'article-descriptions' for release 'main' . [14:52:30] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [14:54:38] PROBLEM - Disk space on centrallog1002 is CRITICAL: DISK CRITICAL - free space: /srv 54280 MB (3% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/d/000000377/host-overview?var-server=centrallog1002&var-datasource=eqiad+prometheus/ops [14:56:45] (03PS1) 10Muehlenhoff: Remove debmonitor1002/2002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1006531 (https://phabricator.wikimedia.org/T241049) [14:58:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:02:55] !log Disabling meta-monitoring for the alert hosts - T333615 [15:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:18] T333615: Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 [15:03:30] (03CR) 10David Caro: "🎉 thanks" [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [15:06:06] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T357189)', diff saved to https://phabricator.wikimedia.org/P57926 and previous config saved to /var/cache/conftool/dbconfig/20240226-150606-arnaudb.json [15:06:09] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance [15:06:15] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [15:06:33] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
db1175.eqiad.wmnet with reason: Maintenance [15:06:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T357189)', diff saved to https://phabricator.wikimedia.org/P57927 and previous config saved to /var/cache/conftool/dbconfig/20240226-150639-arnaudb.json [15:09:41] (03CR) 10Ssingh: "The CI failures *seems* to be related to black and newlines. Let me know if I should fix those?" [software/conftool] - 10https://gerrit.wikimedia.org/r/1005694 (owner: 10Ssingh) [15:09:53] (03PS1) 10Ssingh: P:dns::auth:: switch confd .ssh/config back to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1006532 (https://phabricator.wikimedia.org/T347054) [15:11:15] !log denisse@cumin2002 START - Cookbook sre.hosts.reimage for host alert2001.wikimedia.org with OS bookworm [15:11:26] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615#9576595 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by denisse@cumin2002 for host alert2001.wikimedia.org with OS bookworm [15:12:27] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: mw2420-mw2451 do have unncecesarry raid controllers (configured) - https://phabricator.wikimedia.org/T358489#9576608 (10JMeybohm) >>! In T358489#9576376, @MatthewVernon wrote: > If you do decide you might want to reprovision these nodes as non-RAID, there is a [[... 
[15:12:58] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [15:13:04] (03PS2) 10Ssingh: P:dns::auth:: switch confd .ssh/config back to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1006532 (https://phabricator.wikimedia.org/T347054) [15:13:29] (SystemdUnitFailed) firing: (2) isc-dhcp-server.service on install2004:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:13:29] !log kevinbazira@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [15:13:49] !log kevinbazira@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [15:15:00] (03PS3) 10Ssingh: P:dns::auth:: switch confd .ssh/config back to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1006532 (https://phabricator.wikimedia.org/T347054) [15:15:05] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for es1036-40 - jclark@cumin1002" [15:15:36] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@b115452]: Deploy Refine job POC on test cluster - update 2 [15:15:48] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@b115452]: Deploy Refine job POC on test cluster - update 2 (duration: 00m 12s) [15:15:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for es1036-40 - jclark@cumin1002" [15:15:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:16:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T357189)', diff saved to https://phabricator.wikimedia.org/P57928 and previous config saved to 
/var/cache/conftool/dbconfig/20240226-151624-arnaudb.json [15:16:40] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [15:16:48] !log klausman@deploy2002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [15:17:05] !log klausman@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [15:18:22] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host es1035.mgmt.eqiad.wmnet with reboot policy FORCED [15:18:30] (JobUnavailable) firing: (2) Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:18:30] (03CR) 10Majavah: "This looks fine to me, what about the notrack rules below?" [puppet] - 10https://gerrit.wikimedia.org/r/1006525 (owner: 10Muehlenhoff) [15:20:12] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host es1036.mgmt.eqiad.wmnet with reboot policy FORCED [15:20:16] (03PS4) 10Ssingh: P:dns::auth:: switch confd .ssh/config back to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1006532 (https://phabricator.wikimedia.org/T347054) [15:20:35] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host es1037.mgmt.eqiad.wmnet with reboot policy FORCED [15:20:54] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host es1038.mgmt.eqiad.wmnet with reboot policy FORCED [15:21:03] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host es1039.mgmt.eqiad.wmnet with reboot policy FORCED [15:21:25] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1470/co" [puppet] - 10https://gerrit.wikimedia.org/r/1006532 (https://phabricator.wikimedia.org/T347054) 
(owner: 10Ssingh) [15:21:29] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host es1040.mgmt.eqiad.wmnet with reboot policy FORCED [15:23:25] !log klausman@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [15:23:30] (JobUnavailable) resolved: (2) Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:23:53] 10SRE, 10SRE Program Management, 10Logos: SRE needs a logo - https://phabricator.wikimedia.org/T312067#9576651 (10kamila) Inspired by some of the above: {F42149298} {F42149308} {F42149325} [15:25:09] !log aqu@deploy2002 Started deploy [airflow-dags/analytics_test@b115452]: Deploy Refine job POC on test cluster - update 3 [15:25:21] !log aqu@deploy2002 Finished deploy [airflow-dags/analytics_test@b115452]: Deploy Refine job POC on test cluster - update 3 (duration: 00m 12s) [15:25:58] !log klausman@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [15:26:52] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure: Connection errors from puppetmaster1002 to puppetdb - https://phabricator.wikimedia.org/T358187#9576669 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:27:10] !log klausman@deploy2002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . [15:28:12] !log klausman@deploy2002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'article-descriptions' for release 'main' . 
[15:30:17] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, 10Spicerack: Add a cookbook to safely deploy puppet changes - https://phabricator.wikimedia.org/T341442#9576681 (10joanna_borun) p:05Triage→03Medium [15:31:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P57929 and previous config saved to /var/cache/conftool/dbconfig/20240226-153131-arnaudb.json [15:31:35] 10SRE, 10Infrastructure-Foundations, 10Mail: Access to DMARCIAN - https://phabricator.wikimedia.org/T356920#9576691 (10joanna_borun) p:05Triage→03Medium [15:36:19] 10Puppet, 10Cloud-VPS, 10Infrastructure-Foundations, 10cloud-services-team: wmf_auto_restart_cron.service failing in Cloud VPS bookworm instances - https://phabricator.wikimedia.org/T358343#9576710 (10MoritzMuehlenhoff) p:05Triage→03Medium a:03MoritzMuehlenhoff [15:36:25] 10SRE, 10Infrastructure-Foundations, 10Mail: Integrations tests - https://phabricator.wikimedia.org/T358355#9576711 (10joanna_borun) p:05Triage→03Medium [15:37:04] (03PS2) 10Klausman: LiftWing: Decrease CPU request for article-desc isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006534 [15:38:57] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10User-aborrero: ACPI kernel failure on debian installer last step - https://phabricator.wikimedia.org/T357896#9576723 (10aborrero) p:05Triage→03Low I haven't checked if the server has the latest firmware updates issued by Dell. Out of cautio... 
[15:41:02] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host es1038.mgmt.eqiad.wmnet with reboot policy FORCED [15:41:28] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host es1038.mgmt.eqiad.wmnet with reboot policy FORCED [15:43:12] (03CR) 10FNegri: elasticsearch: move to opensearch client (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [15:44:45] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host es1038.mgmt.eqiad.wmnet with reboot policy FORCED [15:45:19] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host es1038.mgmt.eqiad.wmnet with reboot policy FORCED [15:46:24] 10SRE, 10Infrastructure-Foundations, 10Mail, 10Patch-For-Review, 10User-MoritzMuehlenhoff: Consider Postfix as MTA for our MXes (and OTRS/Mailman/Phab) - https://phabricator.wikimedia.org/T232343#9576754 (10jhathaway) a:03jhathaway [15:46:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P57930 and previous config saved to /var/cache/conftool/dbconfig/20240226-154637-arnaudb.json [15:46:56] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack: gNMI module in Spicerack - https://phabricator.wikimedia.org/T344325#9576756 (10ayounsi) a:03ayounsi [15:49:45] (03CR) 10Klausman: [C: 03+2] LiftWing: Decrease CPU request for article-desc isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006534 (owner: 10Klausman) [15:50:39] (03Merged) 10jenkins-bot: LiftWing: Decrease CPU request for article-desc isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006534 (owner: 10Klausman) [15:53:30] (JobUnavailable) firing: (2) Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets 
- https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:54:14] (03CR) 10BBlack: [C: 03+1] Revert "conftool: introduce schema and host file for dnsboxes" [puppet] - 10https://gerrit.wikimedia.org/r/1005693 (owner: 10Ssingh) [15:55:36] (03CR) 10BBlack: [C: 03+1] P:dns::auth:: switch confd .ssh/config back to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1006532 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [15:57:52] (03CR) 10Ssingh: [C: 03+2] Revert "conftool: introduce schema and host file for dnsboxes" [puppet] - 10https://gerrit.wikimedia.org/r/1005693 (owner: 10Ssingh) [15:58:31] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1039.mgmt.eqiad.wmnet with reboot policy FORCED [15:58:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1037.mgmt.eqiad.wmnet with reboot policy FORCED [15:58:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1040.mgmt.eqiad.wmnet with reboot policy FORCED [15:59:17] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host es1036.mgmt.eqiad.wmnet with reboot policy FORCED [15:59:34] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1035.mgmt.eqiad.wmnet with reboot policy FORCED [16:00:17] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Figure out next steps for cergen in Puppet setup - https://phabricator.wikimedia.org/T357750#9576803 (10CDanis) Should this ticket really be "deprecate cergen"? 
:) [16:00:46] 10SRE, 10Infrastructure-Foundations, 10Puppet-Infrastructure, 10Patch-For-Review, 10Puppet (Puppet 7.0): Figure out next steps for cergen in Puppet setup - https://phabricator.wikimedia.org/T357750#9576804 (10MoritzMuehlenhoff) a:03MoritzMuehlenhoff [16:01:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T357189)', diff saved to https://phabricator.wikimedia.org/P57931 and previous config saved to /var/cache/conftool/dbconfig/20240226-160143-arnaudb.json [16:01:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [16:02:00] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:02:00] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance [16:02:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T357189)', diff saved to https://phabricator.wikimedia.org/P57932 and previous config saved to /var/cache/conftool/dbconfig/20240226-160206-arnaudb.json [16:02:40] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es1035'] [16:02:47] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es1036'] [16:02:52] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es1037'] [16:03:00] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es1036'] [16:03:00] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es1035'] [16:03:03] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es1037'] [16:04:04] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware 
for hosts ['es1036'] [16:04:10] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es1036'] [16:04:33] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es1035'] [16:04:44] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es1035'] [16:05:18] 10SRE, 10Infrastructure-Foundations, 10Packaging, 10serviceops: Package php-ast in {stretch,buster}-wikimedia/component - https://phabricator.wikimedia.org/T280210#9576817 (10joanna_borun) [16:05:26] 10SRE, 10Infrastructure-Foundations, 10Packaging, 10serviceops: Package php-ast in {stretch,buster}-wikimedia/component - https://phabricator.wikimedia.org/T280210#9576820 (10joanna_borun) @Reedy is it still valid? [16:09:34] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2004.wikimedia.org with OS bookworm [16:11:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T357189)', diff saved to https://phabricator.wikimedia.org/P57933 and previous config saved to /var/cache/conftool/dbconfig/20240226-161148-arnaudb.json [16:11:58] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:13:46] !log jclark@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['es1037'] [16:13:57] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['es1037'] [16:14:04] (03CR) 10Muehlenhoff: "The notrack support for nftables was only added recently and for the initial service which I applied it to, it had to be reverted, since t" [puppet] - 10https://gerrit.wikimedia.org/r/1006525 (owner: 10Muehlenhoff) [16:16:56] 10SRE, 10Infrastructure-Foundations, 10cloud-services-team, 10User-aborrero: ACPI kernel failure on debian installer last step - 
https://phabricator.wikimedia.org/T357896#9576914 (10MoritzMuehlenhoff) >>! In T357896#9576723, @aborrero wrote: > I haven't checked if the server has the latest firmware updates... [16:17:04] (03CR) 10Majavah: [C: 03+1] openstack::base::pdns::recursor::service: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/1006525 (owner: 10Muehlenhoff) [16:18:00] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host es1036.mgmt.eqiad.wmnet with reboot policy FORCED [16:19:14] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [16:20:39] !log vriley@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [16:22:56] !log etcd: purging /conftool/v1/dnsbox: old schema, deprecated: T347054 [16:23:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:23:02] T347054: Simplify maintenance of DNS/NTP hosts to reduce toil around reboots, reimages, and other work - https://phabricator.wikimedia.org/T347054 [16:23:24] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host logging-hd1001.mgmt.eqiad.wmnet with reboot policy FORCED [16:25:07] (03CR) 10Ssingh: [C: 03+2] conftool-data: add dnsbox hosts data [puppet] - 10https://gerrit.wikimedia.org/r/1006021 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [16:25:26] 10SRE, 10ops-codfw, 10DC-Ops, 10serviceops: mw2420-mw2451 do have unncecesarry raid controllers (configured) - https://phabricator.wikimedia.org/T358489#9576996 (10JMeybohm) >>! In T358489#9576608, @JMeybohm wrote: >>>! In T358489#9576376, @MatthewVernon wrote: >> If you do decide you might want to reprovi... 
[16:26:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P57935 and previous config saved to /var/cache/conftool/dbconfig/20240226-162655-arnaudb.json [16:27:01] 10SRE, 10SRE Observability (FY2023/2024-Q3): Fix the Alert hosts Puppet catalogue to be compatible with Puppet 7 - https://phabricator.wikimedia.org/T358506#9576999 (10andrea.denisse) [16:30:04] jan_drewniak: OwO what's this, a deployment window?? Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240226T1630). nyaa~ [16:30:55] !log denisse@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host alert2001.wikimedia.org with OS bookworm [16:31:04] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615#9577040 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by denisse@cumin2002 for host alert2001.wikimedia.org with OS bookworm executed with errors... 
[16:34:21] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [16:36:41] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt logging-hd1002 - vriley@cumin1002" [16:37:33] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt logging-hd1002 - vriley@cumin1002" [16:37:33] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:38:05] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [16:38:07] (03PS1) 10Andrea Denisse: Set the alert2001 to insetup for the Bookworm upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1006540 (https://phabricator.wikimedia.org/T333615) [16:38:19] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9577063 (10LSobanski) a:03Dzahn [16:39:29] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:39:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1038.mgmt.eqiad.wmnet with reboot policy FORCED [16:40:03] (03CR) 10Muehlenhoff: "Note that role::insetup::observability defaults to Puppet 7" [puppet] - 10https://gerrit.wikimedia.org/r/1006540 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [16:41:22] (03PS1) 10Ssingh: conftool-data: dnsbox: fix typo for authdns-ns2 [puppet] - 10https://gerrit.wikimedia.org/r/1006542 [16:42:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P57936 and previous config saved to /var/cache/conftool/dbconfig/20240226-164201-arnaudb.json [16:42:02] 10SRE, 10Infrastructure-Foundations, 10netops: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - 
https://phabricator.wikimedia.org/T358488#9577083 (10cmooney) Digging a little deeper on this the source IP of the packets hitting the install server don't really matter, what is mo... [16:42:09] (03CR) 10Muehlenhoff: "But you can use role::insetup::buster (even though his won't get instaled with Buster), which continues to default to Puppet 5." [puppet] - 10https://gerrit.wikimedia.org/r/1006540 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [16:42:18] (03CR) 10Andrea Denisse: "Noted, we'll add the --new and -p5 flags to the reimage." [puppet] - 10https://gerrit.wikimedia.org/r/1006540 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [16:43:03] !log vriley@cumin1002 START - Cookbook sre.dns.netbox [16:43:58] (03CR) 10Ssingh: [C: 03+2] conftool-data: dnsbox: fix typo for authdns-ns2 [puppet] - 10https://gerrit.wikimedia.org/r/1006542 (owner: 10Ssingh) [16:45:03] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host logging-hd1002.mgmt.eqiad.wmnet with reboot policy FORCED [16:46:00] !log vriley@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt logging-hd1003 - vriley@cumin1002" [16:46:39] !log sukhe@puppetmaster1001 conftool action : set/weight=100; selector: cluster=dnsbox [16:46:51] !log vriley@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: update mgmt logging-hd1003 - vriley@cumin1002" [16:46:51] !log vriley@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:47:19] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:dns::auth:: switch confd .ssh/config back to Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1006532 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [16:47:44] (03PS1) 10Htriedman: update page redaction list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006544 [16:47:50] !log disable 
puppet on A:dns-rec to merge CR 1006532 [16:47:52] (03CR) 10CI reject: [V: 04-1] update page redaction list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006544 (owner: 10Htriedman) [16:47:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:11] (03PS1) 10Volans: sre.hosts.reimage: support also Icinga hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1006545 [16:48:33] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host logging-hd1003.mgmt.eqiad.wmnet with reboot policy FORCED [16:49:06] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#9577115 (10Clement_Goubert) [16:49:22] (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1006545 (owner: 10Volans) [16:49:26] (03PS2) 10Htriedman: update page redaction list [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006544 [16:49:30] (03Abandoned) 10Andrea Denisse: Set the alert2001 to insetup for the Bookworm upgrade [puppet] - 10https://gerrit.wikimedia.org/r/1006540 (https://phabricator.wikimedia.org/T333615) (owner: 10Andrea Denisse) [16:50:30] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Migrate internal traffic to k8s - https://phabricator.wikimedia.org/T333120#8728141 (10Clement_Goubert) [16:52:40] (03CR) 10Volans: [C: 03+2] sre.hosts.reimage: support also Icinga hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1006545 (owner: 10Volans) [16:54:25] !log re-enable Puppet on A:dns-rec and run agent [16:54:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:11] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9577168 (10LSobanski) p:05Triage→03High [16:55:12] !log sudo cumin 'A:dns-rec and not P{dns6001*}' "run-puppet-agent --enable 'merging CR'" [16:55:19] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:46] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-hd1001.mgmt.eqiad.wmnet with reboot policy FORCED [16:56:31] (03Merged) 10jenkins-bot: sre.hosts.reimage: support also Icinga hosts [cookbooks] - 10https://gerrit.wikimedia.org/r/1006545 (owner: 10Volans) [16:57:00] (03PS1) 10Brouberol: superset: assign Superset roles from LDAP groups [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006547 (https://phabricator.wikimedia.org/T297120) [16:57:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T357189)', diff saved to https://phabricator.wikimedia.org/P57937 and previous config saved to /var/cache/conftool/dbconfig/20240226-165707-arnaudb.json [16:57:10] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [16:57:14] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [16:57:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance [16:57:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T357189)', diff saved to https://phabricator.wikimedia.org/P57938 and previous config saved to /var/cache/conftool/dbconfig/20240226-165730-arnaudb.json [16:59:27] (03PS1) 10Ssingh: P:dns::update::account: add .ssh/config file [puppet] - 10https://gerrit.wikimedia.org/r/1006548 (https://phabricator.wikimedia.org/T347054) [16:59:53] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host es1036.mgmt.eqiad.wmnet with reboot policy FORCED [17:00:35] !log denisse@cumin2002 START - Cookbook sre.hosts.reimage for host alert2001.wikimedia.org with OS bookworm [17:00:44] 10SRE, 10Patch-For-Review, 10SRE Observability (FY2023/2024-Q3): Upgrade alert* hosts to Bookworm - 
https://phabricator.wikimedia.org/T333615#9577208 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by denisse@cumin2002 for host alert2001.wikimedia.org with OS bookworm [17:00:47] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1471/co" [puppet] - 10https://gerrit.wikimedia.org/r/1006548 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:03:10] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host logging-hd1002.mgmt.eqiad.wmnet with reboot policy FORCED [17:04:03] !log vriley@cumin1002 START - Cookbook sre.hosts.provision for host logging-hd1002.mgmt.eqiad.wmnet with reboot policy FORCED [17:04:57] !log vriley@cumin1002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['logging-hd1001'] [17:05:24] !log vriley@cumin1002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['logging-hd1001'] [17:07:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T357189)', diff saved to https://phabricator.wikimedia.org/P57939 and previous config saved to /var/cache/conftool/dbconfig/20240226-170712-arnaudb.json [17:07:19] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [17:08:28] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:dns::update::account: add .ssh/config file [puppet] - 10https://gerrit.wikimedia.org/r/1006548 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:09:03] (03CR) 10DCausse: [C: 03+2] cirrus: Add script to orchestrate reindexing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005635 (https://phabricator.wikimedia.org/T356303) (owner: 10Ebernhardson) [17:10:07] (03Merged) 10jenkins-bot: cirrus: Add script to orchestrate reindexing [deployment-charts] - 10https://gerrit.wikimedia.org/r/1005635 
(https://phabricator.wikimedia.org/T356303) (owner: 10Ebernhardson) [17:14:33] (03PS1) 10Ssingh: P:dns::auth::update: revert to using ferm rules from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1006549 (https://phabricator.wikimedia.org/T347054) [17:16:20] (03PS2) 10Ssingh: P:dns::auth::update: revert to using ferm rules from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1006549 (https://phabricator.wikimedia.org/T347054) [17:16:40] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-hd1003.mgmt.eqiad.wmnet with reboot policy FORCED [17:17:39] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability: Q#:rack/setup/install logging-hd100[123] - https://phabricator.wikimedia.org/T355700#9577275 (10VRiley-WMF) [17:18:30] (JobUnavailable) resolved: (2) Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [17:18:33] !log denisse@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on alert2001.wikimedia.org with reason: host reimage [17:22:19] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P57940 and previous config saved to /var/cache/conftool/dbconfig/20240226-172218-arnaudb.json [17:22:51] !log denisse@cumin2002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 2:00:00 on alert2001.wikimedia.org with reason: host reimage [17:30:30] PROBLEM - Host wikitech-static.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100% [17:31:17] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1035.eqiad.wmnet with OS bookworm [17:31:22] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9577385 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1035.eqiad.wmnet with OS bookworm [17:32:42] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1036.eqiad.wmnet with OS bookworm [17:32:47] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9577396 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1036.eqiad.wmnet with OS bookworm [17:32:57] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1037.eqiad.wmnet with OS bookworm [17:33:07] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9577403 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1037.eqiad.wmnet with OS bookworm [17:33:12] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1038.eqiad.wmnet with OS bookworm [17:33:16] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1039.eqiad.wmnet with OS bookworm [17:33:18] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9577404 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1038.eqiad.wmnet with OS bookworm [17:33:18] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1040.eqiad.wmnet with OS bookworm [17:33:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9577405 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1039.eqiad.wmnet with OS bookworm [17:33:32] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9577406 (10ops-monitoring-bot) Cookbook 
cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1040.eqiad.wmnet with OS bookworm [17:35:56] !log Enabled meta-monitoring for alert1001 - T333615 [17:36:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:36:09] T333615: Upgrade alert* hosts to Bookworm - https://phabricator.wikimedia.org/T333615 [17:37:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P57941 and previous config saved to /var/cache/conftool/dbconfig/20240226-173725-arnaudb.json [17:38:35] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [17:38:42] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [17:39:14] (03CR) 10BBlack: [C: 03+1] P:dns::auth::update: revert to using ferm rules from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1006549 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:41:00] (03CR) 10Ssingh: [C: 03+2] P:dns::auth::update: revert to using ferm rules from Puppet [puppet] - 10https://gerrit.wikimedia.org/r/1006549 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:41:59] !log vriley@cumin1002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host logging-hd1002.mgmt.eqiad.wmnet with reboot policy FORCED [17:43:29] 10SRE, 10Infrastructure-Foundations: Setup cumin1002 and eventually decom cumin1001 - https://phabricator.wikimedia.org/T353419#9577440 (10VRiley-WMF) [17:44:12] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations: Decommission cumin1001 - https://phabricator.wikimedia.org/T358328#9577438 (10VRiley-WMF) 05Open→03Resolved [17:44:15] (03PS1) 10Ssingh: P:dns::auth::upate: absent confd management of ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/1006554 (https://phabricator.wikimedia.org/T347054) [17:44:52] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations: Decommission cumin1001 - 
https://phabricator.wikimedia.org/T358328#9571567 (10VRiley-WMF) This has been completed [17:45:31] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1474/console" [puppet] - 10https://gerrit.wikimedia.org/r/1006554 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:46:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9577453 (10Jclark-ctr) [17:47:28] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [17:47:31] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9577454 (10Dzahn) I'll go with private IP but cloud VPS doesn't really seem feasible to me. 
[17:48:04] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:dns::auth::upate: absent confd management of ferm rules [puppet] - 10https://gerrit.wikimedia.org/r/1006554 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:51:31] !log running dummy authdns-update to confirm working ferm rules [17:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T357189)', diff saved to https://phabricator.wikimedia.org/P57942 and previous config saved to /var/cache/conftool/dbconfig/20240226-175231-arnaudb.json [17:52:35] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [17:52:41] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [17:52:48] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance [17:52:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:53:09] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance [17:53:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T357189)', diff saved to https://phabricator.wikimedia.org/P57943 and previous config saved to /var/cache/conftool/dbconfig/20240226-175315-arnaudb.json [17:54:19] (03PS1) 10Dzahn: site: add contint1003 with insetup::collab role [puppet] - 10https://gerrit.wikimedia.org/r/1006557 (https://phabricator.wikimedia.org/T358237) [17:56:23] (03CR) 10Dzahn: [C: 03+2] site: add contint1003 with insetup::collab role [puppet] - 10https://gerrit.wikimedia.org/r/1006557 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [17:56:44] 
(03PS1) 10Ssingh: P:dns::auth::update: update authdns-update for new confctl changes [puppet] - 10https://gerrit.wikimedia.org/r/1006558 (https://phabricator.wikimedia.org/T347054) [17:57:57] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (DIFF 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1006558 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [17:58:00] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [17:59:23] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240226T1800) [18:00:05] ryankemper: Time to do the Wikidata Query Service weekly deploy deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240226T1800). [18:00:58] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host contint1003.eqiad.wmnet [18:00:59] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [18:02:14] RECOVERY - Host wikitech-static.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 28.68 ms [18:03:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T357189)', diff saved to https://phabricator.wikimedia.org/P57944 and previous config saved to /var/cache/conftool/dbconfig/20240226-180321-arnaudb.json [18:03:37] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [18:04:29] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM contint1003.eqiad.wmnet - dzahn@cumin1002" [18:05:21] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM 
contint1003.eqiad.wmnet - dzahn@cumin1002" [18:05:21] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:05:21] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache contint1003.eqiad.wmnet on all recursors [18:05:22] (03CR) 10Htriedman: "adding reviewers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006544 (owner: 10Htriedman) [18:05:25] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) contint1003.eqiad.wmnet on all recursors [18:05:42] 10SRE, 10Infrastructure-Foundations, 10netops: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577561 (10cmooney) Juniper seem to document this scenario here, and advise using the "link-selection" keyword: https://www.juniper.net/do... [18:05:49] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM contint1003.eqiad.wmnet - dzahn@cumin1002" [18:06:40] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM contint1003.eqiad.wmnet - dzahn@cumin1002" [18:06:59] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host contint1003.eqiad.wmnet with OS bullseye [18:07:07] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10vm-requests, 10Patch-For-Review: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9577563 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host contint1003.... 
[18:07:23] !log dzahn@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host contint1003.eqiad.wmnet with OS bullseye [18:07:23] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host contint1003.eqiad.wmnet [18:07:32] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10vm-requests, 10Patch-For-Review: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9577564 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host contint1003.eqia... [18:07:48] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host contint1003.eqiad.wmnet [18:07:49] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [18:08:43] (03PS4) 10Andrew Bogott: OpenStack Designate: move from cloudservices to cloudcontrols in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/997965 (https://phabricator.wikimedia.org/T350995) [18:09:00] PROBLEM - Host sretest2004 is DOWN: PING CRITICAL - Packet loss = 100% [18:09:00] (03PS1) 10Ebernhardson: cirrus: Set env and release specific backfill values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006560 [18:09:24] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [18:09:32] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache contint1003.eqiad.wmnet on all recursors [18:09:35] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) contint1003.eqiad.wmnet on all recursors [18:09:38] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [18:11:01] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [18:11:02] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache contint1003.eqiad.wmnet on all recursors [18:11:05] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) contint1003.eqiad.wmnet on all recursors [18:11:12] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host 
contint1003.eqiad.wmnet [18:13:14] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host contint1003.eqiad.wmnet with OS bullseye [18:14:30] !log T357007 Running mwscript CampaignEvents:GenerateInvitationList --wiki=metawiki --listfile=/home/daimona/list.txt [18:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:43] T357007: Generate Invitation Lists for Event Organizers - https://phabricator.wikimedia.org/T357007 [18:16:14] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:16:21] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:16:28] (03PS1) 10EoghanGaffney: [gitlab] Pause/Prompt before restarting gitlab during upgrade [cookbooks] - 10https://gerrit.wikimedia.org/r/1006562 [18:17:02] (03CR) 10Slyngshede: [C: 03+2] Remove debmonitor1002/2002 from site.pp [puppet] - 10https://gerrit.wikimedia.org/r/1006531 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [18:17:14] (03CR) 10Slyngshede: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1006531 (https://phabricator.wikimedia.org/T241049) (owner: 10Muehlenhoff) [18:18:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P57945 and previous config saved to /var/cache/conftool/dbconfig/20240226-181827-arnaudb.json [18:18:53] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:18:59] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:19:28] 10SRE, 10Infrastructure-Foundations, 10netops: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577613 (10cmooney) After issuing a manual release of the IP and trying again things seem to be working as expected: ` cmooney@install2004:... 
[18:19:59] 10SRE, 10ops-eqiad, 10DBA, 10Wikimedia-Incident: db2118 crashed and rebooted due to HW - https://phabricator.wikimedia.org/T358421#9577608 (10wiki_willy) ++ @VRiley-WMF & @Jclark-ctr >>! In T358421#9574362, @Marostegui wrote: > @wiki_willy can we contact the vendor about this issue which caused a reboot?... [18:20:11] (03CR) 10Volans: [C: 03+1] "LGTM but I would like an explicit approval from Search as this module has always been under their care." [software/spicerack] - 10https://gerrit.wikimedia.org/r/966492 (https://phabricator.wikimedia.org/T345337) (owner: 10David Caro) [18:21:16] 10SRE-tools, 10Infrastructure-Foundations, 10Spicerack, 10Patch-For-Review, 10cloud-services-team (FY2023/2024-Q3-Q4): spicerack: tox fails to install PyYAML using python 3.11 on bookworm - https://phabricator.wikimedia.org/T345337#9577617 (10Volans) @fnegri Thanks a lot for resuming this and taking care... [18:23:11] 10SRE, 10Infrastructure-Foundations, 10netops: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577627 (10cmooney) So I think the solution is: # Add the "link-selection" command to the config on EVPN switches to add the IRB interface... 
[18:24:32] (03PS1) 10Bking: wdqs: add blackbox check for query.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/1006564 (https://phabricator.wikimedia.org/T358029) [18:25:42] (03CR) 10CI reject: [V: 04-1] wdqs: add blackbox check for query.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/1006564 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [18:28:08] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2004.wikimedia.org with OS bookworm [18:28:14] (03CR) 10Volans: [C: 03+1] "Code LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1006562 (owner: 10EoghanGaffney) [18:28:29] (03PS2) 10Bking: wdqs: add blackbox check for query.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/1006564 (https://phabricator.wikimedia.org/T358029) [18:29:27] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [18:29:34] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [18:32:54] (03CR) 10Ebernhardson: [C: 03+2] cirrus: Set env and release specific backfill values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006560 (owner: 10Ebernhardson) [18:33:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P57946 and previous config saved to /var/cache/conftool/dbconfig/20240226-183334-arnaudb.json [18:33:50] (03Merged) 10jenkins-bot: cirrus: Set env and release specific backfill values [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006560 (owner: 10Ebernhardson) [18:39:07] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1006564 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [18:41:49] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host es1036.eqiad.wmnet with OS bookworm [18:42:00] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage 
(exit_code=93) for host es1035.eqiad.wmnet with OS bookworm [18:42:06] !log cmooney@cumin1002 START - Cookbook sre.hosts.dhcp for host sretest2004.wikimedia.org [18:42:18] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host es1037.eqiad.wmnet with OS bookworm [18:42:21] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host es1040.eqiad.wmnet with OS bookworm [18:42:25] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host es1039.eqiad.wmnet with OS bookworm [18:42:34] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host es1038.eqiad.wmnet with OS bookworm [18:42:47] 10SRE, 10Wikimedia-Mailing-lists, 10Hindi-Sites: Adminship of Hindi Wikipedia Mailing List - https://phabricator.wikimedia.org/T73388#9577680 (10Dzahn) I don't really see the value in moving tickets on workboards that have been resolved years ago. For me personally this created hundreds of notifications but n... [18:42:51] (03CR) 10BBlack: [C: 03+1] "Seems reasonable!" [puppet] - 10https://gerrit.wikimedia.org/r/1006558 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [18:43:14] (03CR) 10Nik Gkountas: [C: 04-1] "This change won't be enough for enabling SX on new wikis. 
We also have to add them inside the "SectionTranslationTargetLanguages" config p" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/995176 (https://phabricator.wikimedia.org/T298235) (owner: 10KartikMistry) [18:48:05] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.dhcp (exit_code=0) for host sretest2004.wikimedia.org [18:48:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T357189)', diff saved to https://phabricator.wikimedia.org/P57947 and previous config saved to /var/cache/conftool/dbconfig/20240226-184841-arnaudb.json [18:48:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [18:48:47] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [18:48:57] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance [18:49:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db1223 (T357189)', diff saved to https://phabricator.wikimedia.org/P57948 and previous config saved to /var/cache/conftool/dbconfig/20240226-184903-arnaudb.json [18:49:14] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2004.wikimedia.org with OS bookworm [18:51:12] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bookworm [18:51:17] 10SRE, 10Infrastructure-Foundations, 10netops: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577701 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2003.codfw.wmnet with... 
[18:54:14] !log jclark@cumin1002 START - Cookbook sre.dns.netbox [18:55:51] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host contint1003.eqiad.wmnet with OS bullseye [18:55:52] (03PS1) 10Ebernhardson: cirrus: Add ability to backfill all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006567 [18:56:08] !log jclark@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for es1036-40 - jclark@cumin1002" [18:59:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T357189)', diff saved to https://phabricator.wikimedia.org/P57949 and previous config saved to /var/cache/conftool/dbconfig/20240226-185907-arnaudb.json [18:59:14] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [18:59:25] (03PS1) 10Cathal Mooney: Use loopback for DHCP relay on single-ip EVPN anycast GWs [homer/public] - 10https://gerrit.wikimedia.org/r/1006568 (https://phabricator.wikimedia.org/T358488) [18:59:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: added network and mgmt for es1036-40 - jclark@cumin1002" [18:59:52] !log jclark@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:00:21] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host es1035 [19:00:31] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es1035 [19:00:35] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host es1036 [19:00:38] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host es1037 [19:00:41] (03PS1) 10Ebernhardson: cirrus: Deploy to all cloudelastic wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006569 
(https://phabricator.wikimedia.org/T358518) [19:00:42] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es1036 [19:00:44] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es1037 [19:00:48] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host es1038 [19:00:51] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host es1039 [19:00:55] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es1038 [19:00:56] can't create VM and can't decom it either. both fail [19:00:57] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es1039 [19:01:02] !log jclark@cumin1002 START - Cookbook sre.network.configure-switch-interfaces for host es1040 [19:01:08] !log jclark@cumin1002 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host es1040 [19:01:47] (03PS1) 10Ebernhardson: cirrus: Transition remaining cloudelastic wikis to streaming updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006570 (https://phabricator.wikimedia.org/T358518) [19:02:29] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host contint1003.eqiad.wmnet [19:02:30] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [19:02:36] (03CR) 10CI reject: [V: 04-1] cirrus: Transition remaining cloudelastic wikis to streaming updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006570 (https://phabricator.wikimedia.org/T358518) (owner: 10Ebernhardson) [19:04:16] !log T358237 - makevm cookbook was interrupted by accident. 
re-running it would create a second IP with the same DNS name, running decom cookbook also fails, stuck [19:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:04:25] T358237: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237 [19:04:57] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99) [19:05:00] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) for new host contint1003.eqiad.wmnet [19:05:10] (03PS2) 10Ebernhardson: cirrus: Enable consumer-cloudelastic writes to all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006569 (https://phabricator.wikimedia.org/T358518) [19:05:27] (03PS2) 10Ebernhardson: cirrus: Transition remaining cloudelastic wikis to streaming updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006570 (https://phabricator.wikimedia.org/T358518) [19:06:14] (03CR) 10CI reject: [V: 04-1] cirrus: Transition remaining cloudelastic wikis to streaming updater [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006570 (https://phabricator.wikimedia.org/T358518) (owner: 10Ebernhardson) [19:06:49] !log @deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [19:06:56] !log @deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:09:20] !log decom cookbook finishes with 0 but does not remove DNS record of virtual machine T358237 [19:09:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:26] T358237: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237 [19:09:57] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "sync - dzahn@cumin1002" [19:10:27] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:dns::auth::update: update authdns-update for new confctl changes [puppet] - 10https://gerrit.wikimedia.org/r/1006558 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) 
[19:10:50] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "sync - dzahn@cumin1002" [19:11:51] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [19:12:06] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2003.codfw.wmnet with reason: host reimage [19:13:15] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [19:13:29] (SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:14:15] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P57950 and previous config saved to /var/cache/conftool/dbconfig/20240226-191414-arnaudb.json [19:15:03] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2003.codfw.wmnet with reason: host reimage [19:15:46] !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=dns6002.wikimedia.org,service=authdns-update [19:21:40] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1035.eqiad.wmnet with OS bookworm [19:21:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9577860 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1035.eqiad.wmnet with OS bookworm [19:23:46] 10SRE, 10Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9577866 (10cmooney) p:05Low→03Medium Actually a different need to upgrade has now become clear, relating to the issue detailed in T358488 The solution to that requ... 
[19:26:31] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1037.eqiad.wmnet with OS bookworm [19:26:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9577876 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1037.eqiad.wmnet with OS bookworm [19:26:47] (03PS1) 10Ssingh: P:dns::auth::update: remove redundant conftool_prefix [puppet] - 10https://gerrit.wikimedia.org/r/1006573 (https://phabricator.wikimedia.org/T347054) [19:27:33] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577879 (10cmooney) >>! In T358488#9577627, @cmooney wrote: > # Add the "link-selection" command to the config on EVP... [19:28:09] (03CR) 10Ssingh: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1476/co" [puppet] - 10https://gerrit.wikimedia.org/r/1006573 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:29:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P57951 and previous config saved to /var/cache/conftool/dbconfig/20240226-192920-arnaudb.json [19:30:17] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2003.codfw.wmnet with OS bookworm [19:30:27] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577883 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host sretest2003... 
[19:30:41] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host sretest2004.wikimedia.org with OS bookworm [19:30:41] (03CR) 10Ssingh: [V: 03+1 C: 03+2] P:dns::auth::update: remove redundant conftool_prefix [puppet] - 10https://gerrit.wikimedia.org/r/1006573 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:31:22] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1036.eqiad.wmnet with OS bookworm [19:31:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9577886 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1036.eqiad.wmnet with OS bookworm [19:32:08] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1038.eqiad.wmnet with OS bookworm [19:32:12] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1039.eqiad.wmnet with OS bookworm [19:32:13] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1040.eqiad.wmnet with OS bookworm [19:32:14] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9577887 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1038.eqiad.wmnet with OS bookworm [19:32:17] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9577888 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1039.eqiad.wmnet with OS bookworm [19:32:19] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9577889 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1040.eqiad.wmnet with OS bookworm [19:33:00] (03PS2) 10Cathal Mooney: Use loopback for 
DHCP relay on single-ip EVPN anycast GWs [homer/public] - 10https://gerrit.wikimedia.org/r/1006568 (https://phabricator.wikimedia.org/T358488) [19:34:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10SRE Observability: Q#:rack/setup/install logging-hd100[123] - https://phabricator.wikimedia.org/T355700#9577891 (10VRiley-WMF) a:05Jclark-ctr→03VRiley-WMF [19:35:29] (03PS3) 10Bking: wdqs: add blackbox check for query.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/1006564 (https://phabricator.wikimedia.org/T358029) [19:36:03] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1006564 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [19:41:10] 10SRE, 10ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T357379#9577912 (10VRiley-WMF) @ayounsi for this issue, could we setup a time frame so this could troubleshoot this? Let us know, thanks! [19:43:59] !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=dns6001.wikimedia.org,service=authdns-update [19:44:27] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T357189)', diff saved to https://phabricator.wikimedia.org/P57952 and previous config saved to /var/cache/conftool/dbconfig/20240226-194427-arnaudb.json [19:44:30] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance [19:44:33] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [19:44:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance [19:45:13] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2003.codfw.wmnet with OS bookworm [19:45:21] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - 
https://phabricator.wikimedia.org/T358488#9577919 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest... [19:48:53] (03PS1) 10Ssingh: wikimedia-authdns.conf.tpl.erb: get the correct key path [puppet] - 10https://gerrit.wikimedia.org/r/1006576 (https://phabricator.wikimedia.org/T347054) [19:51:09] (03CR) 10Ebernhardson: [C: 03+1] wdqs: add blackbox check for query.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/1006564 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [19:52:39] (03CR) 10Ssingh: [C: 03+2] wikimedia-authdns.conf.tpl.erb: get the correct key path [puppet] - 10https://gerrit.wikimedia.org/r/1006576 (https://phabricator.wikimedia.org/T347054) (owner: 10Ssingh) [19:53:26] (03CR) 10Bking: [C: 03+2] wdqs: add blackbox check for query.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/1006564 (https://phabricator.wikimedia.org/T358029) (owner: 10Bking) [19:55:52] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [19:56:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [19:59:33] !log bblack@cumin1002 conftool action : set/pooled=yes; selector: cluster=dnsbox,service=authdns-update,name=dns6002.wikimedia.org [20:00:31] !log bblack@cumin1002 conftool action : set/pooled=no; selector: cluster=dnsbox,service=authdns-update,name=dns3001.wikimedia.org [20:00:50] !log cmooney@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on sretest2003.codfw.wmnet with reason: host reimage [20:01:28] !log bblack@cumin1002 conftool action : set/pooled=no; selector: cluster=dnsbox,service=authdns-update,name=dns3003.wikimedia.org [20:02:44] !log bblack@cumin1002 conftool action : set/pooled=yes; selector: cluster=dnsbox,service=authdns-update,name=dns3003.wikimedia.org [20:03:47] !log cmooney@cumin1002 END 
(PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on sretest2003.codfw.wmnet with reason: host reimage [20:07:11] !log bblack@cumin1002 conftool action : set/pooled=no; selector: cluster=dnsbox,service=authdns-update,name=dns3001.wikimedia.org [20:07:15] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [20:07:28] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2105.codfw.wmnet with reason: Maintenance [20:07:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2105 (T357189)', diff saved to https://phabricator.wikimedia.org/P57953 and previous config saved to /var/cache/conftool/dbconfig/20240226-200734-arnaudb.json [20:07:40] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [20:12:00] !log dzahn@cumin1002 START - Cookbook sre.ganeti.makevm for new host contint1004.eqiad.wmnet [20:12:02] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [20:13:24] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006579 (https://phabricator.wikimedia.org/T128546) [20:14:50] !log running dummy authdns-update [20:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:16:25] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM contint1004.eqiad.wmnet - dzahn@cumin1002" [20:17:16] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM contint1004.eqiad.wmnet - dzahn@cumin1002" [20:17:16] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [20:17:16] !log dzahn@cumin1002 START - Cookbook sre.dns.wipe-cache contint1004.eqiad.wmnet on all recursors [20:17:20] !log 
dzahn@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) contint1004.eqiad.wmnet on all recursors [20:17:30] (ProbeDown) firing: (6) Service wdqs2008:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:17:46] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM contint1004.eqiad.wmnet - dzahn@cumin1002" [20:18:29] (03PS1) 10Bking: Revert "wdqs: add blackbox check for query.wikidata.org" [puppet] - 10https://gerrit.wikimedia.org/r/1006311 [20:18:39] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM contint1004.eqiad.wmnet - dzahn@cumin1002" [20:18:55] !log cmooney@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host sretest2003.codfw.wmnet with OS bookworm [20:19:03] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Servers on public1-a-codfw and public1-b-codfw not getting DHCP during reimage - https://phabricator.wikimedia.org/T358488#9577971 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host sretest2003... 
[20:19:10] (03CR) 10Bking: [V: 03+2 C: 03+2] Revert "wdqs: add blackbox check for query.wikidata.org" [puppet] - 10https://gerrit.wikimedia.org/r/1006311 (owner: 10Bking) [20:19:12] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host contint1004.eqiad.wmnet with OS bullseye [20:19:19] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10vm-requests: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9577972 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by dzahn@cumin1002 for host contint1004.eqiad.wmnet with OS bu... [20:21:51] (03PS1) 10Dzahn: site: replace contint1003 with contint1004 [puppet] - 10https://gerrit.wikimedia.org/r/1006581 (https://phabricator.wikimedia.org/T358237) [20:22:30] (ProbeDown) firing: (2) Service wdqs1016:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:22:37] (03CR) 10Dzahn: [C: 03+2] site: replace contint1003 with contint1004 [puppet] - 10https://gerrit.wikimedia.org/r/1006581 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [20:30:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T357189)', diff saved to https://phabricator.wikimedia.org/P57954 and previous config saved to /var/cache/conftool/dbconfig/20240226-203038-arnaudb.json [20:30:52] (03CR) 10Gmodena: "Looks good. Only one nit: could you maybe make the commit message more explicit? 
This repo contains several modules, not just eventstreams" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006544 (owner: 10Htriedman) [20:30:52] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [20:37:57] (03CR) 10Ebernhardson: [C: 03+2] cirrus: Enable consumer-cloudelastic writes to all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006569 (https://phabricator.wikimedia.org/T358518) (owner: 10Ebernhardson) [20:38:50] (03Merged) 10jenkins-bot: cirrus: Enable consumer-cloudelastic writes to all wikis [deployment-charts] - 10https://gerrit.wikimedia.org/r/1006569 (https://phabricator.wikimedia.org/T358518) (owner: 10Ebernhardson) [20:41:55] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1035.eqiad.wmnet with OS bookworm [20:42:00] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9578053 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1035.eqiad.wmnet with OS bookworm executed with errors: - es1035 (**FAIL... [20:44:55] !log T358237 used the next hostname number,1004, to avoid the duplicate IP issue. makevm cookbook is at attempt 103/240 to detect a reboot of the VM and uptime just keeps going up. 
used the "gnt-instance console --show-cmd " trick to get a console despite https://phabricator.wikimedia.org/T309724 - was missing partman config [20:45:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:03] T358237: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237 [20:45:45] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P57955 and previous config saved to /var/cache/conftool/dbconfig/20240226-204544-arnaudb.json [20:46:45] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1037.eqiad.wmnet with OS bookworm [20:46:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9578128 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1037.eqiad.wmnet with OS bookworm executed with errors: - es1037 (**FAIL... [20:48:50] (03PS1) 10Dzahn: installserver: add partman config for contint100[34] [puppet] - 10https://gerrit.wikimedia.org/r/1006583 (https://phabricator.wikimedia.org/T358237) [20:51:37] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1036.eqiad.wmnet with OS bookworm [20:51:42] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9578151 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1036.eqiad.wmnet with OS bookworm executed with errors: - es1036 (**FAIL... 
[20:52:23] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1038.eqiad.wmnet with OS bookworm [20:52:26] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1039.eqiad.wmnet with OS bookworm [20:52:27] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9578152 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1038.eqiad.wmnet with OS bookworm executed with errors: - es1038 (**FAIL... [20:52:31] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9578153 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1039.eqiad.wmnet with OS bookworm executed with errors: - es1039 (**FAIL... [20:52:33] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host es1040.eqiad.wmnet with OS bookworm [20:52:37] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9578154 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jclark@cumin1002 for host es1040.eqiad.wmnet with OS bookworm executed with errors: - es1040 (**FAIL... [20:52:45] (ProbeDown) resolved: (2) Service wdqs1016:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:53:34] is there any brave enough to +2 my regression bugfix for TimedMediaHandler? 
:D https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TimedMediaHandler/+/1006131 [20:53:44] (03CR) 10Dzahn: [C: 03+2] installserver: add partman config for contint100[34] [puppet] - 10https://gerrit.wikimedia.org/r/1006583 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [20:55:34] bvibber: done :) [20:55:48] \o/ [20:55:50] thx ;) [20:56:23] * bvibber <- n00b dev ;) ;) [20:56:32] oh we have seen worth [20:56:34] worse [20:56:36] hahaha [20:56:52] remember that guy that was blindly doing s/"/'/g because "Performance?" [20:57:02] lolololololol [20:57:57] !log dzahn@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host contint1004.eqiad.wmnet with OS bullseye [20:57:58] !log dzahn@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host contint1004.eqiad.wmnet [20:58:00] (ProbeDown) firing: (2) Service wdqs1016:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:58:06] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10vm-requests, 10Patch-For-Review: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9578161 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by dzahn@cumin1002 for host contint1004.eqia... 
[20:58:21] !log dzahn@cumin1002 START - Cookbook sre.hosts.reimage for host contint1004.eqiad.wmnet with OS bullseye [20:59:15] (ProbeDown) resolved: (12) Service wdqs1011:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [20:59:32] bvibber: https://static-codereview.wikimedia.org/MediaWiki/4904.html :) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: That opportune time for a UTC late backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240226T2100). [21:00:05] bvibber and jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:12] oh beautiful, we have all that archived out still <3 [21:00:21] yeah [21:00:22] yay [21:00:42] gotta say i don't miss subversion [21:00:43] and we might still have a copy of the rcs repo that we used to track InitialiseSettings.php back in the old days [21:00:48] heh [21:00:51] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105', diff saved to https://phabricator.wikimedia.org/P57956 and previous config saved to /var/cache/conftool/dbconfig/20240226-210050-arnaudb.json [21:00:54] :D [21:01:31] hihi o/ [21:01:39] ohai [21:02:00] i can deploy - unless bvibber you are self-deploying? 
[21:02:18] I can self-deploy my patch after [21:02:19] cjming: nope i'm trying to stay out of the deploy business and leave it to those who won't type it wrong ;) [21:02:22] !log ebernhardson@deploy2002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:02:24] lol [21:02:29] !log ebernhardson@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:02:45] i'm barely comfortable having shell for my maintenance scripts hehe [21:02:58] thanks jan_drewniak - i'll do bvibber's patch and hand it over to you thereafter [21:03:01] thx [21:04:00] (ProbeDown) firing: (2) Service wdqs1016:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:04:15] (ProbeDown) resolved: (12) Service wdqs1011:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:05:12] bvibber: looks like you want to backport your patch? do you want to make that while your master patch is merging? [21:05:22] yeah lemme set those up [21:05:26] then maybe we can let Jan do his first since it's just config [21:05:27] did we branch .20 already? [21:05:44] oh - wait - it should just ride the train [21:05:50] so no need to backport [21:05:52] (03PS1) 10Bvibber: Fix regression in WebM transcodes breaking audio [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1006312 (https://phabricator.wikimedia.org/T358342) [21:05:53] right? [21:06:10] i think it gets cut later today or tomorrow? [21:06:22] i'd ideally deploy to commons asap :D [21:06:36] which i think will be on .19 until tomorrow? 
[21:06:50] ah - so you do want to backport to .19 then [21:06:53] yeah [21:07:13] ok backport is above ^ [21:07:16] !log dzahn@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on contint1004.eqiad.wmnet with reason: host reimage [21:07:58] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by cjming@deploy2002 using scap backport" [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1006312 (https://phabricator.wikimedia.org/T358342) (owner: 10Bvibber) [21:09:00] (ProbeDown) firing: (4) Service wdqs1016:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:09:14] hmm, i'm noticing quite a bit of `BackupDumper:499 PHP Notice: fwrite(): write of 124 bytes failed with errno=32 Broken pipe` in logs since about 20:10 UTC [21:10:24] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on contint1004.eqiad.wmnet with reason: host reimage [21:15:57] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2105 (T357189)', diff saved to https://phabricator.wikimedia.org/P57957 and previous config saved to /var/cache/conftool/dbconfig/20240226-211557-arnaudb.json [21:15:59] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [21:16:07] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [21:16:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance [21:16:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2109 (T357189)', diff saved to https://phabricator.wikimedia.org/P57958 and previous config saved to /var/cache/conftool/dbconfig/20240226-211619-arnaudb.json [21:19:59] (03PS1) 10Dzahn: 
contint: add shell access and cluster contacts to contint1004 [puppet] - 10https://gerrit.wikimedia.org/r/1006586 (https://phabricator.wikimedia.org/T358237) [21:22:09] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host contint1004.eqiad.wmnet with OS bullseye [21:25:12] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 5 others: Change default image thumbnail size - https://phabricator.wikimedia.org/T355914#9578211 (10Jdlrobson) Providing engineering perspective on behalf of the WMF web team, I agree that if we want to make this change in English we should d... [21:27:24] (03Merged) 10jenkins-bot: Fix regression in WebM transcodes breaking audio [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.19) - 10https://gerrit.wikimedia.org/r/1006312 (https://phabricator.wikimedia.org/T358342) (owner: 10Bvibber) [21:27:39] !log cjming@deploy2002 Started scap: Backport for [[gerrit:1006312|Fix regression in WebM transcodes breaking audio (T358342)]] [21:27:46] T358342: No sound in uploaded videos - https://phabricator.wikimedia.org/T358342 [21:29:00] (ProbeDown) resolved: (2) Service wdqs1016:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1016:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:29:04] !log cjming@deploy2002 cjming and bvibber: Backport for [[gerrit:1006312|Fix regression in WebM transcodes breaking audio (T358342)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:29:07] bvibber: testable? on mwdebug [21:30:01] it's in the job queues so i can't test until it hits the runners live :( [21:30:06] so we just gotta push :D [21:30:30] cool - syncing [21:30:33] thx! 
[21:30:36] !log cjming@deploy2002 cjming and bvibber: Continuing with sync [21:38:54] !log cjming@deploy2002 Finished scap: Backport for [[gerrit:1006312|Fix regression in WebM transcodes breaking audio (T358342)]] (duration: 11m 14s) [21:39:04] bvibber: should be live! [21:39:08] T358342: No sound in uploaded videos - https://phabricator.wikimedia.org/T358342 [21:39:09] woot lemme test [21:39:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T357189)', diff saved to https://phabricator.wikimedia.org/P57959 and previous config saved to /var/cache/conftool/dbconfig/20240226-213916-arnaudb.json [21:39:38] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [21:40:04] cjming: confirmed ok :D thanks! [21:40:17] bvibber: nice! yw :) [21:40:28] jan_drewniak: all yours [21:40:44] cjming: cool thanks! [21:41:52] (03CR) 10Dzahn: [C: 03+2] contint: add shell access and cluster contacts to contint1004 [puppet] - 10https://gerrit.wikimedia.org/r/1006586 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [21:42:33] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006579 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [21:43:56] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006579 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [21:46:08] (03PS1) 10Dzahn: contint: remove contint-docker group since it won't work without ci role [puppet] - 10https://gerrit.wikimedia.org/r/1006590 (https://phabricator.wikimedia.org/T358237) [21:46:55] (03PS2) 10Dzahn: contint: remove contint-docker group from contint1004 [puppet] - 10https://gerrit.wikimedia.org/r/1006590 (https://phabricator.wikimedia.org/T358237) [21:47:28] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - 
https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [21:48:29] PROBLEM - Host urldownloader1003 is DOWN: PING CRITICAL - Packet loss = 100% [21:49:21] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1029 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [21:50:40] (03CR) 10Dzahn: [C: 03+2] contint: remove contint-docker group from contint1004 [puppet] - 10https://gerrit.wikimedia.org/r/1006590 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [21:50:50] (03PS3) 10Dzahn: contint: remove contint-docker group from contint1004 [puppet] - 10https://gerrit.wikimedia.org/r/1006590 (https://phabricator.wikimedia.org/T358237) [21:51:19] (03CR) 10Btullis: [C: 03+1] "All fine by me, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/1003382 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [21:51:58] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/1003383 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [21:52:51] (SwaggerProbeHasFailures) firing: Not all openapi/swagger endpoints returned healthy - https://wikitech.wikimedia.org/wiki/Runbook#https://citoid.svc.eqiad.wmnet:4003 - https://grafana.wikimedia.org/d/_77ik484k/openapi-swagger-endpoint-state?var-site=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwaggerProbeHasFailures [21:53:02] (ProbeDown) firing: (2) Service urldownloader1003:8080 has failed probes (http_url_downloader_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Url-downloader - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:54:02] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:1006579| Bumping portals to master (T128546)]] (duration: 08m 26s) [21:54:08] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [21:54:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P57960 and previous config saved to /var/cache/conftool/dbconfig/20240226-215422-arnaudb.json [21:54:58] (03CR) 10Dzahn: [V: 03+2 C: 03+2] contint: remove contint-docker group from contint1004 [puppet] - 10https://gerrit.wikimedia.org/r/1006590 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [21:56:30] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1035.eqiad.wmnet with OS bookworm [21:56:35] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q3:rack/setup/install es10[35-40] - https://phabricator.wikimedia.org/T355269#9578319 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1035.eqiad.wmnet with OS bookworm [21:59:04] PROBLEM - BGP status on 
cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv4: Connect - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:00:05] Reedy, sbassett, Maryum, and manfredi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240226T2200). [22:00:08] (03PS5) 10Dzahn: add alert for planet content updates (last modified) [alerts] - 10https://gerrit.wikimedia.org/r/1003507 (https://phabricator.wikimedia.org/T353298) [22:00:25] (03CR) 10Dzahn: add alert for planet content updates (last modified) (032 comments) [alerts] - 10https://gerrit.wikimedia.org/r/1003507 (https://phabricator.wikimedia.org/T353298) (owner: 10Dzahn) [22:00:40] (03PS1) 10Jforrester: [BETA CLUSTER] List deployment-db14, new replica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006593 (https://phabricator.wikimedia.org/T358329) [22:00:42] PROBLEM - Host kubernetes2029 is DOWN: PING CRITICAL - Packet loss = 100% [22:00:43] (03PS1) 10Jforrester: [BETA CLUSTER] Drop deployment-db12 and deployment-db13 replicas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006594 (https://phabricator.wikimedia.org/T358329) [22:01:03] 10SRE, 10Continuous-Integration-Infrastructure, 10collaboration-services, 10vm-requests, 10Patch-For-Review: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9578337 (10Dzahn) 05Open→03In progress [22:01:14] (03CR) 10Dzahn: [C: 03+2] add alert for planet content updates (last modified) [alerts] - 10https://gerrit.wikimedia.org/r/1003507 (https://phabricator.wikimedia.org/T353298) (owner: 10Dzahn) [22:02:12] RECOVERY - Host kubernetes2029 is UP: PING OK - Packet loss = 0%, RTA = 30.30 ms [22:02:23] (03Merged) 10jenkins-bot: add alert for planet content updates (last modified) [alerts] - 10https://gerrit.wikimedia.org/r/1003507 (https://phabricator.wikimedia.org/T353298) (owner: 10Dzahn) 
[22:02:39] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:1006579| Bumping portals to master (T128546)]] (duration: 08m 37s) [22:02:48] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [22:04:25] (SystemdUnitFailed) firing: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:05:08] (03CR) 10Dzahn: [C: 03+2] "replaced by modules/profile/manifests/planet.pp: prometheus::blackbox::check::http { 'en.planet.wikimedia.org':" [puppet] - 10https://gerrit.wikimedia.org/r/1003098 (owner: 10Dzahn) [22:05:26] PROBLEM - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:05:46] (03PS2) 10Dzahn: icinga: delete monitoring class for planet [puppet] - 10https://gerrit.wikimedia.org/r/1003098 [22:06:35] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (2 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade - ryankemper@cumin2002 - T356651 [22:06:43] T356651: Rebuild and deploy textify plugin - https://phabricator.wikimedia.org/T356651 [22:08:40] (KubernetesRsyslogDown) firing: rsyslog on kubernetes2029:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2029 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:09:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P57961 and previous config saved to 
/var/cache/conftool/dbconfig/20240226-220928-arnaudb.json [22:11:41] (03CR) 10Dzahn: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/1003098 (owner: 10Dzahn) [22:12:05] (03CR) 10Ladsgroup: [C: 03+1] [BETA CLUSTER] Drop deployment-db12 and deployment-db13 replicas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006594 (https://phabricator.wikimedia.org/T358329) (owner: 10Jforrester) [22:12:09] (03CR) 10Ladsgroup: [C: 03+1] [BETA CLUSTER] List deployment-db14, new replica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006593 (https://phabricator.wikimedia.org/T358329) (owner: 10Jforrester) [22:14:19] !log jclark@cumin1002 START - Cookbook sre.hosts.provision for host es1036.mgmt.eqiad.wmnet with reboot policy FORCED [22:14:55] jouncebot: nowandnext [22:14:55] For the next 1 hour(s) and 45 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240226T2200) [22:14:55] In 4 hour(s) and 45 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240227T0300) [22:15:17] (03CR) 10Jforrester: [C: 03+2] [BETA CLUSTER] Drop deployment-db12 and deployment-db13 replicas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006594 (https://phabricator.wikimedia.org/T358329) (owner: 10Jforrester) [22:15:19] (03CR) 10Jforrester: [C: 03+2] [BETA CLUSTER] List deployment-db14, new replica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006593 (https://phabricator.wikimedia.org/T358329) (owner: 10Jforrester) [22:15:45] !log jclark@cumin1002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host es1036.mgmt.eqiad.wmnet with reboot policy FORCED [22:16:22] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 476 threshold =0.2 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, 
timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 764, active_shards: 1055, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 476, delayed_unassigned_shards: 0, number_of_pending_ [22:16:22] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 68.90920966688438 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:22] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 532 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 799, active_shards: 1068, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 532, delayed_unassigned_shards: 0, number_of_pendin [22:16:22] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 66.75 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:25] (03Merged) 10jenkins-bot: [BETA CLUSTER] List deployment-db14, new replica [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006593 (https://phabricator.wikimedia.org/T358329) (owner: 10Jforrester) [22:16:28] (03Merged) 10jenkins-bot: [BETA CLUSTER] Drop deployment-db12 and deployment-db13 replicas [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1006594 (https://phabricator.wikimedia.org/T358329) (owner: 10Jforrester) [22:16:28] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 476 threshold =0.2 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 764, active_shards: 1055, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 476, delayed_unassigned_shards: 0, number_of_pending_ [22:16:28] number_of_in_flight_fetch: 0, 
task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 68.90920966688438 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:28] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 532 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 799, active_shards: 1068, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 532, delayed_unassigned_shards: 0, number_of_pendin [22:16:28] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 66.75 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:30] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 476 threshold =0.2 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 764, active_shards: 1055, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 476, delayed_unassigned_shards: 0, number_of_pending_ [22:16:30] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 68.90920966688438 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:30] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 532 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 799, active_shards: 1068, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 532, delayed_unassigned_shards: 0, number_of_pendin [22:16:30] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 66.75 
https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:32] PROBLEM - ElasticSearch health check for shards on 9400 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch inactive shards 532 threshold =0.2 breach: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 799, active_shards: 1068, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 532, delayed_unassigned_shards: 0, number_of_pendin [22:16:32] 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 66.75 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:32] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch inactive shards 476 threshold =0.2 breach: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 764, active_shards: 1055, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 476, delayed_unassigned_shards: 0, number_of_pending_ [22:16:32] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 68.90920966688438 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:34] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1009 is CRITICAL: CRITICAL - elasticsearch inactive shards 508 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 761, active_shards: 1016, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 508, delayed_unassigned_shards: 0, number_of_pending_ [22:16:34] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 66.66666666666666 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:16:55] ^ Looking 
[22:17:10] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1006 is CRITICAL: CRITICAL - elasticsearch inactive shards 508 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 761, active_shards: 1016, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 508, delayed_unassigned_shards: 0, number_of_pending_ [22:17:10] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 66.66666666666666 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:10] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1008 is CRITICAL: CRITICAL - elasticsearch inactive shards 508 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 761, active_shards: 1016, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 508, delayed_unassigned_shards: 0, number_of_pending_ [22:17:10] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 66.66666666666666 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:17:10] PROBLEM - ElasticSearch health check for shards on 9600 on cloudelastic1010 is CRITICAL: CRITICAL - elasticsearch inactive shards 508 threshold =0.2 breach: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 761, active_shards: 1016, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 508, delayed_unassigned_shards: 0, number_of_pending_ [22:17:11] number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 66.66666666666666 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:18:24] !log ryankemper@cumin2002 END (ERROR) - Cookbook 
sre.elasticsearch.rolling-operation (exit_code=97) Operation.UPGRADE (2 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade - ryankemper@cumin2002 - T356651 [22:18:30] T356651: Rebuild and deploy textify plugin - https://phabricator.wikimedia.org/T356651 [22:18:34] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 761, active_shards: 1243, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 276, delayed_unassigned_shards: 0, number_of_pending_tasks: 5, number_of_in_ [22:18:34] tch: 0, task_max_waiting_in_queue_millis: 2608, active_shards_percent_as_number: 81.56167979002625 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:18:40] (KubernetesRsyslogDown) resolved: rsyslog on kubernetes2029:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubernetes2029 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [22:19:00] One cloudelastic host too many got restarted, cluster will be back to green status (from yellow) shortly [22:19:10] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 761, active_shards: 1363, relocating_shards: 0, initializing_shards: 8, unassigned_shards: 153, delayed_unassigned_shards: 0, number_of_pending_tasks: 4, number_of_in_ [22:19:10] tch: 0, task_max_waiting_in_queue_millis: 327, active_shards_percent_as_number: 89.43569553805774 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:19:10] RECOVERY - ElasticSearch 
health check for shards on 9600 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 761, active_shards: 1366, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 153, delayed_unassigned_shards: 0, number_of_pending_tasks: 3, number_of_in_ [22:19:10] tch: 0, task_max_waiting_in_queue_millis: 118, active_shards_percent_as_number: 89.63254593175853 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:19:12] RECOVERY - ElasticSearch health check for shards on 9600 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-psi-eqiad: cluster_name: cloudelastic-psi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 761, active_shards: 1366, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 153, delayed_unassigned_shards: 0, number_of_pending_tasks: 3, number_of_in_ [22:19:12] tch: 0, task_max_waiting_in_queue_millis: 187, active_shards_percent_as_number: 89.63254593175853 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:19:22] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1029 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [22:19:22] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 764, active_shards: 1339, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 160, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in [22:19:22] etch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.45917700849118 
https://wikitech.wikimedia.org/wiki/Search%23Administration [22:19:22] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1009 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 799, active_shards: 1430, relocating_shards: 0, initializing_shards: 5, unassigned_shards: 165, delayed_unassigned_shards: 0, number_of_pending_tasks: 8, number_of [22:19:22] t_fetch: 0, task_max_waiting_in_queue_millis: 679, active_shards_percent_as_number: 89.375 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:19:30] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 764, active_shards: 1339, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 160, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in [22:19:30] etch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.45917700849118 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:19:30] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1008 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 799, active_shards: 1444, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 154, delayed_unassigned_shards: 0, number_of_pending_tasks: 2, number_of [22:19:30] t_fetch: 0, task_max_waiting_in_queue_millis: 4338, active_shards_percent_as_number: 90.25 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:19:30] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1010 
is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 764, active_shards: 1340, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 159, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in [22:19:30] etch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.52449379490528 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:19:32] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1010 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 799, active_shards: 1447, relocating_shards: 0, initializing_shards: 7, unassigned_shards: 146, delayed_unassigned_shards: 0, number_of_pending_tasks: 9, number_of [22:19:32] t_fetch: 0, task_max_waiting_in_queue_millis: 1027, active_shards_percent_as_number: 90.4375 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:19:32] RECOVERY - ElasticSearch health check for shards on 9400 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-omega-eqiad: cluster_name: cloudelastic-omega-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 799, active_shards: 1451, relocating_shards: 0, initializing_shards: 3, unassigned_shards: 146, delayed_unassigned_shards: 0, number_of_pending_tasks: 3, number_of [22:19:32] t_fetch: 0, task_max_waiting_in_queue_millis: 1418, active_shards_percent_as_number: 90.6875 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:19:33] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1006 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, 
number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 764, active_shards: 1340, relocating_shards: 0, initializing_shards: 32, unassigned_shards: 159, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in [22:19:33] etch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 87.52449379490528 https://wikitech.wikimedia.org/wiki/Search%23Administration [22:20:01] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1036.eqiad.wmnet with OS bookworm [22:24:28] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1035.eqiad.wmnet with reason: host reimage [22:24:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T357189)', diff saved to https://phabricator.wikimedia.org/P57962 and previous config saved to /var/cache/conftool/dbconfig/20240226-222435-arnaudb.json [22:24:37] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [22:24:46] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [22:25:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance [22:27:30] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1035.eqiad.wmnet with reason: host reimage [22:28:49] !log ryankemper@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on 6 hosts with reason: cloudelastic restart [22:29:13] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 6 hosts with reason: cloudelastic restart [22:38:06] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1040.eqiad.wmnet with OS bookworm [22:41:21] !log btullis@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host an-redacteddb1001.eqiad.wmnet with OS bookworm [22:42:09] !log 
btullis@cumin1002 START - Cookbook sre.hosts.reimage for host an-redacteddb1001.eqiad.wmnet with OS bookworm [22:42:48] !log on snapshot1010 killed PHP processes left over from kill -9 of python parents T358458 [22:42:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:42:53] T358458: 20240220 database backup dump appears stuck - https://phabricator.wikimedia.org/T358458 [22:42:55] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1036.eqiad.wmnet with reason: host reimage [22:45:35] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1036.eqiad.wmnet with reason: host reimage [22:45:38] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [22:45:52] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance [22:45:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2149 (T357189)', diff saved to https://phabricator.wikimedia.org/P57963 and previous config saved to /var/cache/conftool/dbconfig/20240226-224557-arnaudb.json [22:46:03] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [22:46:38] !log ryankemper@cumin2002 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade - ryankemper@cumin2002 - T356651 [22:46:43] T356651: Rebuild and deploy textify plugin - https://phabricator.wikimedia.org/T356651 [22:54:59] !log btullis@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on an-redacteddb1001.eqiad.wmnet with reason: host reimage [22:55:51] !log jclark@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on es1040.eqiad.wmnet with reason: host reimage [22:57:59] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
2:00:00 on an-redacteddb1001.eqiad.wmnet with reason: host reimage [23:00:22] !log jclark@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on es1040.eqiad.wmnet with reason: host reimage [23:04:25] (SystemdUnitFailed) resolved: httpbb_kubernetes_mw-parsoid_hourly.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:04:51] PROBLEM - BGP status on cr2-esams is CRITICAL: BGP CRITICAL - No response from remote host 185.15.59.129 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:05:31] RECOVERY - Check unit status of httpbb_kubernetes_mw-parsoid_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-parsoid_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [23:05:43] (CR) Cathal Mooney: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/1005450 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi) [23:06:28] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.UPGRADE (1 nodes at a time) for ElasticSearch cluster cloudelastic: cloudelastic plugin upgrade - ryankemper@cumin2002 - T356651 [23:06:47] T356651: Rebuild and deploy textify plugin - https://phabricator.wikimedia.org/T356651 [23:09:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T357189)', diff saved to https://phabricator.wikimedia.org/P57964 and previous config saved to /var/cache/conftool/dbconfig/20240226-230934-arnaudb.json [23:09:46] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [23:11:22] !log btullis@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002" [23:13:29]
(SystemdUnitFailed) firing: generate_os_reports.service on puppetdb2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:14:36] (PS3) Htriedman: eventstreams: update page redaction list [deployment-charts] - https://gerrit.wikimedia.org/r/1006544 (https://phabricator.wikimedia.org/T354456) [23:14:51] (CR) Htriedman: "done" [deployment-charts] - https://gerrit.wikimedia.org/r/1006544 (https://phabricator.wikimedia.org/T354456) (owner: Htriedman) [23:24:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P57965 and previous config saved to /var/cache/conftool/dbconfig/20240226-232443-arnaudb.json [23:26:16] !log btullis@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - btullis@cumin1002" [23:26:28] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host an-redacteddb1001.eqiad.wmnet with OS bookworm [23:26:59] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE (2024.02.12 - 2024.03.03): Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9578597 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS b... [23:27:20] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE (2024.02.12 - 2024.03.03): Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9578599 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by btullis@cumin1002 for host an-redacteddb1001.eqiad.wmnet with OS b...
[23:32:49] (ProbeDown) firing: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:33:45] (PS1) Btullis: Update the contacts for an-redacteddb1001 [puppet] - https://gerrit.wikimedia.org/r/1006600 (https://phabricator.wikimedia.org/T355571) [23:37:49] (ProbeDown) resolved: Service phab1004:443 has failed probes (http_phabricator_wikimedia_org_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#phab1004:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:39:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P57966 and previous config saved to /var/cache/conftool/dbconfig/20240226-233953-arnaudb.json [23:49:20] SRE, Traffic: Cannot edit wikipedia from my work computer - https://phabricator.wikimedia.org/T356799#9578640 (Dzahn) @Rijikk The footer would be right under the "If you report this error to the Wikimedia System Administrators, please include the details below." message you quoted in the error page itsel...
[23:55:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T357189)', diff saved to https://phabricator.wikimedia.org/P57967 and previous config saved to /var/cache/conftool/dbconfig/20240226-235500-arnaudb.json [23:55:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [23:55:07] T357189: Drop iwl_prefix_from_title from iwlinks - https://phabricator.wikimedia.org/T357189 [23:55:17] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance [23:55:20] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [23:55:34] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance [23:55:40] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depooling db2156 (T357189)', diff saved to https://phabricator.wikimedia.org/P57968 and previous config saved to /var/cache/conftool/dbconfig/20240226-235539-arnaudb.json [23:56:12] (CR) Btullis: [C: +2] Update the contacts for an-redacteddb1001 [puppet] - https://gerrit.wikimedia.org/r/1006600 (https://phabricator.wikimedia.org/T355571) (owner: Btullis) [23:59:07] RECOVERY - MD RAID on mw2442 is OK: OK: Active: 2, Working: 2, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [23:59:45] SRE, ops-eqiad, DC-Ops, Data-Platform-SRE (2024.02.12 - 2024.03.03), Patch-For-Review: Q#:rack/setup/install an-redacteddb1001 - https://phabricator.wikimedia.org/T355571#9578670 (BTullis) [23:59:53] !log jclark@cumin1002 START - Cookbook sre.hosts.reimage for host es1035.eqiad.wmnet with OS bookworm [23:59:59] SRE, ops-eqiad, DC-Ops, Data-Persistence: Q3:rack/setup/install es10[35-40] -
https://phabricator.wikimedia.org/T355269#9578671 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1002 for host es1035.eqiad.wmnet with OS bookworm