[00:26:46] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:29:48] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T352010)', diff saved to https://phabricator.wikimedia.org/P62807 and previous config saved to /var/cache/conftool/dbconfig/20240522-002948-ladsgroup.json [00:29:53] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [00:34:42] (03PS1) 10RLazarus: deployment_server: Rework mwscript_k8s flags [puppet] - 10https://gerrit.wikimedia.org/r/1034632 (https://phabricator.wikimedia.org/T341553) [00:34:51] (03PS1) 10RLazarus: deployment_server: Add --follow, --attach to mwscript_k8s [puppet] - 10https://gerrit.wikimedia.org/r/1034633 (https://phabricator.wikimedia.org/T341553) [00:36:46] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [00:44:56] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P62808 and previous config saved to /var/cache/conftool/dbconfig/20240522-004456-ladsgroup.json [01:00:04] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171', diff saved to https://phabricator.wikimedia.org/P62809 and previous config saved to /var/cache/conftool/dbconfig/20240522-010004-ladsgroup.json [01:07:50] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 127 probes of 734 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts 
https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:08:46] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 85 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:12:52] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 49 probes of 734 (alerts on 90) - https://atlas.ripe.net/measurements/59935539/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [01:13:46] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 9 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/59935536/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [01:15:13] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2171 (T352010)', diff saved to https://phabricator.wikimedia.org/P62810 and previous config saved to /var/cache/conftool/dbconfig/20240522-011512-ladsgroup.json [01:15:15] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [01:15:18] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [01:15:28] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2178.codfw.wmnet with reason: Maintenance [01:15:36] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2178 (T352010)', diff saved to 
https://phabricator.wikimedia.org/P62811 and previous config saved to /var/cache/conftool/dbconfig/20240522-011536-ladsgroup.json [01:19:30] (03Abandoned) 10BCornwall: Automated MarkMonitor domain sync [puppet] - 10https://gerrit.wikimedia.org/r/1032884 (owner: 10BCornwall) [01:25:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T364299)', diff saved to https://phabricator.wikimedia.org/P62812 and previous config saved to /var/cache/conftool/dbconfig/20240522-012529-marostegui.json [01:25:34] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [01:34:11] 06SRE, 10SRE-Access-Requests: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9819517 (10KFrancis) Hello @Dzahn The NDA is out for signatures. I'll confirm when it's complete. [01:39:42] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:40:02] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:40:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P62813 and previous config saved to /var/cache/conftool/dbconfig/20240522-014037-marostegui.json [01:41:02] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate sessionstore.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [01:41:34] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.236 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:41:52] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 
51923 bytes in 0.094 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [01:42:37] (03CR) 10Scott French: [C:03+1] "LGTM with one nit / question. Also +1 to using `--`, as it's really the only straightforward way to be sure to avoid consuming conflicting" [puppet] - 10https://gerrit.wikimedia.org/r/1034632 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [01:45:12] FIRING: [2x] CertAlmostExpired: Certificate for service sessionstore:8081 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore:8081 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [01:46:34] RECOVERY - snapshot of s6 in codfw on backupmon1001 is OK: Last snapshot for s6 at codfw (db2197) taken on 2024-05-22 01:03:37 (475 GiB, +0.5 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [01:55:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193', diff saved to https://phabricator.wikimedia.org/P62814 and previous config saved to /var/cache/conftool/dbconfig/20240522-015545-marostegui.json [02:10:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1193 (T364299)', diff saved to https://phabricator.wikimedia.org/P62815 and previous config saved to /var/cache/conftool/dbconfig/20240522-021053-marostegui.json [02:10:56] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [02:10:58] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [02:11:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1203.eqiad.wmnet with reason: Maintenance [02:11:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1203 (T364299)', diff saved to https://phabricator.wikimedia.org/P62816 and previous config saved to /var/cache/conftool/dbconfig/20240522-021116-marostegui.json [02:21:46] 
RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [02:36:46] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:01:46] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [03:16:12] FIRING: ProbeDown: Service kubemaster1001:6443 has failed probes (http_eqiad_kube_apiserver_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster1001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:21:12] RESOLVED: ProbeDown: Service kubemaster1001:6443 has failed probes (http_eqiad_kube_apiserver_ip6) #page - https://wikitech.wikimedia.org/wiki/Runbook#kubemaster1001:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [03:33:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T352010)', diff saved to https://phabricator.wikimedia.org/P62817 and previous config saved to /var/cache/conftool/dbconfig/20240522-033332-ladsgroup.json [03:33:37] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [03:48:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to 
https://phabricator.wikimedia.org/P62818 and previous config saved to /var/cache/conftool/dbconfig/20240522-034840-ladsgroup.json [03:49:46] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:50:02] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:54:58] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51925 bytes in 6.123 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [03:55:40] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8616 bytes in 0.391 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [04:03:22] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9819571 (10Roxette5) Hello to everyone! I work primarily on the hebrew wikisource site, mostly involved in the religious texts etc... [04:03:51] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178', diff saved to https://phabricator.wikimedia.org/P62819 and previous config saved to /var/cache/conftool/dbconfig/20240522-040349-ladsgroup.json [04:04:20] (03PS1) 10KartikMistry: Disable Section Translation on simplewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034642 (https://phabricator.wikimedia.org/T361597) [04:11:27] 10ops-codfw, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T365543 (10phaultfinder) 03NEW [04:16:46] 06SRE, 10DNS, 06Traffic, 10WikiLearn, 13Patch-For-Review: DNS records for WikiLearn - https://phabricator.wikimedia.org/T365435#9819594 (10Asaf) Yes, agreed. Our vendor made a mistake, and I pasted it verbatim. 
😅 [04:18:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2178 (T352010)', diff saved to https://phabricator.wikimedia.org/P62820 and previous config saved to /var/cache/conftool/dbconfig/20240522-041858-ladsgroup.json [04:19:01] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2192.codfw.wmnet with reason: Maintenance [04:19:03] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [04:19:14] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2192.codfw.wmnet with reason: Maintenance [04:19:22] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2192 (T352010)', diff saved to https://phabricator.wikimedia.org/P62821 and previous config saved to /var/cache/conftool/dbconfig/20240522-041922-ladsgroup.json [04:36:46] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:00:36] (03PS1) 10Abijeet Patro: SpecialNotifyTranslators: Fix group id in dropdown [extensions/TranslationNotifications] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034610 (https://phabricator.wikimedia.org/T253984) [05:03:10] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T352010)', diff saved to https://phabricator.wikimedia.org/P62822 and previous config saved to /var/cache/conftool/dbconfig/20240522-050310-ladsgroup.json [05:03:14] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [05:06:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1192.eqiad.wmnet with reason: Long schema change [05:07:01] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on 
db1192.eqiad.wmnet with reason: Long schema change [05:07:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1192 for a schema change', diff saved to https://phabricator.wikimedia.org/P62823 and previous config saved to /var/cache/conftool/dbconfig/20240522-050727-root.json [05:13:08] (03PS1) 10Marostegui: es2024: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1034644 [05:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [05:14:12] (03CR) 10Marostegui: [C:03+2] es2024: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1034644 (owner: 10Marostegui) [05:18:19] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P62824 and previous config saved to /var/cache/conftool/dbconfig/20240522-051818-ladsgroup.json [05:21:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1249', diff saved to https://phabricator.wikimedia.org/P62825 and previous config saved to /var/cache/conftool/dbconfig/20240522-052108-root.json [05:22:19] (03PS1) 10Marostegui: db1249: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1034645 [05:22:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1249.eqiad.wmnet with OS bookworm [05:22:55] (03CR) 10Marostegui: [C:03+2] db1249: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1034645 (owner: 10Marostegui) [05:25:51] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9819723 (10Marostegui) [05:26:22] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Q4:rack/setup/install dbproxy102[89] - https://phabricator.wikimedia.org/T365485#9819724 
(10Marostegui) Please remember these hosts must not have IPV6 dns records. [05:33:27] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208', diff saved to https://phabricator.wikimedia.org/P62826 and previous config saved to /var/cache/conftool/dbconfig/20240522-053326-ladsgroup.json [05:33:36] (03PS1) 10Marostegui: Revert "db1249: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1034614 [05:35:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1249.eqiad.wmnet with reason: host reimage [05:38:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1249.eqiad.wmnet with reason: host reimage [05:41:02] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate sessionstore.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [05:45:12] FIRING: [2x] CertAlmostExpired: Certificate for service sessionstore:8081 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore:8081 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [05:48:35] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2208 (T352010)', diff saved to https://phabricator.wikimedia.org/P62827 and previous config saved to /var/cache/conftool/dbconfig/20240522-054834-ladsgroup.json [05:48:37] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2220.codfw.wmnet with reason: Maintenance [05:48:39] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [05:48:50] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2220.codfw.wmnet with reason: Maintenance [05:48:58] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2220 (T352010)', diff saved to 
https://phabricator.wikimedia.org/P62828 and previous config saved to /var/cache/conftool/dbconfig/20240522-054857-ladsgroup.json [05:53:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62829 and previous config saved to /var/cache/conftool/dbconfig/20240522-055355-root.json [05:54:08] (03CR) 10Marostegui: [C:03+2] Revert "db1249: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1034614 (owner: 10Marostegui) [05:55:56] (03PS1) 10Marostegui: db1249: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1034789 [05:56:32] (03CR) 10Marostegui: [C:03+2] db1249: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1034789 (owner: 10Marostegui) [05:59:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1249.eqiad.wmnet with OS bookworm [06:00:57] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9819748 (10Fuzzy) I would like to point out that a similar issue arose a long time ago, regarding the length of custom signatures,... 
[06:08:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T352010)', diff saved to https://phabricator.wikimedia.org/P62830 and previous config saved to /var/cache/conftool/dbconfig/20240522-060814-ladsgroup.json [06:08:21] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [06:09:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P62831 and previous config saved to /var/cache/conftool/dbconfig/20240522-060901-root.json [06:18:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1249', diff saved to https://phabricator.wikimedia.org/P62832 and previous config saved to /var/cache/conftool/dbconfig/20240522-061806-root.json [06:19:42] !log Install 10.6.18 on db1249 T365338 [06:19:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:19:47] T365338: MariaDB 10.6.18 released - https://phabricator.wikimedia.org/T365338 [06:21:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P62833 and previous config saved to /var/cache/conftool/dbconfig/20240522-062103-root.json [06:23:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P62834 and previous config saved to /var/cache/conftool/dbconfig/20240522-062324-ladsgroup.json [06:30:45] o/ [06:35:50] (03PS1) 10Stevemunene: Setup kubeconfigs for datahub and datahub-next on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1034797 (https://phabricator.wikimedia.org/T363832) [06:36:11] (03PS2) 10Filippo Giunchedi: pki: add temporary profile for prometheus + k8s [puppet] - 10https://gerrit.wikimedia.org/r/1034048 (https://phabricator.wikimedia.org/T343529) [06:36:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 5%: Repooling', diff
saved to https://phabricator.wikimedia.org/P62835 and previous config saved to /var/cache/conftool/dbconfig/20240522-063610-root.json [06:36:30] (03CR) 10Filippo Giunchedi: pki: add temporary profile for prometheus + k8s (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1034048 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [06:38:33] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192', diff saved to https://phabricator.wikimedia.org/P62836 and previous config saved to /var/cache/conftool/dbconfig/20240522-063832-ladsgroup.json [06:41:23] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1247.eqiad.wmnet [06:42:29] (03PS1) 10Muehlenhoff: Switch db1247 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034798 (https://phabricator.wikimedia.org/T349619) [06:45:32] (03CR) 10Muehlenhoff: [C:03+2] Switch db1247 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034798 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [06:47:57] 06SRE-OnFire, 14SRE-Sprint-Week-Sustainability-March2023, 06DBA, 07Schema-change-in-production, 10Sustainability (Incident Followup): Adjust the field type of globalblocks timestamp columns to fixed binary on wmf wikis - https://phabricator.wikimedia.org/T307501#9819802 (10Marostegui) [06:51:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62837 and previous config saved to /var/cache/conftool/dbconfig/20240522-065117-root.json [06:52:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1247.eqiad.wmnet [06:53:25] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1248.eqiad.wmnet [06:53:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2192 (T352010)', diff saved to https://phabricator.wikimedia.org/P62838 and previous config saved to 
/var/cache/conftool/dbconfig/20240522-065340-ladsgroup.json [06:53:43] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2201.codfw.wmnet with reason: Maintenance [06:53:45] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [06:53:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2201.codfw.wmnet with reason: Maintenance [06:54:55] (03PS1) 10Muehlenhoff: Switch db1248 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034799 (https://phabricator.wikimedia.org/T349619) [06:57:05] (03CR) 10Muehlenhoff: [C:03+2] Switch db1248 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034799 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [06:57:58] (03PS1) 10Marostegui: db1154: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1034801 [06:58:22] (03CR) 10Marostegui: [C:03+2] db1154: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/1034801 (owner: 10Marostegui) [06:58:43] !log Reimage db1154 (sanitarium) there will be lag in s1, s3, s5 and s8 in wiki replicas [06:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:04] Amir1 and Urbanecm: Your horoscope predicts another UTC morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240522T0700). [07:00:05] kart_ and abijeet: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [07:01:25] In case a deployer has a spare cycle, backporting https://gerrit.wikimedia.org/r/c/mediawiki/extensions/LiquidThreads/+/1034182 would also be lovely! If not then I'll do that later today. Thanks! 
[07:01:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1248.eqiad.wmnet [07:02:21] Amir1, o/ [07:02:28] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host db1249.eqiad.wmnet [07:03:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db1154.eqiad.wmnet with OS bookworm [07:03:48] (03PS3) 10Stevemunene: Change datahub service to use dse ingress [puppet] - 10https://gerrit.wikimedia.org/r/1032399 (https://phabricator.wikimedia.org/T363450) [07:04:07] (03PS1) 10Muehlenhoff: Switch db1249 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034802 (https://phabricator.wikimedia.org/T349619) [07:06:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62839 and previous config saved to /var/cache/conftool/dbconfig/20240522-070624-root.json [07:07:38] abijeet: started deployment? [07:07:43] I'm a bit late. [07:08:08] kart_, no, deployment hasn't started [07:08:54] OK. I'll go ahead with the config patch. Feel free to +2 your patch meanwhile.
[07:09:16] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034642 (https://phabricator.wikimedia.org/T361597) (owner: 10KartikMistry) [07:09:53] (03Merged) 10jenkins-bot: Disable Section Translation on simplewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034642 (https://phabricator.wikimedia.org/T361597) (owner: 10KartikMistry) [07:10:49] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1034642|Disable Section Translation on simplewiki (T361597)]] [07:10:54] T361597: Fix the mobile experience for a second group of Wikipedias where Content Translation is in beta - https://phabricator.wikimedia.org/T361597 [07:13:46] !log kartik@deploy1002 kartik: Backport for [[gerrit:1034642|Disable Section Translation on simplewiki (T361597)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:17:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1154.eqiad.wmnet with reason: host reimage [07:17:20] !log kartik@deploy1002 kartik: Continuing with sync [07:20:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1154.eqiad.wmnet with reason: host reimage [07:20:57] jouncebot: refresh [07:20:58] I refreshed my knowledge about deployments. [07:21:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62840 and previous config saved to /var/cache/conftool/dbconfig/20240522-072130-root.json [07:21:42] jouncebot: refresh [07:21:43] I refreshed my knowledge about deployments. 
[07:21:46] jouncebot: now [07:21:46] For the next 0 hour(s) and 38 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240522T0700) [07:22:11] I have added https://gerrit.wikimedia.org/r/c/mediawiki/extensions/LiquidThreads/+/1034182 [07:22:21] I will deploy it once kart_ and abijeet_ are done :) [07:22:46] (03CR) 10Hashar: [C:03+2] SpecialNotifyTranslators: Fix group id in dropdown [extensions/TranslationNotifications] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034610 (https://phabricator.wikimedia.org/T253984) (owner: 10Abijeet Patro) [07:22:59] abijeet_: I +2ed your patch to kick in CI :) [07:23:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T364290 db1232', diff saved to https://phabricator.wikimedia.org/P62841 and previous config saved to /var/cache/conftool/dbconfig/20240522-072307-arnaudb.json [07:23:12] T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290 [07:23:49] !log installing nodejs security updates [07:23:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:24:00] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1232.eqiad.wmnet with reason: reimage [07:24:13] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1232.eqiad.wmnet with reason: reimage [07:24:39] (03CR) 10Ayounsi: [C:03+2] LibreNMS: add special case [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1034536 (https://phabricator.wikimedia.org/T364628) (owner: 10Ayounsi) [07:25:06] (03Merged) 10jenkins-bot: LibreNMS: add special case [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1034536 (https://phabricator.wikimedia.org/T364628) (owner: 10Ayounsi) [07:25:40] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox [07:25:46] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons 
on A:netbox [07:25:48] (03CR) 10Brouberol: [C:03+1] Setup kubeconfigs for datahub and datahub-next on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1034797 (https://phabricator.wikimedia.org/T363832) (owner: 10Stevemunene) [07:25:58] hashar, thanks! [07:26:00] 10ops-codfw, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Degraded RAID on backup2010 - https://phabricator.wikimedia.org/T365217#9819886 (10jcrespo) A disk was rebuilt on the 17 of May: ` seqNum: 0x00001227 Time: Fri May 17 04:07:00 2024 Code: 0x00000072 Class: 0 Locale: 0x02 Event... [07:26:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db1232.eqiad.wmnet with OS bookworm [07:28:14] (03CR) 10Brouberol: "I don't think this is necessary, as per our conversation yesterday. I think you should change the Apache Traffic Server (ATS) configuratio" [puppet] - 10https://gerrit.wikimedia.org/r/1032399 (https://phabricator.wikimedia.org/T363450) (owner: 10Stevemunene) [07:30:24] (03Merged) 10jenkins-bot: SpecialNotifyTranslators: Fix group id in dropdown [extensions/TranslationNotifications] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034610 (https://phabricator.wikimedia.org/T253984) (owner: 10Abijeet Patro) [07:30:37] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1034642|Disable Section Translation on simplewiki (T361597)]] (duration: 19m 47s) [07:30:42] T361597: Fix the mobile experience for a second group of Wikipedias where Content Translation is in beta - https://phabricator.wikimedia.org/T361597 [07:31:15] (03CR) 10Brouberol: "To go even further, we mentioned yesterday that datahub should be _removed_ from LVS, not adapted. 
This is due to the fact that `datahub.s" [puppet] - 10https://gerrit.wikimedia.org/r/1032399 (https://phabricator.wikimedia.org/T363450) (owner: 10Stevemunene) [07:31:33] (03PS1) 10Ayounsi: LibreNMS report: fix for server tech PDUs special case [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1034836 (https://phabricator.wikimedia.org/T364628) [07:32:22] !log installing postgresql-11 security updates [07:32:24] (03CR) 10Ayounsi: [C:03+2] LibreNMS report: fix for server tech PDUs special case [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1034836 (https://phabricator.wikimedia.org/T364628) (owner: 10Ayounsi) [07:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:32:52] (03Merged) 10jenkins-bot: LibreNMS report: fix for server tech PDUs special case [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1034836 (https://phabricator.wikimedia.org/T364628) (owner: 10Ayounsi) [07:32:56] @abijeet_ deploying your change.. [07:33:06] kart_, thanks! 
[07:33:22] !log ayounsi@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[07:33:28] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[07:33:29] !log kartik@deploy1002 Started scap: Backport for [[gerrit:1034610|SpecialNotifyTranslators: Fix group id in dropdown (T253984)]]
[07:33:34] T253984: Change dropdown to searchbox in NotifyTranslators - https://phabricator.wikimedia.org/T253984
[07:36:13] !log kartik@deploy1002 abi and kartik: Backport for [[gerrit:1034610|SpecialNotifyTranslators: Fix group id in dropdown (T253984)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:36:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62842 and previous config saved to /var/cache/conftool/dbconfig/20240522-073636-root.json
[07:36:40] (CR) Hashar: [C:+2] Fix fatal error due to missing signature on very old comments [extensions/LiquidThreads] (wmf/1.43.0-wmf.6) - https://gerrit.wikimedia.org/r/1034182 (https://phabricator.wikimedia.org/T365495) (owner: Jforrester)
[07:37:00] I will do that LiquidThreads fix once you have finished deployed
[07:39:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db1232.eqiad.wmnet with reason: host reimage
[07:39:05] (Merged) jenkins-bot: Fix fatal error due to missing signature on very old comments [extensions/LiquidThreads] (wmf/1.43.0-wmf.6) - https://gerrit.wikimedia.org/r/1034182 (https://phabricator.wikimedia.org/T365495) (owner: Jforrester)
[07:39:40] abijeet_: can you please test the patch on mwdebug server(s)?
[07:40:12] (CR) Muehlenhoff: [C:+2] Enable profile::auto_restarts::service for rsync/idp-test [puppet] - https://gerrit.wikimedia.org/r/1023817 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[07:40:37] kart_, checking
[07:41:31] (PS2) Muehlenhoff: Automatically restart memcached/mcrouter on idp-test nodes [puppet] - https://gerrit.wikimedia.org/r/1023838 (https://phabricator.wikimedia.org/T135991)
[07:41:34] (PS1) Effie Mouzeli: mediawiki::memcached: switch to running as user memcache [puppet] - https://gerrit.wikimedia.org/r/1034839 (https://phabricator.wikimedia.org/T273950)
[07:42:03] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1232.eqiad.wmnet with reason: host reimage
[07:42:44] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1154.eqiad.wmnet with OS bookworm
[07:42:57] kart_, Looks good.
[07:43:08] cool. going ahead..
[07:43:11] !log kartik@deploy1002 abi and kartik: Continuing with sync
[07:44:07] kart_, thanks!
[07:45:39] FIRING: CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[07:46:07] (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1023838 (https://phabricator.wikimedia.org/T135991) (owner: Muehlenhoff)
[07:50:28] Puppet, Wikidata, Wikidata Dev Team, wmde-wikidata-tech, and 2 others: Remove the WDCM clone (stats1007) - https://phabricator.wikimedia.org/T351072#9819960 (Manuel)
[07:51:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1249 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62843 and previous config saved to /var/cache/conftool/dbconfig/20240522-075142-root.json
[07:51:45] (PS1) Marostegui: Revert "db1154: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1034624
[07:52:19] (PS1) Slyngshede: TOTP API [software/bitu] - https://gerrit.wikimedia.org/r/1034840
[07:54:37] kart_: still syncing?
[07:54:44] (PS6) Slyngshede: Build Bitu contain image using Blubber. [software/bitu] - https://gerrit.wikimedia.org/r/1030743 (https://phabricator.wikimedia.org/T362318)
[07:54:48] hashar: yes. fpm-restarts..
[07:54:56] :-\
[07:56:10] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:1034610|SpecialNotifyTranslators: Fix group id in dropdown (T253984)]] (duration: 22m 42s)
[07:56:16] T253984: Change dropdown to searchbox in NotifyTranslators - https://phabricator.wikimedia.org/T253984
[07:56:34] hashar: done.
[07:56:41] abijeet_: patch is deployed.
[07:58:01] (03CR) 10Marostegui: [C:03+2] Revert "db1154: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1034624 (owner: 10Marostegui) [07:58:43] (03PS1) 10Muehlenhoff: Add stub secrets for mpic_next [labs/private] - 10https://gerrit.wikimedia.org/r/1034842 (https://phabricator.wikimedia.org/T361341) [07:59:02] (03PS7) 10Slyngshede: Build Bitu contain image using Blubber. [software/bitu] - 10https://gerrit.wikimedia.org/r/1030743 (https://phabricator.wikimedia.org/T362318) [07:59:11] (03CR) 10Muehlenhoff: [C:03+2] Switch db1249 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/1034802 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [07:59:37] I am doing the LQT one [07:59:41] andre: I am doing a backport :) [07:59:50] hashar, thank you! [08:00:01] !log hashar@deploy1002 Started scap: Backport for [[gerrit:1034182|Fix fatal error due to missing signature on very old comments (T365495)]] [08:00:04] andre and hashar: Deploy window MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240522T0800) [08:00:09] T365495: TypeError: Argument 1 passed to MediaWiki\Output\OutputPage::parseInternal() must be of the type string, null given, called in /srv/mediawiki/php-1.43.0-wmf.6/includes/Output/OutputPage.php on line 2446 - https://phabricator.wikimedia.org/T365495 [08:00:39] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:00:48] (03PS1) 10Marostegui: db1154: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1034843 [08:01:30] (03PS8) 10Slyngshede: Build Bitu contain image using Blubber. 
[software/bitu] - https://gerrit.wikimedia.org/r/1030743 (https://phabricator.wikimedia.org/T362318)
[08:01:31] (CR) Marostegui: "This is a noop" [puppet] - https://gerrit.wikimedia.org/r/1034843 (owner: Marostegui)
[08:01:34] (CR) Marostegui: [C:+2] db1154: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1034843 (owner: Marostegui)
[08:02:38] !log hashar@deploy1002 jforrester and hashar: Backport for [[gerrit:1034182|Fix fatal error due to missing signature on very old comments (T365495)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:02:48] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1232.eqiad.wmnet with OS bookworm
[08:02:56] !log hashar@deploy1002 jforrester and hashar: Continuing with sync
[08:03:01] tested, that fixed it
[08:03:33] (CR) Slyngshede: "check experimental" [software/bitu] - https://gerrit.wikimedia.org/r/1030743 (https://phabricator.wikimedia.org/T362318) (owner: Slyngshede)
[08:07:44] (CR) Slyngshede: [C:+1] "LGTM" [labs/private] - https://gerrit.wikimedia.org/r/1034842 (https://phabricator.wikimedia.org/T361341) (owner: Muehlenhoff)
[08:08:13] (PS9) Slyngshede: Build Bitu contain image using Blubber.
[software/bitu] - 10https://gerrit.wikimedia.org/r/1030743 (https://phabricator.wikimedia.org/T362318) [08:08:22] 06SRE, 10SRE-tools, 06collaboration-services, 06Infrastructure-Foundations, and 5 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619#9820025 (10MoritzMuehlenhoff) [08:08:36] (03CR) 10Slyngshede: "check experimental" [software/bitu] - 10https://gerrit.wikimedia.org/r/1030743 (https://phabricator.wikimedia.org/T362318) (owner: 10Slyngshede) [08:08:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host db1249.eqiad.wmnet [08:09:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 1%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62844 and previous config saved to /var/cache/conftool/dbconfig/20240522-080924-arnaudb.json [08:11:00] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T364290 db2173', diff saved to https://phabricator.wikimedia.org/P62845 and previous config saved to /var/cache/conftool/dbconfig/20240522-081059-arnaudb.json [08:11:04] T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290 [08:13:32] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2173.codfw.wmnet with reason: reimage [08:13:45] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2173.codfw.wmnet with reason: reimage [08:15:17] (03CR) 10Muehlenhoff: [V:03+2 C:03+2] Add stub secrets for mpic_next [labs/private] - 10https://gerrit.wikimedia.org/r/1034842 (https://phabricator.wikimedia.org/T361341) (owner: 10Muehlenhoff) [08:16:29] !log hashar@deploy1002 Finished scap: Backport for [[gerrit:1034182|Fix fatal error due to missing signature on very old comments (T365495)]] (duration: 16m 27s) [08:16:33] T365495: TypeError: Argument 1 passed to MediaWiki\Output\OutputPage::parseInternal() must be of the type string, null given, called in 
/srv/mediawiki/php-1.43.0-wmf.6/includes/Output/OutputPage.php on line 2446 - https://phabricator.wikimedia.org/T365495 [08:16:33] (03CR) 10Slyngshede: Build Bitu contain image using Blubber. (031 comment) [software/bitu] - 10https://gerrit.wikimedia.org/r/1030743 (https://phabricator.wikimedia.org/T362318) (owner: 10Slyngshede) [08:16:35] 08:16:29 Finished php-fpm-restarts (duration: 03m 42s) [08:16:43] I don't get WHY that takes almost 4 minutes ... [08:16:48] anyway done [08:17:12] 06SRE, 10Cloud-VPS: Depleted connection tracking table on labvirt1010 - https://phabricator.wikimedia.org/T139598#9820055 (10taavi) [08:18:53] PROBLEM - Check whether ferm is active by checking the default input chain on mw1457 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:21:17] (03PS3) 10Muehlenhoff: Automatically restart memcached/mcrouter on idp-test nodes [puppet] - 10https://gerrit.wikimedia.org/r/1023838 (https://phabricator.wikimedia.org/T135991) [08:24:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 2%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62846 and previous config saved to /var/cache/conftool/dbconfig/20240522-082431-arnaudb.json [08:24:37] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1023838 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:25:25] (03PS1) 10TrainBranchBot: group1 wikis to 1.43.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034846 (https://phabricator.wikimedia.org/T361400) [08:25:26] (03CR) 10TrainBranchBot: [C:03+2] group1 wikis to 1.43.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034846 (https://phabricator.wikimedia.org/T361400) (owner: 10TrainBranchBot) [08:26:06] (03Merged) 10jenkins-bot: group1 wikis to 1.43.0-wmf.6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034846 
(https://phabricator.wikimedia.org/T361400) (owner: 10TrainBranchBot) [08:32:17] (03CR) 10Muehlenhoff: [C:03+2] Automatically restart memcached/mcrouter on idp-test nodes [puppet] - 10https://gerrit.wikimedia.org/r/1023838 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:32:39] (03PS3) 10Muehlenhoff: configmaster: Enable profile::auto_restarts::service for apache/Envoy [puppet] - 10https://gerrit.wikimedia.org/r/1023847 (https://phabricator.wikimedia.org/T135991) [08:36:46] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:38:10] (03CR) 10Btullis: "I agree with Balthazar here. We should be removing the whole `datahub-gms` and `datahub-frontend` sections, but only after we have switche" [puppet] - 10https://gerrit.wikimedia.org/r/1032399 (https://phabricator.wikimedia.org/T363450) (owner: 10Stevemunene) [08:38:32] PROBLEM - Check unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:39:25] FIRING: [3x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:39:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 5%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62848 and previous config saved to /var/cache/conftool/dbconfig/20240522-083937-arnaudb.json [08:41:40] PROBLEM - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit 
httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:41:51] !log aklapper@deploy1002 rebuilt and synchronized wikiversions files: group1 wikis to 1.43.0-wmf.6 refs T361400 [08:41:55] T361400: 1.43.0-wmf.6 deployment blockers - https://phabricator.wikimedia.org/T361400 [08:42:18] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:45:34] (03PS2) 10Fabfur: benthos:cache: fix potentially missing uri_host field [puppet] - 10https://gerrit.wikimedia.org/r/1034534 (https://phabricator.wikimedia.org/T365441) [08:45:35] !log arnaudb@cumin1002 START - Cookbook sre.hosts.ipmi-password-reset [08:45:35] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.hosts.ipmi-password-reset (exit_code=99) [08:45:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.ipmi-password-reset [08:45:58] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast, AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [08:46:03] (03CR) 10Sergio Gimeno: [C:03+1] Remove forward slashes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034495 (https://phabricator.wikimedia.org/T332580) (owner: 10Cyndywikime) [08:46:04] !log arnaudb@cumin1002 Updating IPMI password on 1 hosts - arnaudb@cumin1002 [08:46:06] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [08:46:31] !log arnaudb@cumin1002 START - Cookbook sre.hosts.ipmi-password-reset [08:46:37] !log arnaudb@cumin1002 Updating IPMI password on 1 hosts - arnaudb@cumin1002 [08:46:39] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.ipmi-password-reset (exit_code=0) [08:48:50] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 20121 [08:48:52] RECOVERY 
- Check whether ferm is active by checking the default input chain on mw1457 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [08:48:52] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 20121 [08:49:13] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [08:49:14] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 20121 [08:49:15] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 20121 [08:49:16] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [08:49:21] (03CR) 10Hnowlan: [C:03+2] Add wikikube-ctrl2001 to server SRV record for etcd [dns] - 10https://gerrit.wikimedia.org/r/1034444 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [08:49:27] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 20121 [08:49:52] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 20121 [08:50:24] !log ayounsi@cumin1002 START - Cookbook sre.network.peering with action 'configure' for AS: 26162 [08:51:10] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 26162 [08:51:14] jouncebot: nowandnext [08:51:15] For the next 1 hour(s) and 8 minute(s): MediaWiki train - Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240522T0800) [08:51:15] In 1 hour(s) and 8 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240522T1000) [08:52:29] 06SRE, 06Infrastructure-Foundations, 10Puppet-Infrastructure, 10Puppet (Puppet 7.0): spicerack: update spicerack to work with the newer puppet infrastructure - 
https://phabricator.wikimedia.org/T341496#9820187 (10Volans) 05Open→03Resolved a:03Volans I think all was done. Resolving. [08:54:44] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62849 and previous config saved to /var/cache/conftool/dbconfig/20240522-085443-arnaudb.json [08:59:36] (03CR) 10Hnowlan: [C:03+2] Add wikikube-ctrl200[1-3] as master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1034449 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [09:01:09] FIRING: HelmReleaseBadStatus: Helm release device-analytics/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://grafana.wikimedia.org/d/UT4GtK3nz?var-site=eqiad&var-cluster=k8s-staging&var-namespace=device-analytics - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:02:15] (03PS3) 10Majavah: aptrepo: drop tekton component [puppet] - 10https://gerrit.wikimedia.org/r/975769 [09:02:34] (03CR) 10JMeybohm: [V:03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2560/console" [puppet] - 10https://gerrit.wikimedia.org/r/1034449 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [09:03:21] (03CR) 10Majavah: [C:03+2] aptrepo: drop tekton component [puppet] - 10https://gerrit.wikimedia.org/r/975769 (owner: 10Majavah) [09:06:09] (03PS1) 10JMeybohm: Add wikikube-ctrl2002 as master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1034849 (https://phabricator.wikimedia.org/T353464) [09:06:11] (03PS1) 10JMeybohm: Add wikikube-ctrl2003 as master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1034850 (https://phabricator.wikimedia.org/T353464) [09:06:22] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [09:06:26] !log btullis@deploy1002 helmfile [staging] DONE 
helmfile.d/services/device-analytics: apply [09:08:32] (03PS4) 10Majavah: P:dumps::distribution::nfs: use networks class for WMCS network ranges [puppet] - 10https://gerrit.wikimedia.org/r/1007889 [09:09:25] FIRING: [4x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:09:50] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62850 and previous config saved to /var/cache/conftool/dbconfig/20240522-090949-arnaudb.json [09:10:07] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2561/co" [puppet] - 10https://gerrit.wikimedia.org/r/1007889 (owner: 10Majavah) [09:12:32] (03CR) 10Volans: [C:04-1] "Sorry David for the confusion, if I might not have explained myself in the other CR." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1034538 (owner: 10DCausse) [09:12:34] !log btullis@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [09:12:36] !log btullis@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [09:13:30] (03CR) 10Majavah: [V:03+1 C:03+2] P:dumps::distribution::nfs: use networks class for WMCS network ranges [puppet] - 10https://gerrit.wikimedia.org/r/1007889 (owner: 10Majavah) [09:13:39] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [09:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:13:46] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [09:13:54] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [09:14:01] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [09:14:07] !log btullis@deploy1002 helmfile [codfw] START helmfile.d/services/device-analytics: apply [09:14:10] !log btullis@deploy1002 helmfile [codfw] DONE helmfile.d/services/device-analytics: apply [09:14:32] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [09:14:35] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [09:14:38] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [09:14:43] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [09:14:45] !log btullis@deploy1002 helmfile [eqiad] START helmfile.d/services/device-analytics: apply [09:14:47] !log btullis@deploy1002 helmfile [eqiad] DONE helmfile.d/services/device-analytics: apply [09:15:56] FIRING: 
RdfStreamingUpdaterFlinkJobUnstable: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:16:39] (03CR) 10DCausse: "Sure, no worries!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1034538 (owner: 10DCausse) [09:17:28] (03Abandoned) 10DCausse: Add LvsConfig to sre/init.py [cookbooks] - 10https://gerrit.wikimedia.org/r/1034538 (owner: 10DCausse) [09:19:22] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2211.codfw.wmnet with reason: Maintenance [09:19:25] FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:19:35] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2211.codfw.wmnet with reason: Maintenance [09:19:43] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2211 (T352010)', diff saved to https://phabricator.wikimedia.org/P62851 and previous config saved to /var/cache/conftool/dbconfig/20240522-091942-ladsgroup.json [09:19:47] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [09:20:56] RESOLVED: RdfStreamingUpdaterFlinkJobUnstable: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=rdf-streaming-updater&var-helm_release=wikidata - 
https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [09:21:44] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2173.codfw.wmnet with OS bookworm [09:21:49] (03CR) 10Volans: [C:03+1] "Nice! LGTM although I'll leave the fine details of the specific logic to your team ;)" [cookbooks] - 10https://gerrit.wikimedia.org/r/1032544 (owner: 10DCausse) [09:22:29] (03CR) 10Vgutierrez: [C:03+1] benthos:cache: fix potentially missing uri_host field [puppet] - 10https://gerrit.wikimedia.org/r/1034534 (https://phabricator.wikimedia.org/T365441) (owner: 10Fabfur) [09:22:30] !log running homer to add bgp status for wikikube-ctrl2001 [09:22:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:24:25] FIRING: [9x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:24:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62852 and previous config saved to /var/cache/conftool/dbconfig/20240522-092455-arnaudb.json [09:26:32] RECOVERY - snapshot of s7 in codfw on backupmon1001 is OK: Last snapshot for s7 at codfw (db2198) taken on 2024-05-22 08:36:49 (1054 GiB, +0.1 %) https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup [09:27:51] FIRING: KubernetesCalicoDown: wikikube-ctrl2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-ctrl2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:29:25] FIRING: [8x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:29:37] (03CR) 10Fabfur: [C:03+2] benthos:cache: fix potentially missing uri_host field [puppet] - 10https://gerrit.wikimedia.org/r/1034534 (https://phabricator.wikimedia.org/T365441) (owner: 10Fabfur) [09:31:09] FIRING: [2x] HelmReleaseBadStatus: Helm release device-analytics/main on k8s-staging@eqiad in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:33:44] (03PS1) 10Ayounsi: Revert "Add BGP sessions to KPN in esams" [homer/public] - 10https://gerrit.wikimedia.org/r/1034625 [09:34:25] FIRING: [8x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:34:35] (03PS1) 10Fabfur: cache: remove unused field from HAProxy log and Benthos conf [puppet] - 10https://gerrit.wikimedia.org/r/1034852 (https://phabricator.wikimedia.org/T365566) [09:36:09] FIRING: [3x] HelmReleaseBadStatus: Helm release device-analytics/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [09:37:47] (03PS1) 10Cathal Mooney: Add wikikube-ctrl to Homer wmf plugin to assign to k8s BGP group [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1034853 (https://phabricator.wikimedia.org/T353464) [09:38:14] (03CR) 10Btullis: [V:03+1 C:03+2] Absent last remaining reportupdater resources [puppet] - 10https://gerrit.wikimedia.org/r/1034475 (https://phabricator.wikimedia.org/T332580) (owner: 10Btullis) [09:38:31] RECOVERY - Check 
unit status of httpbb_kubernetes_mw-web_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-web_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:38:34] (03PS2) 10Btullis: Remove the last of the reportupdater resources in puppet [puppet] - 10https://gerrit.wikimedia.org/r/1034477 (https://phabricator.wikimedia.org/T332580) [09:39:11] (03CR) 10Btullis: [C:03+2] Remove the last of the reportupdater resources in puppet [puppet] - 10https://gerrit.wikimedia.org/r/1034477 (https://phabricator.wikimedia.org/T332580) (owner: 10Btullis) [09:39:25] RESOLVED: [8x] SystemdUnitFailed: httpbb_kubernetes_mw-api-int_hourly.service on cumin1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:40:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62853 and previous config saved to /var/cache/conftool/dbconfig/20240522-094001-arnaudb.json [09:40:46] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2173.codfw.wmnet with reason: host reimage [09:41:02] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate sessionstore.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [09:41:39] RECOVERY - Check unit status of httpbb_kubernetes_mw-wikifunctions_hourly on cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-wikifunctions_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:41:51] (03PS1) 10Majavah: hieradata: Auto-restart idp memcached on cloud [puppet] - 10https://gerrit.wikimedia.org/r/1034854 [09:42:17] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on 
cumin1002 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:42:51] RESOLVED: KubernetesCalicoDown: wikikube-ctrl2001.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-ctrl2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [09:44:19] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2173.codfw.wmnet with reason: host reimage [09:45:12] FIRING: [2x] CertAlmostExpired: Certificate for service sessionstore:8081 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore:8081 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [09:48:04] (03PS1) 10Majavah: O:openstack: merge net_ovs role back to the main net one [puppet] - 10https://gerrit.wikimedia.org/r/1034855 [09:49:30] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2562/co" [puppet] - 10https://gerrit.wikimedia.org/r/1034855 (owner: 10Majavah) [09:54:50] (03PS1) 10PipelineBot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1033403 [09:55:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db1232 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62854 and previous config saved to /var/cache/conftool/dbconfig/20240522-095507-arnaudb.json [09:55:33] (03PS2) 10Cathal Mooney: Add wikikube-ctrl to Homer wmf plugin to assign to k8s BGP group [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1034853 (https://phabricator.wikimedia.org/T353464) [09:56:42] (03CR) 10Ayounsi: [C:03+1] Add wikikube-ctrl to Homer wmf plugin to assign to k8s BGP group [software/homer/deploy] - 
10https://gerrit.wikimedia.org/r/1034853 (https://phabricator.wikimedia.org/T353464) (owner: 10Cathal Mooney) [09:56:57] (03CR) 10Hnowlan: [C:03+1] Add wikikube-ctrl to Homer wmf plugin to assign to k8s BGP group [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1034853 (https://phabricator.wikimedia.org/T353464) (owner: 10Cathal Mooney) [09:58:12] (03CR) 10Cathal Mooney: [C:03+2] Add wikikube-ctrl to Homer wmf plugin to assign to k8s BGP group [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/1034853 (https://phabricator.wikimedia.org/T353464) (owner: 10Cathal Mooney) [09:59:22] (03CR) 10Ayounsi: [C:03+2] Revert "Add BGP sessions to KPN in esams" [homer/public] - 10https://gerrit.wikimedia.org/r/1034625 (owner: 10Ayounsi) [09:59:52] (03Merged) 10jenkins-bot: Revert "Add BGP sessions to KPN in esams" [homer/public] - 10https://gerrit.wikimedia.org/r/1034625 (owner: 10Ayounsi) [10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240522T1000) [10:00:28] !log cmooney@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.6.5 update to hostname to bgp group mappings - cmooney@cumin1002 - T353464 [10:00:40] T353464: Migrate wikikube control planes to hardware nodes - https://phabricator.wikimedia.org/T353464 [10:01:22] (03CR) 10Muehlenhoff: [C:03+1] "Oops, thanks" [puppet] - 10https://gerrit.wikimedia.org/r/1034854 (owner: 10Majavah) [10:02:08] !log cmooney@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin1002.eqiad.wmnet with reason: Release v0.6.5 update to hostname to bgp group mappings - cmooney@cumin1002 - T353464 [10:06:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2173.codfw.wmnet with OS bookworm [10:07:31] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 1%: post reimage 
repool', diff saved to https://phabricator.wikimedia.org/P62855 and previous config saved to /var/cache/conftool/dbconfig/20240522-100730-arnaudb.json [10:08:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T364290 db2153', diff saved to https://phabricator.wikimedia.org/P62856 and previous config saved to /var/cache/conftool/dbconfig/20240522-100834-arnaudb.json [10:08:38] T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290 [10:09:04] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-ctrl2001.codfw.wmnet [10:09:27] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2153.codfw.wmnet with reason: reimage [10:09:41] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2153.codfw.wmnet with reason: reimage [10:10:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2153.codfw.wmnet with OS bookworm [10:11:15] (03CR) 10Hnowlan: [C:03+2] Add wikikube-ctrl2002 to server SRV record for etcd [dns] - 10https://gerrit.wikimedia.org/r/1034445 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [10:13:51] (03CR) 10Hnowlan: [C:03+2] Add wikikube-ctrl2002 as master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1034849 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [10:13:53] !restart backup2010 T365217 [10:13:54] T365217: Degraded RAID on backup2010 - https://phabricator.wikimedia.org/T365217 [10:18:01] (03CR) 10Majavah: [C:03+2] hieradata: Auto-restart idp memcached on cloud [puppet] - 10https://gerrit.wikimedia.org/r/1034854 (owner: 10Majavah) [10:20:53] (03PS1) 10Daniel Kinzler: REST: fix metrics keys [core] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034868 (https://phabricator.wikimedia.org/T365111) [10:22:37] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 2%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62857 and 
previous config saved to /var/cache/conftool/dbconfig/20240522-102236-arnaudb.json [10:23:30] 06SRE, 06Infrastructure-Foundations, 10netops: Move device AS numbers out of Homer YAML and source from Netbox - https://phabricator.wikimedia.org/T365572 (10cmooney) 03NEW p:05Triage→03Low [10:24:18] !log kormat@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 15%: repool clone source T364552', diff saved to https://phabricator.wikimedia.org/P62858 and previous config saved to /var/cache/conftool/dbconfig/20240522-102418-kormat.json [10:24:22] T364552: Fully format and reclone db1246 - https://phabricator.wikimedia.org/T364552 [10:24:25] FIRING: [2x] SystemdUnitFailed: docker.service on wikikube-ctrl2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:25:08] 06SRE, 10SRE-Access-Requests: Requesting access to analytics for rickijay - https://phabricator.wikimedia.org/T365574 (10RickiJay-WMDE) 03NEW [10:25:17] 06SRE, 10SRE-Access-Requests: Requesting access to analytics for rickijay - https://phabricator.wikimedia.org/T365574#9820423 (10RickiJay-WMDE) [10:27:06] 10SRE-tools, 10homer, 06Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#9820426 (10Volans) I had a quick look to understand our options in terms of parallelization. Keeping in mind the usual 3 possible approaches: multi-process, multi-thread, async.... 
[10:27:28] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2153.codfw.wmnet with reason: host reimage [10:27:44] (03PS1) 10Kormat: Revert "db1246: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1034871 (https://phabricator.wikimedia.org/T364552) [10:27:56] (03CR) 10CI reject: [V:04-1] Revert "db1246: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1034871 (https://phabricator.wikimedia.org/T364552) (owner: 10Kormat) [10:28:47] (03Abandoned) 10Kormat: Revert "db1246: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/1034871 (https://phabricator.wikimedia.org/T364552) (owner: 10Kormat) [10:29:25] RESOLVED: [2x] SystemdUnitFailed: docker.service on wikikube-ctrl2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:29:46] (03PS1) 10Majavah: P:simplelamp2: set missing memcached_user [puppet] - 10https://gerrit.wikimedia.org/r/1034858 [10:30:01] (03PS1) 10Kormat: db1246: Enable notifications. 
[puppet] - 10https://gerrit.wikimedia.org/r/1034859 (https://phabricator.wikimedia.org/T364552) [10:30:30] (03PS1) 10Mvolz: Update user-agent string in citoid to be like Zot [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034860 [10:31:25] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 11): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2563/consol" [puppet] - 10https://gerrit.wikimedia.org/r/1034858 (owner: 10Majavah) [10:32:20] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2153.codfw.wmnet with reason: host reimage [10:32:51] FIRING: KubernetesCalicoDown: wikikube-ctrl2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-ctrl2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:33:34] (03CR) 10Majavah: [V:03+1 C:03+2] P:simplelamp2: set missing memcached_user [puppet] - 10https://gerrit.wikimedia.org/r/1034858 (owner: 10Majavah) [10:35:25] (03PS1) 10Daniel Kinzler: Revert "graphite: blackhole MediaWiki.rest_api metrics" [puppet] - 10https://gerrit.wikimedia.org/r/1034872 [10:35:37] godog: --^ [10:36:22] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed - https://phabricator.wikimedia.org/T363119#9820446 (10Kormat) [10:36:29] (03CR) 10Stevemunene: [C:03+2] Setup kubeconfigs for datahub and datahub-next on dse-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1034797 (https://phabricator.wikimedia.org/T363832) (owner: 10Stevemunene) [10:36:56] PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:37:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 5%: post reimage repool', diff saved to 
https://phabricator.wikimedia.org/P62859 and previous config saved to /var/cache/conftool/dbconfig/20240522-103742-arnaudb.json [10:39:27] !log kormat@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 30%: repool clone source T364552', diff saved to https://phabricator.wikimedia.org/P62860 and previous config saved to /var/cache/conftool/dbconfig/20240522-103924-kormat.json [10:39:32] duesen: nice! thank you, I'll merge and deploy [10:39:35] T364552: Fully format and reclone db1246 - https://phabricator.wikimedia.org/T364552 [10:39:58] godog: hold on a sec, hnowlan said to wait until he's done with k8s stuff [10:39:58] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 447, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:40:07] duesen: oh ok, holding off [10:40:36] godog: will you backport the MW patch as well, or should I do that? [10:40:43] !log hnowlan@cumin1002 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-ctrl2002.codfw.wmnet [10:40:50] duesen: please do the MW bits, I'll do graphite [10:41:56] (03CR) 10Muehlenhoff: "This sounds like an ugly hack, why don't we instead fix the Puppet manifests to no longer include the components?" [puppet] - 10https://gerrit.wikimedia.org/r/1010906 (https://phabricator.wikimedia.org/T359619) (owner: 10Arturo Borrero Gonzalez) [10:42:15] 10SRE-tools, 10homer, 06Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#9820460 (10cmooney) One observation is that the config generation could be parallelized separate to the router transport. i.e. once the globbing on hostnames is done spawn separ... 
[10:42:18] (03CR) 10Cathal Mooney: [C:03+2] Set AS number for BGP EVPN devices globally at site level [homer/public] - 10https://gerrit.wikimedia.org/r/1032505 (https://phabricator.wikimedia.org/T365169) (owner: 10Cathal Mooney) [10:42:21] (03CR) 10Hnowlan: [C:03+2] Add wikikube-ctrl2003 to server SRV record for etcd [dns] - 10https://gerrit.wikimedia.org/r/1034446 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [10:42:30] (03PS4) 10Muehlenhoff: profile::parsoid::mediawiki: Don't hardcode the PHP version [puppet] - 10https://gerrit.wikimedia.org/r/1034535 [10:42:51] RESOLVED: KubernetesCalicoDown: wikikube-ctrl2002.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=wikikube-ctrl2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [10:42:52] (03Merged) 10jenkins-bot: Set AS number for BGP EVPN devices globally at site level [homer/public] - 10https://gerrit.wikimedia.org/r/1032505 (https://phabricator.wikimedia.org/T365169) (owner: 10Cathal Mooney) [10:43:13] godog: ok, will do. Do you have a way to confirm that the metrics keys have been fixed, without reverting the blackhole? Or should I just deploy and trust that it works? 
I'm quite confident, but you never know ;) [10:44:38] (03PS2) 10JMeybohm: Add wikikube-ctrl2003 as master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1034850 (https://phabricator.wikimedia.org/T353464) [10:45:06] duesen: yes I can watch graphite logs and see what metrics are created after the blackhole is gone [10:45:33] (03CR) 10Hnowlan: [C:03+2] Add wikikube-ctrl2003 as master_stacked [puppet] - 10https://gerrit.wikimedia.org/r/1034850 (https://phabricator.wikimedia.org/T353464) (owner: 10JMeybohm) [10:48:01] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1034535 (owner: 10Muehlenhoff) [10:51:01] (03CR) 10Marostegui: [C:03+1] db1246: Enable notifications. [puppet] - 10https://gerrit.wikimedia.org/r/1034859 (https://phabricator.wikimedia.org/T364552) (owner: 10Kormat) [10:51:14] (03PS5) 10Stevemunene: provision datahub service records [dns] - 10https://gerrit.wikimedia.org/r/1032393 (https://phabricator.wikimedia.org/T363299) [10:51:21] (03PS1) 10Daniel Kinzler: REST: fix metrics keys [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1034873 (https://phabricator.wikimedia.org/T365111) [10:52:11] (03CR) 10CI reject: [V:04-1] provision datahub service records [dns] - 10https://gerrit.wikimedia.org/r/1032393 (https://phabricator.wikimedia.org/T363299) (owner: 10Stevemunene) [10:52:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62861 and previous config saved to /var/cache/conftool/dbconfig/20240522-105248-arnaudb.json [10:53:12] (03PS5) 10Muehlenhoff: profile::parsoid::mediawiki: Don't hardcode the PHP version [puppet] - 10https://gerrit.wikimedia.org/r/1034535 [10:53:58] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2153.codfw.wmnet with OS bookworm [10:54:14] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1034535 (owner: 
10Muehlenhoff) [10:54:25] FIRING: [2x] SystemdUnitFailed: docker.service on wikikube-ctrl2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [10:54:33] !log kormat@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 45%: repool clone source T364552', diff saved to https://phabricator.wikimedia.org/P62862 and previous config saved to /var/cache/conftool/dbconfig/20240522-105432-kormat.json [10:54:37] T364552: Fully format and reclone db1246 - https://phabricator.wikimedia.org/T364552 [10:56:33] (03PS5) 10Fabfur: cache: remove unused field from HAProxy log and Benthos conf [puppet] - 10https://gerrit.wikimedia.org/r/1034852 (https://phabricator.wikimedia.org/T365566) [10:57:04] (03CR) 10Kormat: [C:03+2] db1246: Enable notifications. [puppet] - 10https://gerrit.wikimedia.org/r/1034859 (https://phabricator.wikimedia.org/T364552) (owner: 10Kormat) [10:57:10] 10SRE-tools, 10homer, 06Infrastructure-Foundations: Homer: add parallelization support - https://phabricator.wikimedia.org/T250415#9820482 (10Volans) Sorry for not mentioning it, the parallelization of the configuration generation was implicit to me, and also easier, but ideally we should parallelize both an... [10:59:25] RESOLVED: [2x] SystemdUnitFailed: docker.service on wikikube-ctrl2003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [11:00:04] mvolz: I, the Bot under the Fountain, call upon thee, The Deployer, to do Services – Citoid / Zotero deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240522T1100). 
[11:00:17] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034852 (https://phabricator.wikimedia.org/T365566) (owner: 10Fabfur) [11:02:41] !log hnowlan@cumin1002 conftool action : set/pooled=yes:weight=10; selector: name=wikikube-ctrl2003.codfw.wmnet [11:07:09] (03PS21) 10EoghanGaffney: lists: Add lists role to list2001 [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) [11:07:56] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62863 and previous config saved to /var/cache/conftool/dbconfig/20240522-110754-arnaudb.json [11:09:11] (03PS1) 10Fabfur: benthos:cache: switch debug endpoints off for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1034864 (https://phabricator.wikimedia.org/T360454) [11:09:39] !log kormat@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 60%: repool clone source T364552', diff saved to https://phabricator.wikimedia.org/P62864 and previous config saved to /var/cache/conftool/dbconfig/20240522-110938-kormat.json [11:09:43] T364552: Fully format and reclone db1246 - https://phabricator.wikimedia.org/T364552 [11:10:13] (03CR) 10Alexandros Kosiaris: [C:04-1] datasets-config: Add volume for configmap (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034581 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [11:10:50] 06SRE, 06Infrastructure-Foundations, 10netops: Move device AS numbers out of Homer YAML and source from Netbox - https://phabricator.wikimedia.org/T365572#9820507 (10cmooney) The other thing that strikes me here is how to manage the individual-device ASNs, for instance 4265003001 on asw1-bw27-esams. And als... 
[11:12:37] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2565/co" [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [11:13:45] (03PS2) 10Fabfur: benthos:cache: switch debug endpoints off for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1034864 (https://phabricator.wikimedia.org/T360454) [11:15:00] (03CR) 10CI reject: [V:04-1] REST: fix metrics keys [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1034873 (https://phabricator.wikimedia.org/T365111) (owner: 10Daniel Kinzler) [11:15:49] (03PS3) 10Fabfur: benthos:cache: switch debug endpoints off for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1034864 (https://phabricator.wikimedia.org/T360454) [11:16:09] (03CR) 10CI reject: [V:04-1] benthos:cache: switch debug endpoints off for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1034864 (https://phabricator.wikimedia.org/T360454) (owner: 10Fabfur) [11:17:16] (03CR) 10Mvolz: [C:03+2] citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1033403 (owner: 10PipelineBot) [11:18:09] (03CR) 10TrainBranchBot: [C:03+2] "Approved by daniel@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1034873 (https://phabricator.wikimedia.org/T365111) (owner: 10Daniel Kinzler) [11:18:26] (03PS4) 10Fabfur: benthos:cache: switch debug endpoints off for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1034864 (https://phabricator.wikimedia.org/T360454) [11:18:28] (03Merged) 10jenkins-bot: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1033403 (owner: 10PipelineBot) [11:19:08] (03PS2) 10Daniel Kinzler: REST: fix metrics keys [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1034873 (https://phabricator.wikimedia.org/T365111) [11:19:43] (03CR) 10TrainBranchBot: 
[C:03+2] "Approved by daniel@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034868 (https://phabricator.wikimedia.org/T365111) (owner: 10Daniel Kinzler) [11:20:12] godog--^ [11:20:30] backporting to wmf.6 now. wmf.5 to come next [11:20:39] duesen: ok! [11:20:54] (03PS1) 10Stevemunene: dns: provision datahub-next subdomain [dns] - 10https://gerrit.wikimedia.org/r/1034887 (https://phabricator.wikimedia.org/T365576) [11:21:52] (03CR) 10Daniel Kinzler: [C:03+2] "merge backport" [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1034873 (https://phabricator.wikimedia.org/T365111) (owner: 10Daniel Kinzler) [11:22:12] (03PS2) 10Stevemunene: dns: provision datahub-next subdomain [dns] - 10https://gerrit.wikimedia.org/r/1034887 (https://phabricator.wikimedia.org/T365576) [11:23:02] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62865 and previous config saved to /var/cache/conftool/dbconfig/20240522-112301-arnaudb.json [11:24:15] (03CR) 10Stevemunene: "So this would mean `datahub-gms.svc.eqiad.wmnet` and `datahub-frontend.svc.eqiad.wmnet` pointing to `k8s-ingress-dse.svc.eqiad.wmnet`. 
How" [puppet] - 10https://gerrit.wikimedia.org/r/1032399 (https://phabricator.wikimedia.org/T363450) (owner: 10Stevemunene) [11:24:46] !log kormat@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 75%: repool clone source T364552', diff saved to https://phabricator.wikimedia.org/P62866 and previous config saved to /var/cache/conftool/dbconfig/20240522-112444-kormat.json [11:24:52] T364552: Fully format and reclone db1246 - https://phabricator.wikimedia.org/T364552 [11:25:49] (03PS1) 10NMW03: Change $wgUploadNavigationUrl for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1033404 (https://phabricator.wikimedia.org/T364674) [11:26:22] (03CR) 10Muehlenhoff: [C:03+2] profile::kafka::broker: Drop support for non PKI configs [puppet] - 10https://gerrit.wikimedia.org/r/1031813 (owner: 10Muehlenhoff) [11:26:59] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T352010)', diff saved to https://phabricator.wikimedia.org/P62867 and previous config saved to /var/cache/conftool/dbconfig/20240522-112658-ladsgroup.json [11:27:01] duesen: I was about to deploy citoid; should it wait until after you've finished the backport? It in theory shouldn't interfere with mw deployment but maybe better to wait? [11:27:03] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [11:27:44] mvolz: from my end it doesn't matter... godog what do you think? 
[11:28:00] duesen mvolz also +1 on my end [11:28:07] to go ahead that is [11:28:07] * duesen is having trouble with CI for the wmf.5 backport [11:29:07] !log mvolz@deploy1002 helmfile [staging] START helmfile.d/services/citoid: apply [11:29:35] !log mvolz@deploy1002 helmfile [staging] DONE helmfile.d/services/citoid: apply [11:29:58] (03CR) 10EoghanGaffney: "I've updated it to not include the `check_spamd` section (it still shows up in the pcc diff, but is marked as `ensure => absent`" [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [11:30:44] (03CR) 10Daniel Kinzler: "recheck" [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1034873 (https://phabricator.wikimedia.org/T365111) (owner: 10Daniel Kinzler) [11:30:51] (03PS1) 10Cathal Mooney: Change EVPN BGP YAML to group into clusters and add codfw switches [homer/public] - 10https://gerrit.wikimedia.org/r/1034889 (https://phabricator.wikimedia.org/T365169) [11:31:45] (03CR) 10Daniel Kinzler: [C:03+2] "merge backport for deployment, second attempt." [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1034873 (https://phabricator.wikimedia.org/T365111) (owner: 10Daniel Kinzler) [11:32:50] gah. and I have to wait for CI for wmf.5 though I already know it's failing due to some network glitch. And then I have to re-run it and wait another 30 minutes. [11:33:06] i should have merged the patches beforehand [11:35:34] duesen: ok, no worries though I'm expected for lunch, I'm thinking of merging the blackhole revert once I'm back and you have deployed the change? 
I'll be back in ~1h [11:35:35] (03CR) 10Fabfur: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034864 (https://phabricator.wikimedia.org/T360454) (owner: 10Fabfur) [11:37:42] gotta go [11:38:08] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62868 and previous config saved to /var/cache/conftool/dbconfig/20240522-113807-arnaudb.json [11:38:56] !log mvolz@deploy1002 helmfile [codfw] START helmfile.d/services/citoid: apply [11:39:11] (03PS1) 10Muehlenhoff: kafka::mirror: Drop support for non PKI configs [puppet] - 10https://gerrit.wikimedia.org/r/1034892 [11:39:30] !log mvolz@deploy1002 helmfile [codfw] DONE helmfile.d/services/citoid: apply [11:39:52] !log kormat@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 90%: repool clone source T364552', diff saved to https://phabricator.wikimedia.org/P62869 and previous config saved to /var/cache/conftool/dbconfig/20240522-113952-kormat.json [11:39:56] T364552: Fully format and reclone db1246 - https://phabricator.wikimedia.org/T364552 [11:39:57] (03CR) 10CI reject: [V:04-1] REST: fix metrics keys [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1034873 (https://phabricator.wikimedia.org/T365111) (owner: 10Daniel Kinzler) [11:41:09] (03CR) 10Daniel Kinzler: [C:03+2] "merge backport for deployment, third attempt." [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1034873 (https://phabricator.wikimedia.org/T365111) (owner: 10Daniel Kinzler) [11:41:28] let's hope this one sticks... 
[11:41:48] !log mvolz@deploy1002 helmfile [eqiad] START helmfile.d/services/citoid: apply [11:42:07] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P62870 and previous config saved to /var/cache/conftool/dbconfig/20240522-114206-ladsgroup.json [11:42:21] !log mvolz@deploy1002 helmfile [eqiad] DONE helmfile.d/services/citoid: apply [11:42:26] (03Merged) 10jenkins-bot: REST: fix metrics keys [core] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034868 (https://phabricator.wikimedia.org/T365111) (owner: 10Daniel Kinzler) [11:42:58] !log daniel@deploy1002 Started scap: Backport for [[gerrit:1034868|REST: fix metrics keys (T365111)]] [11:43:02] T365111: Per-page graphite metrics created for MediaWiki.rest_api_latency / rest_api_errors - https://phabricator.wikimedia.org/T365111 [11:43:19] (03CR) 10JMeybohm: [C:03+1] pki: add temporary profile for prometheus + k8s [puppet] - 10https://gerrit.wikimedia.org/r/1034048 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [11:45:10] godog: scap for wmf.6 is running. But wmf.5 is still stuck in CI. Going to be a while. [11:45:40] !log daniel@deploy1002 daniel: Backport for [[gerrit:1034868|REST: fix metrics keys (T365111)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [11:45:53] were you also having github issues in CI? or something else? 
[11:45:54] (03CR) 10JMeybohm: [C:03+1] ipoid: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031105 (https://phabricator.wikimedia.org/T346638) (owner: 10Scott French) [11:46:31] (03PS1) 10Btullis: Migrate AQS2 services and image-suggestions to calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1033405 (https://phabricator.wikimedia.org/T359423) [11:46:43] (03Merged) 10jenkins-bot: REST: fix metrics keys [core] (wmf/1.43.0-wmf.5) - 10https://gerrit.wikimedia.org/r/1034873 (https://phabricator.wikimedia.org/T365111) (owner: 10Daniel Kinzler) [11:47:35] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1034892 (owner: 10Muehlenhoff) [11:47:42] !log daniel@deploy1002 daniel: Continuing with sync [11:47:43] (03CR) 10JMeybohm: [C:03+1] recommendation-api: add securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032764 (https://phabricator.wikimedia.org/T362978) (owner: 10Kamila Součková) [11:50:15] (03CR) 10JMeybohm: [C:03+1] tegola-vector-tiles: Add securityContext and update dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032524 (https://phabricator.wikimedia.org/T362978) (owner: 10RLazarus) [11:53:14] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2173 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62871 and previous config saved to /var/cache/conftool/dbconfig/20240522-115313-arnaudb.json [11:54:58] !log kormat@cumin1002 dbctl commit (dc=all): 'db1182 (re)pooling @ 100%: repool clone source T364552', diff saved to https://phabricator.wikimedia.org/P62872 and previous config saved to /var/cache/conftool/dbconfig/20240522-115458-kormat.json [11:55:03] T364552: Fully format and reclone db1246 - https://phabricator.wikimedia.org/T364552 [11:56:34] !log kormat@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 15%: Repool db1246 T364552', diff saved to 
https://phabricator.wikimedia.org/P62873 and previous config saved to /var/cache/conftool/dbconfig/20240522-115633-kormat.json [11:57:14] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211', diff saved to https://phabricator.wikimedia.org/P62874 and previous config saved to /var/cache/conftool/dbconfig/20240522-115714-ladsgroup.json [11:58:58] (03CR) 10Krinkle: [C:03+1] Delete docroot/noc/createTxtFileSymlinks.sh [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034183 (https://phabricator.wikimedia.org/T365514) (owner: 10Reedy) [12:00:23] !log daniel@deploy1002 Finished scap: Backport for [[gerrit:1034868|REST: fix metrics keys (T365111)]] (duration: 17m 25s) [12:00:36] T365111: Per-page graphite metrics created for MediaWiki.rest_api_latency / rest_api_errors - https://phabricator.wikimedia.org/T365111 [12:00:44] godog: scap for wmf.6 is complete. starting wmf.5 [12:00:54] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:01:01] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 36 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:01:05] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 1%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62875 and previous config saved to /var/cache/conftool/dbconfig/20240522-120105-arnaudb.json [12:01:39] !log daniel@deploy1002 Started scap: Backport for [[gerrit:1034873|REST: fix metrics keys (T365111)]] [12:02:24] !log arnaudb@cumin1002 dbctl commit (dc=all): 
'T364290 db2145', diff saved to https://phabricator.wikimedia.org/P62876 and previous config saved to /var/cache/conftool/dbconfig/20240522-120223-arnaudb.json [12:02:28] T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290 [12:02:33] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2145.codfw.wmnet with reason: reimage [12:02:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2145.codfw.wmnet with reason: reimage [12:02:58] (03CR) 10Brouberol: "Do we need `datahub-gms` to be under ingress? If not, I'd just define `datahub.svc.eqiad.wmnet` and make it point to the k8s DSE ingress r" [puppet] - 10https://gerrit.wikimedia.org/r/1032399 (https://phabricator.wikimedia.org/T363450) (owner: 10Stevemunene) [12:04:04] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2145.codfw.wmnet with OS bookworm [12:04:15] !log daniel@deploy1002 daniel: Backport for [[gerrit:1034873|REST: fix metrics keys (T365111)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:05:09] (03PS2) 10JMeybohm: Add new mesh.configuration version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028557 (https://phabricator.wikimedia.org/T362310) [12:05:09] (03PS1) 10JMeybohm: mesh.configuration: Add support for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034896 (https://phabricator.wikimedia.org/T362310) [12:06:36] !log daniel@deploy1002 daniel: Continuing with sync [12:06:42] (03PS2) 10JMeybohm: mesh.configuration: Add support for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028558 (https://phabricator.wikimedia.org/T362310) [12:07:02] (03Abandoned) 10JMeybohm: mesh.configuration: Add support for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034896 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [12:07:37] !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] START 
helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [12:08:18] (03PS3) 10JMeybohm: mesh.configuration: Add support for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028558 (https://phabricator.wikimedia.org/T362310) [12:10:11] (03CR) 10JMeybohm: "Thanks @ksouckova@wikimedia.org - I had to rebase to make this 1.7.2 as 1.7.1 has been released since. Would you mind double checking me o" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028558 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [12:11:20] (03PS7) 10Ayounsi: sre.hosts.rename: initial commit [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 [12:11:41] !log kormat@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 30%: Repool db1246 T364552', diff saved to https://phabricator.wikimedia.org/P62877 and previous config saved to /var/cache/conftool/dbconfig/20240522-121139-kormat.json [12:11:48] T364552: Fully format and reclone db1246 - https://phabricator.wikimedia.org/T364552 [12:12:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2211 (T352010)', diff saved to https://phabricator.wikimedia.org/P62878 and previous config saved to /var/cache/conftool/dbconfig/20240522-121222-ladsgroup.json [12:12:25] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2213.codfw.wmnet with reason: Maintenance [12:12:26] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [12:12:38] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2213.codfw.wmnet with reason: Maintenance [12:12:46] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2213 (T352010)', diff saved to https://phabricator.wikimedia.org/P62879 and previous config saved to /var/cache/conftool/dbconfig/20240522-121245-ladsgroup.json [12:14:24] (03PS1) 10Vgutierrez: depool upload@magru before enabling IPIP encapsulation [dns] - 
10https://gerrit.wikimedia.org/r/1034898 (https://phabricator.wikimedia.org/T357257) [12:16:11] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 2%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62880 and previous config saved to /var/cache/conftool/dbconfig/20240522-121611-arnaudb.json [12:18:28] (03CR) 10Vgutierrez: [C:03+2] depool upload@magru before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1034898 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:18:32] !log daniel@deploy1002 Finished scap: Backport for [[gerrit:1034873|REST: fix metrics keys (T365111)]] (duration: 16m 53s) [12:18:36] !log depool upload@magru before enabling IPIP encapsulation - T357257 [12:18:38] T365111: Per-page graphite metrics created for MediaWiki.rest_api_latency / rest_api_errors - https://phabricator.wikimedia.org/T365111 [12:18:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:42] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [12:18:42] !log stevemunene@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [12:20:48] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2145.codfw.wmnet with reason: host reimage [12:20:53] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:21:01] godog: deployment is done! rest_api metrics should be sane again. Let me know if there are still any issues. 
[12:21:34] (03PS1) 10Vgutierrez: hiera: Enable IPIP on high-traffic2@magru [puppet] - 10https://gerrit.wikimedia.org/r/1034899 (https://phabricator.wikimedia.org/T357257) [12:22:57] (03PS1) 10Vgutierrez: hiera: Enable IPIP on upload@magru [puppet] - 10https://gerrit.wikimedia.org/r/1034900 (https://phabricator.wikimedia.org/T357257) [12:23:54] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2145.codfw.wmnet with reason: host reimage [12:24:48] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034899 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:26:13] (03PS1) 10TChin: datasets-config: Add tlsExtraSANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034901 (https://phabricator.wikimedia.org/T357434) [12:26:41] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034900 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:26:48] !log kormat@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 45%: Repool db1246 T364552', diff saved to https://phabricator.wikimedia.org/P62881 and previous config saved to /var/cache/conftool/dbconfig/20240522-122647-kormat.json [12:26:52] T364552: Fully format and reclone db1246 - https://phabricator.wikimedia.org/T364552 [12:27:30] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on high-traffic2@magru [puppet] - 10https://gerrit.wikimedia.org/r/1034899 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:27:51] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 39 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:28:59] (03CR) 10Brouberol: [C:03+1] "Looks good! I'll take care of the apply in a bit" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034901 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [12:29:45] (03CR) 10DCausse: wdqs: extract categories reload to its own cookbook (032 comments) [cookbooks] - 10https://gerrit.wikimedia.org/r/1032544 (owner: 10DCausse) [12:30:16] (03CR) 10TChin: [C:03+2] datasets-config: Add tlsExtraSANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034901 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [12:31:17] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 5%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62882 and previous config saved to /var/cache/conftool/dbconfig/20240522-123116-arnaudb.json [12:31:39] (03CR) 10Volans: [C:03+1] wdqs: extract categories reload to its own cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1032544 (owner: 10DCausse) [12:32:06] !log mwmaint1002: mwscript extensions/GrowthExperiments/maintenance/revalidateLinkRecommendations.php --wiki={enwiki,dewiki} --all --verbose (T308144) [12:32:28] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on upload@magru [puppet] - 10https://gerrit.wikimedia.org/r/1034900 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:32:41] (03CR) 10Volans: [C:03+1] "LGTM" [cookbooks] - 10https://gerrit.wikimedia.org/r/1008818 (owner: 10Ayounsi) [12:33:15] (03Merged) 10jenkins-bot: datasets-config: Add tlsExtraSANs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034901 (https://phabricator.wikimedia.org/T357434) (owner: 10TChin) [12:33:47] (03CR) 10Muehlenhoff: [C:03+2] an-druid: Switch the Zookeeper firewall settings to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1032776 (owner: 10Muehlenhoff) [12:33:51] 
(03Abandoned) 10Mvolz: citoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/1021405 (owner: 10PipelineBot) [12:34:25] FIRING: SystemdUnitFailed: mediawiki_job_growthexperiments-refreshLinkRecommendations-s1.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:35:57] (03CR) 10Vgutierrez: [C:03+1] benthos:cache: switch debug endpoints off for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1034864 (https://phabricator.wikimedia.org/T360454) (owner: 10Fabfur) [12:36:46] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [12:37:33] (03PS2) 10Urbanecm: Remove forward slashes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034495 (https://phabricator.wikimedia.org/T332580) (owner: 10Cyndywikime) [12:37:40] jouncebot: nowandnext [12:37:41] No deployments scheduled for the next 0 hour(s) and 22 minute(s) [12:37:41] In 0 hour(s) and 22 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240522T1300) [12:37:41] (03CR) 10Vgutierrez: "Looks good but haproxy_cache_systemd_socket.yaml needs to be updated as well" [puppet] - 10https://gerrit.wikimedia.org/r/1034864 (https://phabricator.wikimedia.org/T360454) (owner: 10Fabfur) [12:37:46] duesen: ok! 
reverting now and will let you know [12:37:47] (03CR) 10Fabfur: [V:03+1 C:03+2] benthos:cache: switch debug endpoints off for ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/1034864 (https://phabricator.wikimedia.org/T360454) (owner: 10Fabfur) [12:37:51] (03CR) 10Muehlenhoff: [C:03+2] an-conf: Switch the Zookeeper firewall settings to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1032778 (owner: 10Muehlenhoff) [12:39:07] (03CR) 10Filippo Giunchedi: [C:03+2] Revert "graphite: blackhole MediaWiki.rest_api metrics" [puppet] - 10https://gerrit.wikimedia.org/r/1034872 (owner: 10Daniel Kinzler) [12:39:32] godog: shall I merge your patch along? [12:39:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T364299)', diff saved to https://phabricator.wikimedia.org/P62884 and previous config saved to /var/cache/conftool/dbconfig/20240522-123938-marostegui.json [12:39:39] moritzm: yes please, thank you [12:40:23] (03PS1) 10Urbanecm: foundationwiki: Grant autopatrol to the editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034903 (https://phabricator.wikimedia.org/T365584) [12:40:41] (03PS3) 10Urbanecm: Remove forward slashes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034495 (https://phabricator.wikimedia.org/T332580) (owner: 10Cyndywikime) [12:40:43] running puppet-agent on all cp-ulsfo to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1034864 [12:40:50] godog: merged [12:40:51] (03PS4) 10Urbanecm: Remove forward slashes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034495 (https://phabricator.wikimedia.org/T332580) (owner: 10Cyndywikime) [12:40:54] !log running puppet-agent on all cp-ulsfo to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1034864 [12:40:54] cheers [12:41:23] (03PS8) 10DCausse: wdqs: extract categories reload to its own cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/1032544 [12:41:23] (03PS17) 10DCausse: wdqs.data-reload: support HDFS
as a source [cookbooks] - 10https://gerrit.wikimedia.org/r/1031933 (https://phabricator.wikimedia.org/T349069) [12:41:37] !log rolling restart of pybal on lvs7003 and lvs7002 - T357257 [12:41:38] (03CR) 10Urbanecm: [C:03+2] foundationwiki: Grant autopatrol to the editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034903 (https://phabricator.wikimedia.org/T365584) (owner: 10Urbanecm) [12:41:38] (03CR) 10Urbanecm: [C:03+2] Remove forward slashes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034495 (https://phabricator.wikimedia.org/T332580) (owner: 10Cyndywikime) [12:41:54] !log kormat@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 60%: Repool db1246 T364552', diff saved to https://phabricator.wikimedia.org/P62885 and previous config saved to /var/cache/conftool/dbconfig/20240522-124153-kormat.json [12:41:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034903 (https://phabricator.wikimedia.org/T365584) (owner: 10Urbanecm) [12:41:58] (03CR) 10TrainBranchBot: [C:03+2] "Approved by urbanecm@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034495 (https://phabricator.wikimedia.org/T332580) (owner: 10Cyndywikime) [12:42:20] (03Merged) 10jenkins-bot: foundationwiki: Grant autopatrol to the editor group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034903 (https://phabricator.wikimedia.org/T365584) (owner: 10Urbanecm) [12:42:22] (03Merged) 10jenkins-bot: Remove forward slashes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034495 (https://phabricator.wikimedia.org/T332580) (owner: 10Cyndywikime) [12:42:52] !log urbanecm@deploy1002 Started scap: Backport for [[gerrit:1034903|foundationwiki: Grant autopatrol to the editor group (T365584)]], [[gerrit:1034495|Remove forward slashes (T332580 T363815)]] [12:42:53] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 
800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:43:21] duesen: we're good! e.g. creating database metric MediaWiki.rest_api_latency._v1_revision_from_compare_to_.GET.200.median I'll update the task too [12:44:27] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2145.codfw.wmnet with OS bookworm [12:45:36] !log urbanecm@deploy1002 urbanecm and cyndywikime: Backport for [[gerrit:1034903|foundationwiki: Grant autopatrol to the editor group (T365584)]], [[gerrit:1034495|Remove forward slashes (T332580 T363815)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:45:47] T365584: Grant `autopatrol` to the `editor` group at foundation.wikimedia.org - https://phabricator.wikimedia.org/T365584 [12:45:47] T332580: Upgrade an-launcher1002 to bullseye - https://phabricator.wikimedia.org/T332580 [12:45:48] T363815: Enable instrumentation for Temporary accounts <-> registered accounts flow - https://phabricator.wikimedia.org/T363815 [12:46:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62886 and previous config saved to /var/cache/conftool/dbconfig/20240522-124622-arnaudb.json [12:46:33] (03PS1) 10Vgutierrez: Revert "depool upload@magru before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1034877 (https://phabricator.wikimedia.org/T357257) [12:46:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 1%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62887 and previous config saved to /var/cache/conftool/dbconfig/20240522-124648-arnaudb.json [12:47:11] (03PS1) 10Krinkle: password: Document wmgPasswordSecretKey [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034905 
(https://phabricator.wikimedia.org/T150647) [12:49:15] (03PS1) 10Marostegui: db1232: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1034906 [12:49:53] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 36 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [12:50:01] (03CR) 10Vgutierrez: [C:03+2] Revert "depool upload@magru before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1034877 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [12:50:18] !log repool upload@magru with IPIP encapsulation enabled - T357257 [12:50:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:22] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [12:50:27] sukhe: ^^ [12:50:43] nice [12:50:49] (03CR) 10Marostegui: [C:03+2] db1232: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1034906 (owner: 10Marostegui) [12:51:01] (03CR) 10Ssingh: "Needs the two additional records as mentioned in the task -- which I know you are aware of." 
[dns] - 10https://gerrit.wikimedia.org/r/1034565 (https://phabricator.wikimedia.org/T365435) (owner: 10Dzahn) [12:51:42] (03CR) 10Muehlenhoff: [C:03+2] zk/flink: Switch the Zookeeper firewall settings to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1032773 (owner: 10Muehlenhoff) [12:53:56] (03CR) 10Filippo Giunchedi: "+Moritz for heads up" [puppet] - 10https://gerrit.wikimedia.org/r/1034048 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [12:54:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to https://phabricator.wikimedia.org/P62888 and previous config saved to /var/cache/conftool/dbconfig/20240522-125446-marostegui.json [12:54:52] (03CR) 10JMeybohm: [C:03+2] Add and enable default audit logging policy in staging-eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1034531 (https://phabricator.wikimedia.org/T290020) (owner: 10JMeybohm) [12:55:13] !log urbanecm@deploy1002 urbanecm and cyndywikime: Continuing with sync [12:55:55] (03CR) 10Muehlenhoff: [C:03+2] zookeeper/test: Switch the Zookeeper firewall settings to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1032771 (owner: 10Muehlenhoff) [12:56:43] (03PS1) 10Marostegui: db1231: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1034907 [12:57:01] !log kormat@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 75%: Repool db1246 T364552', diff saved to https://phabricator.wikimedia.org/P62889 and previous config saved to /var/cache/conftool/dbconfig/20240522-125659-kormat.json [12:57:05] T364552: Fully format and reclone db1246 - https://phabricator.wikimedia.org/T364552 [12:57:11] (03PS1) 10Vgutierrez: depool upload@eqiad before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1034908 (https://phabricator.wikimedia.org/T357257) [12:57:34] (03CR) 10Marostegui: [C:03+2] db1231: Remove package declaration [puppet] - 10https://gerrit.wikimedia.org/r/1034907 (owner: 
10Marostegui) [12:57:46] moritzm: ok to merge? [12:59:10] marostegui: yes, please [12:59:16] ok merging! [12:59:20] cheers [12:59:50] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:59:55] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:00:01] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:00:12] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240522T1300). [13:00:12] duesen and Nemoralis: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:26] (03PS2) 10Btullis: Migrate AQS2 services and image-suggestions to calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1033405 (https://phabricator.wikimedia.org/T359423) [13:00:32] (03CR) 10Vgutierrez: [C:03+2] depool upload@eqiad before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1034908 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:00:44] !log depool upload@eqiad before enabling IPIP encapsulation - T357257 [13:00:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:00:50] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [13:00:54] I already deployed my patch, so never mind :) [13:01:03] :) [13:01:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62890 and previous config saved to /var/cache/conftool/dbconfig/20240522-130128-arnaudb.json [13:01:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 2%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62891 and previous config saved to /var/cache/conftool/dbconfig/20240522-130154-arnaudb.json [13:03:06] (03CR) 10Brouberol: "The apps have now been deployed!" 
[puppet] - 10https://gerrit.wikimedia.org/r/1021383 (https://phabricator.wikimedia.org/T357434) (owner: 10Brouberol) [13:04:00] (03PS1) 10Vgutierrez: hiera: Enable IPIP on high-traffic2@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1034909 (https://phabricator.wikimedia.org/T357257) [13:04:01] (03PS1) 10Vgutierrez: hiera: Enable IPIP on upload@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1034910 (https://phabricator.wikimedia.org/T357257) [13:05:07] !log restarting all benthos instances in A:cp-ulsfo [13:05:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:56] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 4 CORE_DIFF 2 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/1034910 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:08:01] !log urbanecm@deploy1002 Finished scap: Backport for [[gerrit:1034903|foundationwiki: Grant autopatrol to the editor group (T365584)]], [[gerrit:1034495|Remove forward slashes (T332580 T363815)]] (duration: 25m 09s) [13:08:09] T365584: Grant `autopatrol` to the `editor` group at foundation.wikimedia.org - https://phabricator.wikimedia.org/T365584 [13:08:10] T332580: Upgrade an-launcher1002 to bullseye - https://phabricator.wikimedia.org/T332580 [13:08:10] T363815: Enable instrumentation for Temporary accounts <-> registered accounts flow - https://phabricator.wikimedia.org/T363815 [13:08:14] (03CR) 10Ssingh: [C:03+1] hiera: Enable IPIP on upload@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1034910 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:08:55] (03PS1) 10Jelto: aptrepo: upgrade gitlab-ce and gitlab-runner package to 16.10 [puppet] - 10https://gerrit.wikimedia.org/r/1034912 (https://phabricator.wikimedia.org/T365587) [13:09:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203', diff saved to 
https://phabricator.wikimedia.org/P62892 and previous config saved to /var/cache/conftool/dbconfig/20240522-130954-marostegui.json [13:09:55] (03PS2) 10Fabfur: hiera: test Benthos socket activation on cp4037 [puppet] - 10https://gerrit.wikimedia.org/r/1034099 (https://phabricator.wikimedia.org/T364379) [13:10:14] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034910 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:11:12] (03CR) 10Alexandros Kosiaris: [C:04-1] "This is the copy paste patch, right?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028557 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [13:12:07] !log kormat@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 90%: Repool db1246 T364552', diff saved to https://phabricator.wikimedia.org/P62893 and previous config saved to /var/cache/conftool/dbconfig/20240522-131206-kormat.json [13:12:12] T364552: Fully format and reclone db1246 - https://phabricator.wikimedia.org/T364552 [13:12:28] (03CR) 10Fabfur: [V:03+1 C:03+2] "Thanks for the reminder, I'll do in Ia5061fddb784759b2f5e81368070a6ccc3b3f252" [puppet] - 10https://gerrit.wikimedia.org/r/1034864 (https://phabricator.wikimedia.org/T360454) (owner: 10Fabfur) [13:12:35] (03CR) 10Muehlenhoff: [C:03+2] conf: Switch the Zookeeper firewall settings to firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/1032780 (owner: 10Muehlenhoff) [13:12:40] (03CR) 10Brouberol: [C:04-1] "The `cassandra-` prefix must be dropped, after which I expect this to work!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1033405 (https://phabricator.wikimedia.org/T359423) (owner: 10Btullis) [13:12:57] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 38 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [13:13:31] (03CR) 10Alexandros Kosiaris: [C:04-1] "+1 on implementation, -1 on versioning. See previous patch (the copy/paste one) for a reasoning on why this should be 1.8.0, not 1.7.2" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028558 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [13:13:40] FIRING: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:14:02] (03CR) 10JMeybohm: "> Any concerns with the field mapping as you see it now?" [puppet] - 10https://gerrit.wikimedia.org/r/1031602 (https://phabricator.wikimedia.org/T290020) (owner: 10Cwhite) [13:14:07] (03PS2) 10Vgutierrez: hiera: Enable IPIP on high-traffic2@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1034909 (https://phabricator.wikimedia.org/T357257) [13:14:07] (03PS2) 10Vgutierrez: hiera: Enable IPIP on upload@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1034910 (https://phabricator.wikimedia.org/T357257) [13:14:56] o/ [13:15:01] I can deploy [13:16:10] HI [13:16:30] 10ops-codfw, 06SRE, 06DC-Ops: ManagementSSHDown - https://phabricator.wikimedia.org/T365543#9820998 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm this was me yesterday. was fixing a cabling issue and must have unseated it by accident. reseated and pinging. 
[13:16:33] * Lucas_WMDE looking through the task [13:16:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62894 and previous config saved to /var/cache/conftool/dbconfig/20240522-131634-arnaudb.json [13:17:00] (03PS3) 10Vgutierrez: hiera: Enable IPIP on high-traffic2@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1034909 (https://phabricator.wikimedia.org/T357257) [13:17:00] (03PS3) 10Vgutierrez: hiera: Enable IPIP on upload@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1034910 (https://phabricator.wikimedia.org/T357257) [13:17:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 5%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62895 and previous config saved to /var/cache/conftool/dbconfig/20240522-131700-arnaudb.json [13:17:05] okay, https://az.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=usergroups confirms that only uploaders (and sysops) have the upload right these days [13:17:37] and I see https://az.wikipedia.org/wiki/Vikipediya:Y%C3%BCkl%C9%99m%C9%99_sehrbaz%C4%B1 is a wiki page linking to both Commons and local upload [13:17:45] that all looks deployable to me then :) [13:17:48] yes, we changed that [13:18:07] (03PS2) 10NMW03: Change $wgUploadNavigationUrl for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1033404 (https://phabricator.wikimedia.org/T364674) [13:18:24] (03CR) 10Klausman: [C:03+2] Add new version for amd-pytorch image (torch 2.3.0 - rocm 6.0) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1032725 (https://phabricator.wikimedia.org/T365166) (owner: 10Ilias Sarantopoulos) [13:18:53] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (DIFF 2 NOOP 4 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compile" [puppet] - 10https://gerrit.wikimedia.org/r/1034909 (https://phabricator.wikimedia.org/T357257) 
(owner: 10Vgutierrez) [13:19:01] (03PS1) 10Lucas Werkmeister (WMDE): PrefixSearch: Make sure $prefix is a string [core] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034878 (https://phabricator.wikimedia.org/T365565) [13:19:01] (03CR) 10Klausman: [V:03+2 C:03+2] Add new version for amd-pytorch image (torch 2.3.0 - rocm 6.0) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1032725 (https://phabricator.wikimedia.org/T365166) (owner: 10Ilias Sarantopoulos) [13:19:17] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1033404 (https://phabricator.wikimedia.org/T364674) (owner: 10NMW03) [13:20:00] (03Merged) 10jenkins-bot: Change $wgUploadNavigationUrl for azwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1033404 (https://phabricator.wikimedia.org/T364674) (owner: 10NMW03) [13:20:23] I’m also adding https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1034878 to the window [13:20:29] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1033404|Change $wgUploadNavigationUrl for azwiki (T364674)]] [13:20:37] T364674: Make $wgUploadNavigationUrl link to local page on azwiki - https://phabricator.wikimedia.org/T364674 [13:21:13] 10ops-codfw, 06SRE, 06DC-Ops, 06Machine-Learning-Team: ml-serve2002 memory errors on DIMM_B1 - https://phabricator.wikimedia.org/T365291#9821016 (10Jhancock.wm) 05Open→03Resolved no new errors. 
[13:22:16] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034910 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:22:46] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "starting gate-and-submit ahead of deployment" [core] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034878 (https://phabricator.wikimedia.org/T365565) (owner: 10Lucas Werkmeister (WMDE)) [13:23:17] !log lucaswerkmeister-wmde@deploy1002 nmw03 and lucaswerkmeister-wmde: Backport for [[gerrit:1033404|Change $wgUploadNavigationUrl for azwiki (T364674)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:23:39] Nemoralis: please test [13:24:11] (for me it seems to work ^^) [13:24:19] yes, LGTM [13:24:22] !log lucaswerkmeister-wmde@deploy1002 nmw03 and lucaswerkmeister-wmde: Continuing with sync [13:24:24] \o/ [13:25:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1203 (T364299)', diff saved to https://phabricator.wikimedia.org/P62896 and previous config saved to /var/cache/conftool/dbconfig/20240522-132501-marostegui.json [13:25:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [13:25:08] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [13:25:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1211.eqiad.wmnet with reason: Maintenance [13:25:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1211 (T364299)', diff saved to https://phabricator.wikimedia.org/P62897 and previous config saved to /var/cache/conftool/dbconfig/20240522-132526-marostegui.json [13:26:25] (03CR) 10Vgutierrez: [C:03+1] cache: remove unused field from HAProxy log and Benthos conf [puppet] - 
10https://gerrit.wikimedia.org/r/1034852 (https://phabricator.wikimedia.org/T365566) (owner: 10Fabfur) [13:26:47] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on high-traffic2@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1034909 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:27:13] !log kormat@cumin1002 dbctl commit (dc=all): 'db1246 (re)pooling @ 100%: Repool db1246 T364552', diff saved to https://phabricator.wikimedia.org/P62898 and previous config saved to /var/cache/conftool/dbconfig/20240522-132712-kormat.json [13:27:17] T364552: Fully format and reclone db1246 - https://phabricator.wikimedia.org/T364552 [13:28:26] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on upload@eqiad [puppet] - 10https://gerrit.wikimedia.org/r/1034910 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [13:29:25] RESOLVED: SystemdUnitFailed: mediawiki_job_growthexperiments-refreshLinkRecommendations-s1.service on mwmaint1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [13:29:47] (03PS3) 10Btullis: Migrate AQS2 services and image-suggestions to calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1033405 (https://phabricator.wikimedia.org/T359423) [13:30:58] (03PS3) 10JMeybohm: Add new mesh.configuration version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028557 (https://phabricator.wikimedia.org/T362310) [13:30:58] (03PS4) 10JMeybohm: mesh.configuration: Add support for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028558 (https://phabricator.wikimedia.org/T362310) [13:31:04] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1034912 (https://phabricator.wikimedia.org/T365587) (owner: 10Jelto) [13:31:17] (03CR) 10Hashar: "recheck after I have deployed 
https://gerrit.wikimedia.org/r/c/integration/config/+/1034915" [software/bitu] - 10https://gerrit.wikimedia.org/r/1030743 (https://phabricator.wikimedia.org/T362318) (owner: 10Slyngshede) [13:31:23] (03PS4) 10Btullis: Migrate AQS2 services and image-suggestions to calico network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1033405 (https://phabricator.wikimedia.org/T359423) [13:31:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62899 and previous config saved to /var/cache/conftool/dbconfig/20240522-133140-arnaudb.json [13:32:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62900 and previous config saved to /var/cache/conftool/dbconfig/20240522-133209-arnaudb.json [13:32:34] (03CR) 10Jelto: [C:03+2] aptrepo: upgrade gitlab-ce and gitlab-runner package to 16.10 [puppet] - 10https://gerrit.wikimedia.org/r/1034912 (https://phabricator.wikimedia.org/T365587) (owner: 10Jelto) [13:32:56] (03CR) 10Btullis: Migrate AQS2 services and image-suggestions to calico network policies (032 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1033405 (https://phabricator.wikimedia.org/T359423) (owner: 10Btullis) [13:34:45] ugh, my gate-and-submit build is failing [13:35:27] (03CR) 10Slyngshede: [C:03+2] Build Bitu contain image using Blubber. 
[software/bitu] - 10https://gerrit.wikimedia.org/r/1030743 (https://phabricator.wikimedia.org/T362318) (owner: 10Slyngshede) [13:36:10] FIRING: [3x] HelmReleaseBadStatus: Helm release device-analytics/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [13:36:34] (03CR) 10CI reject: [V:04-1] PrefixSearch: Make sure $prefix is a string [core] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034878 (https://phabricator.wikimedia.org/T365565) (owner: 10Lucas Werkmeister (WMDE)) [13:36:53] (03CR) 10Lucas Werkmeister (WMDE): [C:03+2] "try again" [core] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034878 (https://phabricator.wikimedia.org/T365565) (owner: 10Lucas Werkmeister (WMDE)) [13:36:56] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1033404|Change $wgUploadNavigationUrl for azwiki (T364674)]] (duration: 16m 27s) [13:37:00] T364674: Make $wgUploadNavigationUrl link to local page on azwiki - https://phabricator.wikimedia.org/T364674 [13:37:07] (03Merged) 10jenkins-bot: Build Bitu contain image using Blubber. [software/bitu] - 10https://gerrit.wikimedia.org/r/1030743 (https://phabricator.wikimedia.org/T362318) (owner: 10Slyngshede) [13:37:08] Nemoralis: should be deployed everywhere now [13:37:35] thanks! 
[13:37:36] jouncebot: next
[13:37:37] In 0 hour(s) and 22 minute(s): Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240522T1400)
[13:37:41] (03CR) 10TrainBranchBot: [C:03+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [core] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034878 (https://phabricator.wikimedia.org/T365565) (owner: 10Lucas Werkmeister (WMDE))
[13:37:46] *might* run into that window :S
[13:37:51] depending on how long CI + deployment take
[13:38:06] btullis: did you see device-analytics is in a bad state after your deployment? https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus
[13:39:12] !log rolling restart of pybal on lvs1020 and lvs1018 - T357257
[13:39:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:16] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257
[13:40:58] (03CR) 10Muehlenhoff: "I don't think this is a good idea, this file currently only contains definitions of our internal networks and at some point we'll most pro" [puppet] - 10https://gerrit.wikimedia.org/r/1017367 (owner: 10Dzahn)
[13:41:02] FIRING: PuppetCertificateAboutToExpire: Puppet CA certificate sessionstore.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire
[13:41:47] (03CR) 10Muehlenhoff: [C:03+2] Remove obsolete Druid cert [puppet] - 10https://gerrit.wikimedia.org/r/1034366 (owner: 10Muehlenhoff)
[13:43:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_rsyslog.service on ml-serve2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[13:45:02] 10ops-eqiad, 06SRE, 06DBA, 06DC-Ops: db1246 crashed - 
https://phabricator.wikimedia.org/T363119#9821214 (10Marostegui) @Jclark-ctr is there anything else left from your side or can this be closed too? [13:45:13] FIRING: [2x] CertAlmostExpired: Certificate for service sessionstore:8081 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore:8081 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [13:46:47] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2153 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62901 and previous config saved to /var/cache/conftool/dbconfig/20240522-134646-arnaudb.json [13:47:00] 10ops-eqiad, 06SRE, 06cloud-services-team, 10Cloud-VPS, 06DC-Ops: Move cloudsw2-d5-eqiad servers to cloudsw1-d5-eqiad - https://phabricator.wikimedia.org/T334644#9821211 (10aborrero) [13:47:20] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62902 and previous config saved to /var/cache/conftool/dbconfig/20240522-134717-arnaudb.json [13:47:20] 06SRE, 06cloud-services-team, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Move WMCS servers to 1 single NIC - https://phabricator.wikimedia.org/T319184#9821206 (10aborrero) 05Open→03Stalled blocking until {T364984} is fixed, so we don't risk having another cloudvirt offline. 
[13:48:00] !log installing bind9 security updates (client-side tools/libs)
[13:48:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:43] (03PS1) 10Vgutierrez: Revert "depool upload@eqiad before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1034880 (https://phabricator.wikimedia.org/T357257)
[13:50:51] reported my gate-and-submit failure at T365596 btw, let’s hope the second try works better 🤷
[13:50:52] T365596: Random test failure: DatabaseMysqlTest::testQueryTimeout: No DBQueryTimeoutError caught - https://phabricator.wikimedia.org/T365596
[13:51:01] 10ops-codfw, 06SRE, 06Data-Persistence, 10Data-Persistence-Backup, 06DC-Ops: Degraded RAID on backup2010 - https://phabricator.wikimedia.org/T365217#9821267 (10jcrespo) 05Open→03Resolved a:03jcrespo I did a disk stress test for an hour or so, saw no media errors, smart errors or raid controller...
[13:51:55] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T365337#9821262 (10BTullis) Thanks @Dzahn - As it happens, these cassandra servers have been handed over to #data-persistence as per https://github.com/wikimedia/operations-puppet/blob/production/... 
[13:52:46] (03PS1) 10Zabe: Stop writing to af_user(_text)/afh_user(_text) in group0 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034919 (https://phabricator.wikimedia.org/T337920)
[13:52:59] (03CR) 10Vgutierrez: [C:03+2] Revert "depool upload@eqiad before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1034880 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez)
[13:53:14] 10ops-eqiad, 06SRE, 06DC-Ops, 10cloud-services-team (Hardware): Q4:rack/setup/install new cloudcephmon hosts - https://phabricator.wikimedia.org/T364870#9821300 (10Jclark-ctr) @Andrew @dcaro once we find out racking information we will be able to rack and image these fairly quickly these have arrived
[13:53:14] !log repool upload@eqiad with IPIP encapsulation enabled - T357257
[13:53:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:53:18] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257
[13:53:24] arnoldokoth, XioNoX, eoghan: ^^
[13:53:34] ack ty!
[13:53:42] nice!
[13:54:07] (03CR) 10CDanis: [C:03+1] pki: add temporary profile for prometheus + k8s [puppet] - 10https://gerrit.wikimedia.org/r/1034048 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi)
[13:54:26] vgutierrez: awesome
[13:55:52] !log installing libcaca security updates
[13:55:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:57:33] I still laugh at libcaca name
[13:58:53] enwiki doesn’t say but I assume the name is a reference to aalib? ^^
[13:59:05] (03Merged) 10jenkins-bot: PrefixSearch: Make sure $prefix is a string [core] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034878 (https://phabricator.wikimedia.org/T365565) (owner: 10Lucas Werkmeister (WMDE))
[13:59:09] whee
[13:59:20] oh damn, only one minute left in the window. definitely overrunning then, sorry :(
[13:59:36] !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:1034878|PrefixSearch: Make sure $prefix is a string (T365565)]]
[13:59:42] T365565: InvalidArgumentException: Value 4 must be either string or LikeMatch - https://phabricator.wikimedia.org/T365565
[14:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240522T1400)
[14:00:13] I’m still deploying, sorry :(
[14:00:23] It's fine, I'll not use the window.
[14:00:28] ok
[14:02:20] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Backport for [[gerrit:1034878|PrefixSearch: Make sure $prefix is a string (T365565)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:02:25] testing…
[14:02:26] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62904 and previous config saved to /var/cache/conftool/dbconfig/20240522-140225-arnaudb.json
[14:02:35] yup, looks good
[14:02:37] !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde: Continuing with sync
[14:02:37] (03PS1) 10Zabe: beta: Remove password config override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034922
[14:04:52] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T365337#9821382 (10Eevans) We're aware, yeah, current efforts being tracked in {T362033} (no shortage of tickets on this one 😞). 
[14:05:22] 10ops-eqiad, 06SRE, 06Data-Persistence, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T365337#9821387 (10Eevans) →14Duplicate dup:03T362033 [14:05:31] 10ops-eqiad, 06SRE, 10Cassandra, 06DC-Ops: Degraded RAID on aqs1013 - https://phabricator.wikimedia.org/T362033#9821384 (10Eevans) [14:12:26] (03CR) 10Eevans: [C:03+1] admin: add user sg912 to cassandra-staging-devs [puppet] - 10https://gerrit.wikimedia.org/r/1034599 (https://phabricator.wikimedia.org/T365118) (owner: 10Dzahn) [14:14:35] !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:1034878|PrefixSearch: Make sure $prefix is a string (T365565)]] (duration: 14m 58s) [14:14:40] T365565: InvalidArgumentException: Value 4 must be either string or LikeMatch - https://phabricator.wikimedia.org/T365565 [14:14:52] * Lucas_WMDE done [14:17:32] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62905 and previous config saved to /var/cache/conftool/dbconfig/20240522-141732-arnaudb.json [14:18:11] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T352010)', diff saved to https://phabricator.wikimedia.org/P62906 and previous config saved to /var/cache/conftool/dbconfig/20240522-141809-ladsgroup.json [14:18:22] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:23:01] (03CR) 10JHathaway: [C:03+2] gerrit: Move outbound email to mx-out{1001,2001}.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1034593 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [14:23:29] (03PS1) 10Kosta Harlan: WikimediaEvents: Set IPoid URL and enable ip_reputation/score (3rd attempt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034881 (https://phabricator.wikimedia.org/T354597) [14:23:48] (03PS1) 10Kosta Harlan: ext-EventLogging: Add mediawiki.ip_reputation.score (2nd attempt) 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034882 (https://phabricator.wikimedia.org/T354597) [14:25:18] (03PS2) 10Kosta Harlan: WikimediaEvents: Set IPoid URL and enable ip_reputation/score (3rd attempt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034881 (https://phabricator.wikimedia.org/T354597) [14:25:39] (03PS2) 10Kosta Harlan: ext-EventLogging: Add mediawiki.ip_reputation.score (2nd attempt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034882 (https://phabricator.wikimedia.org/T354597) [14:27:12] (03PS1) 10Ayounsi: Update requirements to pickup new django-storage-swift [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1034946 [14:27:41] (03PS3) 10Kosta Harlan: EventLogging: Enable IP reputation logging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034882 (https://phabricator.wikimedia.org/T354597) [14:27:55] (03Abandoned) 10Kosta Harlan: WikimediaEvents: Set IPoid URL and enable ip_reputation/score (3rd attempt) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034881 (https://phabricator.wikimedia.org/T354597) (owner: 10Kosta Harlan) [14:28:21] !log disabling puppet on all cp-ulsfo to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1034852 selectively (T365566) [14:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:25] T365566: HAProxy should not log information we don't actually need - https://phabricator.wikimedia.org/T365566 [14:28:32] !log copy calico, istio-cni, kubernetes-node packages from bullseye-wikimedia to bookworm-wikimedia - T365253 [14:28:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:36] T365253: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253 [14:28:52] (03PS2) 10Ayounsi: Update requirements [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1034946 [14:29:34] (03CR) 10Fabfur: [V:03+1 C:03+2] cache: remove unused field from HAProxy log and 
Benthos conf [puppet] - 10https://gerrit.wikimedia.org/r/1034852 (https://phabricator.wikimedia.org/T365566) (owner: 10Fabfur) [14:30:12] (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2580/co" [puppet] - 10https://gerrit.wikimedia.org/r/1025741 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney) [14:31:20] (03CR) 10Andrea Denisse: "Hi team, this change is ready for review. Please a look at it if you can, thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1029664 (https://phabricator.wikimedia.org/T359255) (owner: 10Andrea Denisse) [14:31:46] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:32:10] !log jayme@cumin1002 conftool action : set/pooled=inactive; selector: name=kubernetes20(23|32).codfw.wmnet [14:32:38] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2145 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62907 and previous config saved to /var/cache/conftool/dbconfig/20240522-143238-arnaudb.json [14:32:39] (03CR) 10Elukey: [C:03+1] kafka::mirror: Drop support for non PKI configs [puppet] - 10https://gerrit.wikimedia.org/r/1034892 (owner: 10Muehlenhoff) [14:33:18] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P62908 and previous config saved to /var/cache/conftool/dbconfig/20240522-143318-ladsgroup.json [14:33:21] (03PS1) 10Vgutierrez: Depool upload@drmrs before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1034947 (https://phabricator.wikimedia.org/T357257) [14:33:24] !log drained, cordoned and pooled=inactive kubernetes2023 and kubernetes2032 
for cookbook testing - T350152 T365571 [14:33:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:30] T350152: Automation to change a server's vlan - https://phabricator.wikimedia.org/T350152 [14:33:31] T365571: Rename wikikube worker nodes during OS reimage - https://phabricator.wikimedia.org/T365571 [14:34:00] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T352010)', diff saved to https://phabricator.wikimedia.org/P62909 and previous config saved to /var/cache/conftool/dbconfig/20240522-143359-ladsgroup.json [14:34:09] (03CR) 10Volans: [C:03+1] "LGTM, question inline" [cookbooks] - 10https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) (owner: 10Volans) [14:34:15] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [14:34:18] (03CR) 10Dzahn: [C:03+2] admin: add user sg912 to cassandra-staging-devs [puppet] - 10https://gerrit.wikimedia.org/r/1034599 (https://phabricator.wikimedia.org/T365118) (owner: 10Dzahn) [14:35:16] (03CR) 10Herron: [C:03+1] "🧼🧹" [puppet] - 10https://gerrit.wikimedia.org/r/1034892 (owner: 10Muehlenhoff) [14:36:12] (03PS1) 10Vgutierrez: hiera: Enable IPIP on high-traffic2@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1034948 (https://phabricator.wikimedia.org/T357257) [14:36:13] (03PS1) 10Vgutierrez: hiera: Enable IPIP on upload@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1034949 (https://phabricator.wikimedia.org/T357257) [14:36:18] (03CR) 10Arturo Borrero Gonzalez: [C:03+1] "LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/1034855 (owner: 10Majavah) [14:36:46] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:37:01] (03CR) 10Muehlenhoff: [C:03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/1034048 (https://phabricator.wikimedia.org/T343529) (owner: 10Filippo Giunchedi) [14:39:46] (03CR) 10Majavah: [V:03+1 C:03+2] O:openstack: merge net_ovs role back to the main net one [puppet] - 10https://gerrit.wikimedia.org/r/1034855 (owner: 10Majavah) [14:40:30] (03PS1) 10GergesShamon: Revert "arwiki: Disable Extension:ContentTranslation for non-autoreview users" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034884 [14:41:07] (03PS2) 10GergesShamon: Revert "arwiki: Disable Extension:ContentTranslation for non-autoreview users" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034884 (https://phabricator.wikimedia.org/T255022) [14:41:08] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034948 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:41:40] (03PS1) 10Bking: data-engineering: add scaffolding for airflow service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034951 (https://phabricator.wikimedia.org/T363001) [14:42:23] (03CR) 10CI reject: [V:04-1] data-engineering: add scaffolding for airflow service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034951 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [14:42:46] (03CR) 10Vgutierrez: [C:03+2] Depool upload@drmrs before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1034947 
(https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:42:57] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034949 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:43:10] !log depool upload@drmrs before enabling IPIP encapsulation - T357257 [14:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:14] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [14:44:23] (03CR) 10Cathal Mooney: [C:03+1] sre.hosts.reimage: add support for VLAN move [cookbooks] - 10https://gerrit.wikimedia.org/r/1007652 (https://phabricator.wikimedia.org/T350152) (owner: 10Volans) [14:46:29] 06SRE, 10SRE-Access-Requests: Give access to Anti Harassment Tools team to production deployment - https://phabricator.wikimedia.org/T246053#9821658 (10Tchanders) 05In progress→03Resolved Thanks @Dzahn, I've followed up in Slack. I'll close this again since it seems to be a local problem. [14:46:37] 10ops-eqiad, 06SRE, 10Data-Persistence-Backup, 06DC-Ops, 10media-backups: backup1005 crashed - https://phabricator.wikimedia.org/T361087#9821686 (10jcrespo) [14:48:26] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213', diff saved to https://phabricator.wikimedia.org/P62910 and previous config saved to /var/cache/conftool/dbconfig/20240522-144826-ladsgroup.json [14:48:42] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to cassandra-staging-devs for sg912 - https://phabricator.wikimedia.org/T365118#9821664 (10Dzahn) 05In progress→03Resolved Hello @SGupta-WMF, your user has been added to the cassandra-staging-devs and has been created on the 3 cassa... 
[14:48:56] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [14:49:08] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P62911 and previous config saved to /var/cache/conftool/dbconfig/20240522-144907-ladsgroup.json [14:53:12] (03PS2) 10BryanDavis: wikitech: Add credentials for GitLab account blocking [puppet] - 10https://gerrit.wikimedia.org/r/1034532 (https://phabricator.wikimedia.org/T316418) [14:54:12] (03PS2) 10Vgutierrez: hiera: Enable IPIP on high-traffic2@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1034948 (https://phabricator.wikimedia.org/T357257) [14:54:12] (03PS2) 10Vgutierrez: hiera: Enable IPIP on upload@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1034949 (https://phabricator.wikimedia.org/T357257) [14:55:08] (03PS1) 10Hnowlan: utils: remove pem copying step from ecdsa_cert [puppet] - 10https://gerrit.wikimedia.org/r/1034952 (https://phabricator.wikimedia.org/T363996) [14:55:50] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034948 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:56:42] (03CR) 10Kamila Součková: [C:03+1] Add new chart: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [14:56:45] (03CR) 10Alexandros Kosiaris: [C:03+1] utils: remove pem copying step from ecdsa_cert [puppet] - 10https://gerrit.wikimedia.org/r/1034952 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [14:57:01] !log running `puppet cert revoke sessionstore.discovery.wmnet ` 
T363996 [14:57:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:05] T363996: Sessionstore's discovery TLS cert will expire before end of May 2024 - https://phabricator.wikimedia.org/T363996 [14:57:19] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034949 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:58:40] !log elukey@cumin1002 START - Cookbook sre.hosts.reimage for host ml-staging2001.codfw.wmnet with OS bookworm [14:58:46] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on high-traffic2@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1034948 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [14:59:06] (03CR) 10Elukey: [V:03+1 C:03+2] Skip ROCm packages for ml-staging2001 [puppet] - 10https://gerrit.wikimedia.org/r/1032765 (https://phabricator.wikimedia.org/T363191) (owner: 10Elukey) [14:59:33] (03PS1) 10Fabfur: benthos:cache: more robust grok parsing [puppet] - 10https://gerrit.wikimedia.org/r/1034953 (https://phabricator.wikimedia.org/T365566) [14:59:47] (03CR) 10Vgutierrez: [V:03+1 C:03+2] hiera: Enable IPIP on upload@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1034949 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [15:00:42] (03Abandoned) 10Ayounsi: Update requirements [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/1034946 (owner: 10Ayounsi) [15:01:41] !log stopping eqiad mediabackups for cleaning up missing files T365607 [15:01:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:50] T365607: Reprovision missing files due to backup1005 hw issues - https://phabricator.wikimedia.org/T365607 [15:02:21] (03PS1) 10Dzahn: site: remove contint1003 - former testing machine [puppet] - 10https://gerrit.wikimedia.org/r/1034954 
(https://phabricator.wikimedia.org/T358237) [15:02:56] (03CR) 10Fabfur: [C:03+2] benthos:cache: more robust grok parsing [puppet] - 10https://gerrit.wikimedia.org/r/1034953 (https://phabricator.wikimedia.org/T365566) (owner: 10Fabfur) [15:03:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2213 (T352010)', diff saved to https://phabricator.wikimedia.org/P62912 and previous config saved to /var/cache/conftool/dbconfig/20240522-150333-ladsgroup.json [15:03:39] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:04:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220', diff saved to https://phabricator.wikimedia.org/P62913 and previous config saved to /var/cache/conftool/dbconfig/20240522-150415-ladsgroup.json [15:04:35] (03PS1) 10Dzahn: role: delete ci_test role, not used anymore [puppet] - 10https://gerrit.wikimedia.org/r/1034955 (https://phabricator.wikimedia.org/T358237) [15:05:18] !log arnaudb@cumin1002 dbctl commit (dc=all): 'T364290 db2130', diff saved to https://phabricator.wikimedia.org/P62914 and previous config saved to /var/cache/conftool/dbconfig/20240522-150516-arnaudb.json [15:05:22] T364290: Upgrade s1 to MariaDB 10.6 - https://phabricator.wikimedia.org/T364290 [15:05:47] RESOLVED: PuppetCertificateAboutToExpire: Puppet CA certificate sessionstore.discovery.wmnet is about to expire - https://wikitech.wikimedia.org/wiki/Puppet#Renew_agent_certificate - TODO - https://alerts.wikimedia.org/?q=alertname%3DPuppetCertificateAboutToExpire [15:05:55] !log dzahn@cumin1002 START - Cookbook sre.hosts.decommission for hosts contint1003.eqiad.wmnet [15:06:23] !log arnaudb@cumin1002 START - Cookbook sre.hosts.reimage for host db2130.codfw.wmnet with OS bookworm [15:06:42] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2130.codfw.wmnet with reason: reimage [15:06:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime 
(exit_code=0) for 2:00:00 on db2130.codfw.wmnet with reason: reimage [15:07:08] (03CR) 10Kamila Součková: [C:03+1] "Good point about the versioning, I should have realised '^^" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028558 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [15:07:48] (03CR) 10Scott French: [C:03+2] dbctl: extend dbconfig checks to external sections [software/conftool] - 10https://gerrit.wikimedia.org/r/1032849 (https://phabricator.wikimedia.org/T365123) (owner: 10Scott French) [15:08:13] (03PS1) 10JMeybohm: Add wikikube-worker config [puppet] - 10https://gerrit.wikimedia.org/r/1034956 (https://phabricator.wikimedia.org/T365571) [15:09:50] (03PS1) 10Fabfur: benthos:cache: fix typo in grok pattern [puppet] - 10https://gerrit.wikimedia.org/r/1034958 (https://phabricator.wikimedia.org/T365566) [15:10:13] !log rolling restart of pybal on lvs6003 and lvs6002 - T357257 [15:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:10:18] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [15:10:36] !log dzahn@cumin1002 START - Cookbook sre.dns.netbox [15:11:51] (03Merged) 10jenkins-bot: dbctl: extend dbconfig checks to external sections [software/conftool] - 10https://gerrit.wikimedia.org/r/1032849 (https://phabricator.wikimedia.org/T365123) (owner: 10Scott French) [15:12:55] (03PS1) 10Bking: dse-k8s: add new airflow service to k8s cluster [puppet] - 10https://gerrit.wikimedia.org/r/1034961 (https://phabricator.wikimedia.org/T363001) [15:13:10] (03CR) 10Scott French: "Thank you both for the review!" 
[software/conftool] - 10https://gerrit.wikimedia.org/r/1034163 (https://phabricator.wikimedia.org/T365123) (owner: 10Scott French) [15:13:12] (03CR) 10Fabfur: [C:03+2] benthos:cache: fix typo in grok pattern [puppet] - 10https://gerrit.wikimedia.org/r/1034958 (https://phabricator.wikimedia.org/T365566) (owner: 10Fabfur) [15:13:13] (03CR) 10Scott French: [C:03+2] dbctl: break up test_check_config test case [software/conftool] - 10https://gerrit.wikimedia.org/r/1034163 (https://phabricator.wikimedia.org/T365123) (owner: 10Scott French) [15:14:17] (03PS1) 10Vgutierrez: Revert "Depool upload@drmrs before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1034885 (https://phabricator.wikimedia.org/T357257) [15:14:32] !log dzahn@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: contint1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1002" [15:15:34] (03PS1) 10Ayounsi: Replace django-auth-ldap with ApereoSocialPipeline [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1034962 (https://phabricator.wikimedia.org/T308002) [15:15:36] (03PS1) 10Ayounsi: Update requirements [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1034963 [15:16:07] !log dzahn@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: contint1003.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - dzahn@cumin1002" [15:16:07] !log dzahn@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:16:08] !log dzahn@cumin1002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts contint1003.eqiad.wmnet [15:16:21] 06SRE, 06collaboration-services, 10Continuous-Integration-Infrastructure, 10vm-requests, 13Patch-For-Review: Ganeti VM for contint migration - https://phabricator.wikimedia.org/T358237#9821997 
(10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by dzahn@cumin1002 for hosts: `contint1003.... [15:16:45] (03CR) 10Vgutierrez: [C:03+2] Revert "Depool upload@drmrs before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1034885 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [15:16:46] FIRING: [2x] SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:16:47] !log enabling puppet on all cp-ulsfo (T365566) [15:16:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:51] T365566: HAProxy should not log information we don't actually need - https://phabricator.wikimedia.org/T365566 [15:16:54] !log repool upload@drmrs with IPIP encapsulation enabled - T357257 [15:16:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:58] T357257: Use IPIP encapsulation on lvs<-->upload cluster - https://phabricator.wikimedia.org/T357257 [15:17:05] (03Merged) 10jenkins-bot: dbctl: break up test_check_config test case [software/conftool] - 10https://gerrit.wikimedia.org/r/1034163 (https://phabricator.wikimedia.org/T365123) (owner: 10Scott French) [15:17:15] cwhite, arnoldokoth ^^ [15:17:21] I got unrelated hiera changes while running a decom cookbook again. 
[15:17:50] (03CR) 10JMeybohm: [C:03+2] Add new chart: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [15:17:51] part of it was that parse1002 failed yesterday [15:18:03] (03CR) 10JMeybohm: [V:03+2 C:03+2] New chart from scaffold: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026563 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [15:18:05] the other part something for a cloudvirt mgmt interface [15:18:32] merged all because they seemed harmless enough but it could easily be something else [15:18:47] (03CR) 10CI reject: [V:04-1] New chart from scaffold: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026563 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [15:18:47] (03CR) 10CI reject: [V:04-1] Add new chart: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [15:19:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2220 (T352010)', diff saved to https://phabricator.wikimedia.org/P62915 and previous config saved to /var/cache/conftool/dbconfig/20240522-151923-ladsgroup.json [15:19:28] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [15:19:32] !log elukey@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on ml-staging2001.codfw.wmnet with reason: host reimage [15:19:46] (03CR) 10Ottomata: [C:03+1] kafka::mirror: Drop support for non PKI configs [puppet] - 10https://gerrit.wikimedia.org/r/1034892 (owner: 10Muehlenhoff) [15:19:53] (03CR) 10JMeybohm: [V:03+2 C:03+2] Add new chart: ratelimit [deployment-charts] - 10https://gerrit.wikimedia.org/r/1026564 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [15:20:58] (03CR) 10Dzahn: [C:03+2] "machine has been decom'ed by cookbook" [puppet] - 10https://gerrit.wikimedia.org/r/1034954 
(https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [15:22:39] (03PS1) 10Dzahn: hieradata: delete host contint1003 [puppet] - 10https://gerrit.wikimedia.org/r/1034965 (https://phabricator.wikimedia.org/T358237) [15:22:45] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ml-staging2001.codfw.wmnet with reason: host reimage [15:23:27] (03CR) 10Dzahn: [C:03+2] hieradata: delete host contint1003 [puppet] - 10https://gerrit.wikimedia.org/r/1034965 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [15:24:01] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2130.codfw.wmnet with reason: host reimage [15:24:30] (03CR) 10Dzahn: "I mean, you could argue that next time we upgrade we want to do the same thing again and some amount of work went into creating this.. so." [puppet] - 10https://gerrit.wikimedia.org/r/1034955 (https://phabricator.wikimedia.org/T358237) (owner: 10Dzahn) [15:24:54] (03CR) 10Volans: [C:03+1] "LGTM, does it need to be deployed together with a config change for the CAS stuff?" [software/netbox-deploy] (wmf-next) - 10https://gerrit.wikimedia.org/r/1034963 (owner: 10Ayounsi) [15:27:27] (03CR) 10Dzahn: [C:03+1] "lgtm, it's currently 7.4 on both buster and bullseye" [puppet] - 10https://gerrit.wikimedia.org/r/1034535 (owner: 10Muehlenhoff) [15:27:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2130.codfw.wmnet with reason: host reimage [15:28:09] (03CR) 10Dzahn: [C:03+1] "I think we need more reviews, especially Effie since she suggested it before." [puppet] - 10https://gerrit.wikimedia.org/r/1029235 (https://phabricator.wikimedia.org/T236253) (owner: 10Ahmon Dancy) [15:29:48] (03CR) 10Dzahn: "Oh, yea.. I somehow managed to overlook the other 2 I think. Amending!" 
[dns] - 10https://gerrit.wikimedia.org/r/1034565 (https://phabricator.wikimedia.org/T365435) (owner: 10Dzahn) [15:32:20] (03PS4) 10Dzahn: add additional Amazon domainkey values to learn.wiki domain [dns] - 10https://gerrit.wikimedia.org/r/1034565 (https://phabricator.wikimedia.org/T365435) [15:32:48] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [15:32:55] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [15:33:33] (03PS2) 10Scott French: aqs-http-gateway: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031497 (https://phabricator.wikimedia.org/T362978) [15:34:40] (03CR) 10Reedy: [C:04-1] Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [15:35:08] (03PS5) 10Dzahn: add additional Amazon domainkey values to learn.wiki domain [dns] - 10https://gerrit.wikimedia.org/r/1034565 (https://phabricator.wikimedia.org/T365435) [15:35:39] FIRING: [3x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:36:14] 06SRE, 10DNS, 06Traffic, 10WikiLearn, 13Patch-For-Review: DNS records for WikiLearn - https://phabricator.wikimedia.org/T365435#9822077 (10Dzahn) Hi @Asaf I have this change in code review now, adding 3 new values. https://gerrit.wikimedia.org/r/c/operations/dns/+/1034565/4/templates/learn.wiki I am... 
[15:36:53] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - parse1002 - https://phabricator.wikimedia.org/T365531#9822087 (10Dzahn) [15:37:15] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown - parse1002 - https://phabricator.wikimedia.org/T365531#9822092 (10Dzahn) →14Duplicate dup:03T365310 [15:37:17] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops-radar: ManagementSSHDown - parse1002 - https://phabricator.wikimedia.org/T365310#9822090 (10Dzahn) [15:37:31] (03CR) 10Kamila Součková: [C:03+2] recommendation-api: add securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032764 (https://phabricator.wikimedia.org/T362978) (owner: 10Kamila Součková) [15:37:56] 06SRE, 10DNS, 06Traffic, 10WikiLearn, 13Patch-For-Review: DNS records for WikiLearn - https://phabricator.wikimedia.org/T365435#9822081 (10Dzahn) 05Open→03In progress p:05Triage→03High [15:38:22] (03Merged) 10jenkins-bot: recommendation-api: add securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032764 (https://phabricator.wikimedia.org/T362978) (owner: 10Kamila Součková) [15:39:10] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/recommendation-api: apply [15:39:13] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [15:40:26] !log kamila@deploy1002 helmfile [staging] START helmfile.d/services/recommendation-api: apply [15:40:28] (03PS1) 10Vgutierrez: hiera: Enable IPIP on high-traffic2@esams [puppet] - 10https://gerrit.wikimedia.org/r/1034968 (https://phabricator.wikimedia.org/T357257) [15:40:28] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [15:40:30] (03PS1) 10Vgutierrez: hiera: Enable IPIP on esams@upload [puppet] - 10https://gerrit.wikimedia.org/r/1034969 (https://phabricator.wikimedia.org/T357257) [15:42:16] !log elukey@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ml-staging2001.codfw.wmnet with OS bookworm [15:42:25] !log kamila@deploy1002 
helmfile [staging] START helmfile.d/services/recommendation-api: apply [15:42:28] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 1 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034968 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [15:42:50] !log kamila@deploy1002 helmfile [staging] DONE helmfile.d/services/recommendation-api: apply [15:44:17] (03CR) 10Kosta Harlan: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [15:44:18] (03PS1) 10BryanDavis: gitlab.runners: Add *.toolforge.org to allowed services [puppet] - 10https://gerrit.wikimedia.org/r/1034971 (https://phabricator.wikimedia.org/T365561) [15:44:36] !log upload to bookworm-wikimedia dragonfly-{dfdaemon,dfget}, calicoctl, calico-cni - T365253 [15:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:40] T365253: Allow Kubernetes workers to be deployed on Bookworm - https://phabricator.wikimedia.org/T365253 [15:45:01] (03PS1) 10Hnowlan: sessionstore: use new cert to go with new key in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034972 (https://phabricator.wikimedia.org/T363996) [15:46:49] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (NOOP 6 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034969 (https://phabricator.wikimedia.org/T357257) (owner: 10Vgutierrez) [15:47:23] (03CR) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [15:50:33] !log 
arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2130.codfw.wmnet with OS bookworm [15:52:59] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 34 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [15:53:06] (03PS1) 10Ilias Sarantopoulos: fix: remove --cache-dir from pytorch image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1034975 (https://phabricator.wikimedia.org/T365166) [15:53:16] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 1%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62916 and previous config saved to /var/cache/conftool/dbconfig/20240522-155315-arnaudb.json [15:53:58] (03CR) 10Dreamy Jazz: "Why we are modifying a `wmf` branch here that is not used on production? Does this not need to be in the `master` branch too?" 
[extensions/SecurePoll] (wmf/1.42.0-wmf.24) - 10https://gerrit.wikimedia.org/r/1014053 (https://phabricator.wikimedia.org/T291821) (owner: 10Driedmueller) [15:55:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P62917 and previous config saved to /var/cache/conftool/dbconfig/20240522-155533-root.json [15:55:35] !log kamila@deploy1002 helmfile [eqiad] START helmfile.d/services/recommendation-api: apply [15:56:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db1226', diff saved to https://phabricator.wikimedia.org/P62918 and previous config saved to /var/cache/conftool/dbconfig/20240522-155621-root.json [15:56:35] !log kamila@deploy1002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: apply [15:56:39] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Long schema change [15:56:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1226.eqiad.wmnet with reason: Long schema change [15:58:03] (03CR) 10Ilias Sarantopoulos: "I rebuilt the image and it is now 13.5GB vs 15.9GB previously." [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1034975 (https://phabricator.wikimedia.org/T365166) (owner: 10Ilias Sarantopoulos) [15:59:09] (03CR) 10Volans: [C:03+1] "Looks sane, I think it's ready for testing. Couple of nits inline." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/981472 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [15:59:20] (03PS1) 10JMeybohm: Rename kubernetes2023 to wikikube-worker2001 [puppet] - 10https://gerrit.wikimedia.org/r/1034976 (https://phabricator.wikimedia.org/T365571) [15:59:22] (03PS1) 10JMeybohm: Rename kubernetes2032 to wikikube-worker2002 [puppet] - 10https://gerrit.wikimedia.org/r/1034977 (https://phabricator.wikimedia.org/T365571) [16:00:01] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 52 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:00:12] (03CR) 10Klausman: [C:03+2] fix: remove --cache-dir from pytorch image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1034975 (https://phabricator.wikimedia.org/T365166) (owner: 10Ilias Sarantopoulos) [16:00:43] (03CR) 10JMeybohm: [C:03+2] mesh.configuration: Add support for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028558 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [16:00:46] (03CR) 10JMeybohm: [C:03+2] Add new mesh.configuration version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028557 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [16:01:38] (03Merged) 10jenkins-bot: Add new mesh.configuration version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028557 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [16:01:45] (03Merged) 10jenkins-bot: mesh.configuration: Add support for rate limiting [deployment-charts] - 10https://gerrit.wikimedia.org/r/1028558 (https://phabricator.wikimedia.org/T362310) (owner: 10JMeybohm) [16:02:54] (03PS1) 10CDanis: Move opentelemetry-collector to admin_ng [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034978 
(https://phabricator.wikimedia.org/T365626) [16:03:22] (03CR) 10JMeybohm: [C:03+1] "Issuer: CN = Puppet CA: palladium.eqiad.wmnet" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034972 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [16:06:27] (03CR) 10Klausman: [V:03+2 C:03+2] fix: remove --cache-dir from pytorch image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1034975 (https://phabricator.wikimedia.org/T365166) (owner: 10Ilias Sarantopoulos) [16:07:13] (03CR) 10Hnowlan: [C:03+2] sessionstore: use new cert to go with new key in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034972 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [16:08:05] (03Merged) 10jenkins-bot: sessionstore: use new cert to go with new key in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034972 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [16:08:22] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 2%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62919 and previous config saved to /var/cache/conftool/dbconfig/20240522-160821-arnaudb.json [16:08:44] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply [16:08:49] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [16:08:55] 10SRE-tools, 06Infrastructure-Foundations, 10Spicerack: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9822235 (10joanna_borun) [16:08:56] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [16:09:16] (03PS2) 10Btullis: [WIP] Add radosgw services to the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1034973 
(https://phabricator.wikimedia.org/T330152) [16:10:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P62920 and previous config saved to /var/cache/conftool/dbconfig/20240522-161039-root.json [16:11:42] (03CR) 10CI reject: [V:04-1] [WIP] Add radosgw services to the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [16:13:33] !log Running `mwscript extensions/WikiLambda/maintenance/migrateZ16K1StringsToZ61s.php --wiki=wikifunctionswiki --implement` on mwmaint1002 for T287153 [16:13:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:38] T287153: Switch WikiLambda front-end from hard-coded strings to using Z61/Programming language objects - https://phabricator.wikimedia.org/T287153 [16:15:02] 06SRE, 10DNS, 06Traffic, 10WikiLearn, 13Patch-For-Review: DNS records for WikiLearn - https://phabricator.wikimedia.org/T365435#9822258 (10ssingh) If you want to send from the wikimedia.org domain, then yes, that's what you will need. There is no concern with the records themselves. [16:15:50] (03CR) 10Ssingh: "Per Asaf's comments, it seems clear that the DKIM records should go to wikimedia.org since that's the domain they want to send the email f" [dns] - 10https://gerrit.wikimedia.org/r/1034565 (https://phabricator.wikimedia.org/T365435) (owner: 10Dzahn) [16:17:03] (03CR) 10Dzahn: "I'm confused now. 
Didn't you ask me if it should be learn.wiki and not wikimedia.org and then he said "agreed"" [dns] - 10https://gerrit.wikimedia.org/r/1034565 (https://phabricator.wikimedia.org/T365435) (owner: 10Dzahn) [16:17:52] (03CR) 10Ssingh: "Yes but he updated his comment https://phabricator.wikimedia.org/T365435#9822154 that it should not be learn.wiki :)" [dns] - 10https://gerrit.wikimedia.org/r/1034565 (https://phabricator.wikimedia.org/T365435) (owner: 10Dzahn) [16:18:01] (03CR) 10Dzahn: "If it's really for wikimedia.org then I would expect that to be an issue for domain reputation?" [dns] - 10https://gerrit.wikimedia.org/r/1034565 (https://phabricator.wikimedia.org/T365435) (owner: 10Dzahn) [16:19:11] !log kamila@deploy1002 helmfile [codfw] START helmfile.d/services/recommendation-api: apply [16:19:13] 06SRE, 10DNS, 06Traffic, 10WikiLearn, 13Patch-For-Review: DNS records for WikiLearn - https://phabricator.wikimedia.org/T365435#9822279 (10Asaf) In that case, I apologize for the confusion I created. We do want to send from comdevteam@wikimedia.org, so please amend the patch to the original CNAMEs reques... 
[16:19:31] (03Abandoned) 10Dzahn: add additional Amazon domainkey values to learn.wiki domain [dns] - 10https://gerrit.wikimedia.org/r/1034565 (https://phabricator.wikimedia.org/T365435) (owner: 10Dzahn) [16:19:49] !log kamila@deploy1002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: apply [16:20:12] 10SRE-tools, 06DC-Ops, 06Infrastructure-Foundations, 10Spicerack: Spicerack: expand Supermicro support in the Redfish module - https://phabricator.wikimedia.org/T365372#9822300 (10Volans) [16:20:22] 06SRE, 10DNS, 06Traffic, 10WikiLearn, 13Patch-For-Review: DNS records for WikiLearn - https://phabricator.wikimedia.org/T365435#9822302 (10Dzahn) 05In progress→03Open [16:23:28] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 5%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62921 and previous config saved to /var/cache/conftool/dbconfig/20240522-162327-arnaudb.json [16:25:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P62922 and previous config saved to /var/cache/conftool/dbconfig/20240522-162546-root.json [16:28:05] (03PS3) 10Btullis: [WIP] Add radosgw services to the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) [16:28:22] (03PS5) 10David Caro: openstack: use bobcat/supported os for all tests [puppet] - 10https://gerrit.wikimedia.org/r/1031957 [16:28:22] (03PS10) 10David Caro: openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 [16:28:51] (03PS1) 10Andrea Denisse: grafana: Replace Fixnum with Integer in ldap.toml.erb template [puppet] - 10https://gerrit.wikimedia.org/r/1034981 (https://phabricator.wikimedia.org/T358506) [16:28:51] (03CR) 10Andrea Denisse: "I found this issue while testing a patch #1034626 with PCC." 
[puppet] - 10https://gerrit.wikimedia.org/r/1034981 (https://phabricator.wikimedia.org/T358506) (owner: 10Andrea Denisse) [16:29:23] (03CR) 10David Caro: [C:03+2] openstack: use bobcat/supported os for all tests [puppet] - 10https://gerrit.wikimedia.org/r/1031957 (owner: 10David Caro) [16:29:37] (03CR) 10David Caro: openstack::bobcat: apply cloud yaml patch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro) [16:29:47] (03CR) 10David Caro: [C:03+2] openstack::bobcat: apply cloud yaml patch [puppet] - 10https://gerrit.wikimedia.org/r/1031953 (owner: 10David Caro) [16:29:58] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [16:30:42] (03CR) 10CI reject: [V:04-1] [WIP] Add radosgw services to the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [16:32:25] (03CR) 10Dr0ptp4kt: [C:03+1] "Generally, looks okay to me - discussion today suggested that the main thing to run it against a safe host to make sure it works." 
[cookbooks] - 10https://gerrit.wikimedia.org/r/1032544 (owner: 10DCausse) [16:33:00] (03PS1) 10DCausse: extension registration: Fix handling of null default values [core] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034989 (https://phabricator.wikimedia.org/T365190) [16:33:00] (03CR) 10Cwhite: grafana: Replace Fixnum with Integer in ldap.toml.erb template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1034981 (https://phabricator.wikimedia.org/T358506) (owner: 10Andrea Denisse) [16:33:33] (03PS3) 10Stevemunene: dns: provision datahub-next subdomain [dns] - 10https://gerrit.wikimedia.org/r/1034887 (https://phabricator.wikimedia.org/T365576) [16:33:45] (03PS4) 10Btullis: [WIP] Add radosgw services to the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) [16:34:22] (03CR) 10Brouberol: [C:03+1] "LGTM" [dns] - 10https://gerrit.wikimedia.org/r/1034887 (https://phabricator.wikimedia.org/T365576) (owner: 10Stevemunene) [16:34:41] (03PS2) 10Andrea Denisse: grafana: Replace Fixnum with Integer in ldap.toml.erb template [puppet] - 10https://gerrit.wikimedia.org/r/1034981 (https://phabricator.wikimedia.org/T358506) [16:35:09] (03CR) 10Andrea Denisse: grafana: Replace Fixnum with Integer in ldap.toml.erb template (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1034981 (https://phabricator.wikimedia.org/T358506) (owner: 10Andrea Denisse) [16:35:51] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [16:35:54] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2593/console" [puppet] - 10https://gerrit.wikimedia.org/r/1034981 
(https://phabricator.wikimedia.org/T358506) (owner: 10Andrea Denisse) [16:36:18] (03CR) 10CI reject: [V:04-1] [WIP] Add radosgw services to the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [16:36:33] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/1034981/2593/" [puppet] - 10https://gerrit.wikimedia.org/r/1034981 (https://phabricator.wikimedia.org/T358506) (owner: 10Andrea Denisse) [16:36:46] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [16:38:34] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 10%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62923 and previous config saved to /var/cache/conftool/dbconfig/20240522-163834-arnaudb.json [16:40:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P62924 and previous config saved to /var/cache/conftool/dbconfig/20240522-164052-root.json [16:43:22] (03PS2) 10Cathal Mooney: Change EVPN BGP YAML to group into clusters and add codfw switches [homer/public] - 10https://gerrit.wikimedia.org/r/1034889 (https://phabricator.wikimedia.org/T365169) [16:44:14] (03CR) 10Andrea Denisse: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2594/console" [puppet] - 10https://gerrit.wikimedia.org/r/1034981 (https://phabricator.wikimedia.org/T358506) (owner: 10Andrea Denisse) [16:45:01] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 34 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:46:32] (03CR) 10Cwhite: [C:03+1] grafana: Replace Fixnum with Integer in ldap.toml.erb template [puppet] - 10https://gerrit.wikimedia.org/r/1034981 (https://phabricator.wikimedia.org/T358506) (owner: 10Andrea Denisse) [16:48:19] (03PS3) 10Cathal Mooney: Change EVPN BGP YAML to group into clusters and add codfw switches [homer/public] - 10https://gerrit.wikimedia.org/r/1034889 (https://phabricator.wikimedia.org/T365169) [16:48:31] (03CR) 10David Caro: [C:03+1] "Might need a rebase, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/890001 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [16:49:29] 06SRE, 10SRE-Access-Requests: Requesting access to analytics for rickijay - https://phabricator.wikimedia.org/T365574#9822552 (10Dzahn) Hi! If possible could you please specify which of the following you are requesting? a) analytics-privatedata-users (no kerberos, no ssh) b) analytics-privatedata-users (no... 
[16:52:01] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 45 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [16:53:42] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 25%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62925 and previous config saved to /var/cache/conftool/dbconfig/20240522-165340-arnaudb.json [16:55:12] (03CR) 10Andrea Denisse: [V:03+1 C:03+2] grafana: Replace Fixnum with Integer in ldap.toml.erb template [puppet] - 10https://gerrit.wikimedia.org/r/1034981 (https://phabricator.wikimedia.org/T358506) (owner: 10Andrea Denisse) [16:55:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1192 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P62926 and previous config saved to /var/cache/conftool/dbconfig/20240522-165558-root.json [16:56:01] (03PS1) 10Elukey: cache: fix and improve the code in the s3 module that allows a proxy [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/1035006 (https://phabricator.wikimedia.org/T344324) [17:00:05] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240522T1700) [17:02:01] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 33 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:07:21] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2002.wikimedia.org with OS bookworm [17:07:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9822720 
(10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2002.wikimedi... [17:08:48] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 50%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62928 and previous config saved to /var/cache/conftool/dbconfig/20240522-170848-arnaudb.json [17:09:07] (03PS1) 10Hnowlan: kask: checksum tls certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035010 (https://phabricator.wikimedia.org/T363996) [17:09:47] (03CR) 10CI reject: [V:04-1] kask: checksum tls certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035010 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [17:10:39] FIRING: [3x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:12:18] (03PS1) 10SBassett: Add wikimedia.org as a connect-src withing doc.wikimedia.org's CSP [puppet] - 10https://gerrit.wikimedia.org/r/1034928 (https://phabricator.wikimedia.org/T365097) [17:12:21] (03PS5) 10Btullis: [WIP] Add radosgw services to the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) [17:12:45] (03PS2) 10Hnowlan: kask: checksum tls certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035010 (https://phabricator.wikimedia.org/T363996) [17:13:31] (03CR) 10CI reject: [V:04-1] kask: checksum tls certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035010 (https://phabricator.wikimedia.org/T363996) (owner: 10Hnowlan) [17:14:02] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] 
- 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [17:15:20] (03CR) 10CI reject: [V:04-1] [WIP] Add radosgw services to the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [17:15:39] FIRING: [3x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [17:16:28] (03PS6) 10Btullis: [WIP] Add radosgw services to the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) [17:17:51] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 2 CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [17:18:53] (03CR) 10CI reject: [V:04-1] [WIP] Add radosgw services to the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [17:21:37] (03PS7) 10Btullis: [WIP] Add radosgw services to the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) [17:22:16] (03PS1) 10CDobbins: varnish: add better error page when HTTP status code 429 is returned [puppet] - 10https://gerrit.wikimedia.org/r/1035011 (https://phabricator.wikimedia.org/T354718) [17:22:57] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) 
(owner: 10Btullis) [17:23:55] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 75%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62929 and previous config saved to /var/cache/conftool/dbconfig/20240522-172354-arnaudb.json [17:24:00] (03PS3) 10Hnowlan: kask: checksum tls certs [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035010 (https://phabricator.wikimedia.org/T363996) [17:24:07] !log Setting DHCP in codfw row A to 'forward-only' mode to troubleshoot DHCP bug T365204 [17:24:10] 06SRE, 10DNS, 06Traffic, 10WikiLearn: DNS records for WikiLearn - https://phabricator.wikimedia.org/T365435#9822762 (10ssingh) Hi @jhathaway: I wanted to get your input about this. The request here is to add a DKIM record for wikimedia.org so that learn.wiki can allow sending email from comdevteam@wikimedi... [17:24:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:24:11] T365204: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204 [17:31:23] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 38 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:35:46] (03Abandoned) 10Bking: data-engineering: add scaffolding for airflow service [deployment-charts] - 10https://gerrit.wikimedia.org/r/1034951 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [17:36:24] FIRING: [3x] HelmReleaseBadStatus: Helm release device-analytics/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [17:37:07] (03PS1) 10Bking: dse-k8s: Add net-new service scaffolding for airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035013 
(https://phabricator.wikimedia.org/T363001) [17:37:48] (03CR) 10CI reject: [V:04-1] dse-k8s: Add net-new service scaffolding for airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035013 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [17:39:01] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2130 (re)pooling @ 100%: post reimage repool', diff saved to https://phabricator.wikimedia.org/P62930 and previous config saved to /var/cache/conftool/dbconfig/20240522-173900-arnaudb.json [17:41:05] (03PS2) 10Bking: dse-k8s: Add net-new service scaffolding for airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035013 (https://phabricator.wikimedia.org/T363001) [17:41:49] (03CR) 10CI reject: [V:04-1] dse-k8s: Add net-new service scaffolding for airflow [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035013 (https://phabricator.wikimedia.org/T363001) (owner: 10Bking) [17:45:13] FIRING: [2x] CertAlmostExpired: Certificate for service sessionstore:8081 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore:8081 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [17:48:18] (03PS1) 10Bking: dse-k8s: add airflow namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035015 (https://phabricator.wikimedia.org/T363001) [17:49:25] (03PS2) 10Bking: dse-k8s: add airflow namespace [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035015 (https://phabricator.wikimedia.org/T363001) [17:51:03] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 35 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [17:51:17] (03CR) 10Scott French: [C:03+2] aqs-http-gateway: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031497 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott 
French) [17:53:30] (03Merged) 10jenkins-bot: aqs-http-gateway: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1031497 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [17:55:49] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2002.wikimedia.org with OS bookworm [17:56:03] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9822855 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmooney@cumin1002 for host sretest2002.wikimedia.or... [17:57:44] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops-radar: ManagementSSHDown - parse1002 - https://phabricator.wikimedia.org/T365310#9822859 (10VRiley-WMF) 05Open→03In progress a:03VRiley-WMF [17:58:12] !log swfrench@deploy1002 helmfile [staging] START helmfile.d/services/device-analytics: apply [17:58:35] !log swfrench@deploy1002 helmfile [staging] DONE helmfile.d/services/device-analytics: apply [17:59:05] (03PS8) 10Btullis: [WIP] Add radosgw services to the cephosd servers [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) [18:00:26] (03CR) 10Btullis: [V:03+1] "PCC SUCCESS (DIFF 3 CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1034973 (https://phabricator.wikimedia.org/T330152) (owner: 10Btullis) [18:02:25] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops-radar: ManagementSSHDown - parse1002 - https://phabricator.wikimedia.org/T365310#9822887 (10VRiley-WMF) →14Duplicate dup:03T363086 [18:03:05] 10ops-eqiad, 06SRE, 06DC-Ops: ManagementSSHDown parse1002.eqiad.wmnet - https://phabricator.wikimedia.org/T363086#9822889 (10VRiley-WMF) [18:05:18] !log cmooney@cumin1002 START - Cookbook sre.hosts.reimage for host 
sretest2002.wikimedia.org with OS bookworm [18:05:32] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9822898 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmooney@cumin1002 for host sretest2002.wikimedi... [18:06:15] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9822895 (10cmooney) Ok seems like we have a solution. I added the "forward-only" statement to the EVPN switches in codfw row A: `... [18:09:31] FIRING: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:10:37] yeah that happened again [18:14:31] RESOLVED: [2x] ProbeDown: Service gerrit1003:443 has failed probes (http_gerrit_tls_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#gerrit1003:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:16:55] !log cmooney@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=93) for host sretest2002.wikimedia.org with OS bookworm [18:18:14] !log cmooney@cumin1002 START - Cookbook sre.hosts.decommission for hosts sretest2002.wikimedia.org [18:22:29] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:23:21] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, 
excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:29:19] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [18:29:44] (03PS1) 10Scott French: Revert "aqs-http-gateway: add securityContext to all containers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035017 (https://phabricator.wikimedia.org/T362978) [18:30:57] (03PS1) 10Jdlrobson: Small font size is not applying to excluded pages [skins/Vector] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034992 (https://phabricator.wikimedia.org/T364887) [18:31:15] FIRING: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 0.3106% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:32:31] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sretest2002.wikimedia.org decommissioned, removing all IPs except the asset tag one - cmooney@cumin1002" [18:32:54] (03PS1) 10Jdlrobson: Revert "Add exclusion behaviour for "width" option in Appearance menu" [skins/Vector] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034993 (https://phabricator.wikimedia.org/T364015) [18:32:55] (03CR) 10Scott French: "+cc @btullis@wikimedia.org and @jmeybohm@wikimedia.org FYI" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035017 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [18:33:21] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: sretest2002.wikimedia.org decommissioned, removing all IPs except the asset tag one - cmooney@cumin1002" [18:33:21] !log cmooney@cumin1002 END (PASS) - Cookbook 
sre.dns.netbox (exit_code=0) [18:33:22] !log cmooney@cumin1002 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts sretest2002.wikimedia.org [18:33:33] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9822951 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cmooney@cumin1002 for hosts: `sretest2002.wikimedia.or... [18:33:35] (03CR) 10Scott French: [C:03+2] Revert "aqs-http-gateway: add securityContext to all containers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035017 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [18:33:38] (03PS1) 10Brennen Bearnes: WIP: fix or suppress various shellcheck warnings [puppet] - 10https://gerrit.wikimedia.org/r/1035018 (https://phabricator.wikimedia.org/T364083) [18:33:57] FIRING: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:34:31] PROBLEM - Host mwlog1002 is DOWN: PING CRITICAL - Packet loss = 100% [18:35:12] (03Merged) 10jenkins-bot: Revert "aqs-http-gateway: add securityContext to all containers" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035017 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [18:35:15] FIRING: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 3.506s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [18:37:27] !incidents [18:37:28] 4697 (ACKED) 
ProbeDown sre (10.2.2.76 ip4 mw-api-ext:4447 probes/service http_mw-api-ext_ip4 eqiad) [18:37:28] 4691 (RESOLVED) ProbeDown sre (2620:0:861:101:10:64:0:117 ip6 kubemaster1001:6443 probes/custom http_eqiad_kube_apiserver_ip6 eqiad) [18:38:57] RESOLVED: ProbeDown: Service mw-api-ext:4447 has failed probes (http_mw-api-ext_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#mw-api-ext:4447 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:40:35] mc1050? https://grafana.wikimedia.org/goto/uaqC1-EIg?orgId=1 [18:41:08] hm I can't restore that link sukhe [18:41:15] RESOLVED: PHPFPMTooBusy: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at eqiad: 10.47% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [18:41:35] yeah weird sorry let me try again [18:41:56] https://grafana.wikimedia.org/d/000000316/memcache?from=1716402244172&orgId=1&to=1716402795716&viewPanel=58 [18:42:42] (03PS1) 10Majavah: Revert "openstack::bobcat: apply cloud yaml patch" [puppet] - 10https://gerrit.wikimedia.org/r/1034994 [18:42:45] (03PS1) 10Majavah: Revert "openstack: use bobcat/supported os for all tests" [puppet] - 10https://gerrit.wikimedia.org/r/1034995 [18:43:01] (03Abandoned) 10Majavah: Revert "openstack: use bobcat/supported os for all tests" [puppet] - 10https://gerrit.wikimedia.org/r/1034995 (owner: 10Majavah) [18:43:24] 10ops-codfw, 06SRE, 06DC-Ops, 06Infrastructure-Foundations, and 2 others: Problem re-imaging hosts on row-wide vlan on EVPN switches - https://phabricator.wikimedia.org/T365204#9822970 (10cmooney) @Papaul @Jhancock.wm I'm done with sretest2002 now and ran the decom cookbook so feel free to put it back in s... 
[18:43:35] sukhe: I see what you mean though, https://grafana-rw.wikimedia.org/d/000000316/memcache?orgId=1&from=1716400837691&to=1716403362127&var-datasource=eqiad%20prometheus%2Fops&var-cluster=memcached&var-instance=All&viewPanel=58 [18:43:46] yep that [18:43:47] in the past when we have seen this, this is the symptom and not the cause [18:43:53] (03CR) 10Majavah: [C:03+2] Revert "openstack::bobcat: apply cloud yaml patch" [puppet] - 10https://gerrit.wikimedia.org/r/1034994 (owner: 10Majavah) [18:43:56] (03PS1) 10Cathal Mooney: Set DHCP relay for EVPN switches in codfw to 'forward-only' mode [homer/public] - 10https://gerrit.wikimedia.org/r/1035019 (https://phabricator.wikimedia.org/T365204) [18:44:24] https://grafana.wikimedia.org/d/U7JT--knk/mediawiki-on-k8s?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&viewPanel=28&from=1716401582685&to=1716403436281 [18:44:41] https://grafana.wikimedia.org/d/U7JT--knk/mediawiki-on-k8s?orgId=1&var-dc=eqiad%20prometheus%2Fk8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All&from=1716401778603&to=1716403467832&viewPanel=27 [18:45:00] (03PS2) 10Cathal Mooney: Set DHCP relay for EVPN switches in codfw to 'forward-only' mode [homer/public] - 10https://gerrit.wikimedia.org/r/1035019 (https://phabricator.wikimedia.org/T365204) [18:45:15] :D [18:45:36] in the past this has been due to something like an edit made to a very widely-used template [18:45:49] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:45:52] and lots of pages being reparsed by jobqueue? 
or something [18:45:58] (pages == wiki pages) [18:46:25] and the one random memcache machine with lots of tx traffic, is the one with the key containing the wikitext of the template in question [18:46:25] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [18:46:54] so far what I'm seeing is consistent with an event like that [18:46:57] https://logstash.wikimedia.org/goto/476035c4221b5acddd325365754d770a [18:47:11] the decreasing `limit` line on that graph of CPU time has me a bit concerned we had a bunch of pods get killed because of the traffic [18:47:23] another thing to look at is parsercache db traffic during this time interval [18:48:05] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 98 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [18:49:43] certainly this seems consistent with pod restarts https://grafana.wikimedia.org/d/000000472/kubernetes-kubelets?orgId=1&from=1716402103969&to=1716403668286 [18:50:00] (03PS1) 10Scott French: aqs-http-gateway: no-op chart version bump after revert [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035020 (https://phabricator.wikimedia.org/T362978) [18:50:21] timing matches yep [18:50:48] (03Abandoned) 10Cathal Mooney: Drop NAK outbound from IRB interface with EVPN Anycast IRB [homer/public] - 10https://gerrit.wikimedia.org/r/1032791 (https://phabricator.wikimedia.org/T365204) (owner: 10Cathal Mooney) [18:51:56] (03PS2) 10Jdlrobson: Small font size is not applying to excluded pages [skins/Vector] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034992 (https://phabricator.wikimedia.org/T364887) [18:52:44] (03CR) 10Scott French: "+cc @btullis@wikimedia.org @jmeybohm@wikimedia.org FYI" [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1035020 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [18:52:53] yeap https://grafana-rw.wikimedia.org/d/-OcleDKIz/oom-kill?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=kubernetes&from=1716400904221&to=1716403945253 [18:53:39] RECOVERY - Host mwlog1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [18:53:44] (03CR) 10Scott French: [C:03+2] aqs-http-gateway: no-op chart version bump after revert [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035020 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [18:55:07] hmmm mwlog being down at the same time [18:55:12] (03Merged) 10jenkins-bot: aqs-http-gateway: no-op chart version bump after revert [deployment-charts] - 10https://gerrit.wikimedia.org/r/1035020 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [18:56:33] https://grafana.wikimedia.org/goto/yXGuxaEIR?orgId=1 hehe [18:56:52] yeah, I think it got DDoSed by appservers [18:56:56] so it's clear why mwlog went down I guess [18:57:09] thcipriani: would deploying the automoderator extension to testwiki warrant a dedicated deployment window? Trying to figure out what my next step is. [18:57:40] cdanis: so the OOM kills are a result of what exactly? 
[18:59:22] sukhe: I am guessing from mediawiki using too much memory and getting its pod killed [19:00:15] RESOLVED: MediaWikiLatencyExceeded: p75 latency high: eqiad mw-api-ext (k8s) 1.03s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [19:00:24] the kubernetes-level events I see are pods being restarted, and then taking too long to come back up [19:05:55] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 2/2 UP : OSPFv3: 2/2 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:06:36] (03PS3) 10Jdlrobson: Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) [19:08:05] PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 80, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:12:12] 06SRE, 10DNS, 06Traffic, 10WikiLearn: DNS records for WikiLearn - https://phabricator.wikimedia.org/T365435#9823026 (10jhathaway) @ssingh thanks for raising your concerns. I agree that our concerns are similar to those in T231387 and @mark's [recommendations](https://phabricator.wikimedia.org/T231387#54681... 
[19:13:50] (03CR) 10CI reject: [V:04-1] Small font size is not applying to excluded pages [skins/Vector] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034992 (https://phabricator.wikimedia.org/T364887) (owner: 10Jdlrobson) [19:16:46] FIRING: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [19:18:12] (03PS4) 10Mabualruz: deploy(Popups): Make use of conditional user defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034480 (https://phabricator.wikimedia.org/T364347) [19:18:39] (03PS5) 10Mabualruz: deploy(Popups): Make use of conditional user defaults [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034480 (https://phabricator.wikimedia.org/T364347) [19:18:54] (03CR) 10Mabualruz: deploy(Popups): Make use of conditional user defaults (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034480 (https://phabricator.wikimedia.org/T364347) (owner: 10Mabualruz) [19:25:39] FIRING: [3x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:26:13] RECOVERY - Router interfaces on cr2-eqsin is OK: OK: host 103.102.166.130, interfaces up: 82, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [19:27:09] (03PS27) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) [19:27:23] (03CR) 10Ottomata: Create eventlogging-processor legacy 
converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [19:27:51] (03CR) 10CI reject: [V:04-1] Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [19:27:59] PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 2/3 UP : OSPFv3: 2/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [19:43:38] (03CR) 10Kosta Harlan: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [19:48:46] (03CR) 10RLazarus: "Adding Daniel for Collab Services SRE, who (I think) own doc.wm.o -- I'm happy to help out, just don't want to step on toes. :)" [puppet] - 10https://gerrit.wikimedia.org/r/1034928 (https://phabricator.wikimedia.org/T365097) (owner: 10SBassett) [19:53:59] (03PS1) 10Fabfur: benthos:cache: drop part of haproxy internal messages [puppet] - 10https://gerrit.wikimedia.org/r/1035029 (https://phabricator.wikimedia.org/T359627) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC late backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240522T2000). [20:00:05] _Gerges and jan_drewniak: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[20:00:09] (03CR) 10Reedy: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [20:02:57] 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests: Raise limit of $wgMaxArticleSize for Hebrew Wikisource - https://phabricator.wikimedia.org/T275319#9823208 (10Roxette5) @cscott - many thanks to you for your help - both in regard to this thread and the work you do for wikimedia... [20:04:34] _Gerges, I see you're deploying a config patch, I can deploy that for you [20:06:14] so ping me if you're around, otherwise I'll start on my backports [20:07:26] (03CR) 10Jdrewniak: "recheck" [skins/Vector] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034992 (https://phabricator.wikimedia.org/T364887) (owner: 10Jdlrobson) [20:08:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034993 (https://phabricator.wikimedia.org/T364015) (owner: 10Jdlrobson) [20:08:36] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034992 (https://phabricator.wikimedia.org/T364887) (owner: 10Jdlrobson) [20:10:16] (03PS28) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) [20:10:22] (03CR) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [20:10:55] (03CR) 10CI reject: [V:04-1] Create eventlogging-processor legacy converter to proxy to eventgate for 
mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [20:12:41] (03PS29) 10Ottomata: Create eventlogging-processor legacy converter to proxy to eventgate for mediawiki.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/985023 (https://phabricator.wikimedia.org/T353817) [20:14:05] (03PS1) 10Cwhite: logstash: drop executing cql messages from image-suggestion [puppet] - 10https://gerrit.wikimedia.org/r/1034929 (https://phabricator.wikimedia.org/T365643) [20:15:12] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034993 (https://phabricator.wikimedia.org/T364015) (owner: 10Jdlrobson) [20:15:13] (03CR) 10TrainBranchBot: [C:03+2] "Approved by jdrewniak@deploy1002 using scap backport" [skins/Vector] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034992 (https://phabricator.wikimedia.org/T364887) (owner: 10Jdlrobson) [20:17:44] (03CR) 10Cwhite: [C:03+2] logstash: drop executing cql messages from image-suggestion [puppet] - 10https://gerrit.wikimedia.org/r/1034929 (https://phabricator.wikimedia.org/T365643) (owner: 10Cwhite) [20:22:31] (03CR) 10Stoyofuku-wmf: [C:03+1] "Looks good, thanks for sticking with this!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/1034480 (https://phabricator.wikimedia.org/T364347) (owner: 10Mabualruz) [20:22:49] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:23:01] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 32 probes of 800 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [20:23:19] RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 3/3 UP : OSPFv3: 3/3 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [20:28:56] RESOLVED: SystemdUnitFailed: update-ubuntu-mirror.service on mirror1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:32:44] (03Merged) 10jenkins-bot: Revert "Add exclusion behaviour for "width" option in Appearance menu" [skins/Vector] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034993 (https://phabricator.wikimedia.org/T364015) (owner: 10Jdlrobson) [20:32:45] (03PS1) 10JHathaway: jhathaway: update dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/1035040 [20:32:56] (03Merged) 10jenkins-bot: Small font size is not applying to excluded pages [skins/Vector] (wmf/1.43.0-wmf.6) - 10https://gerrit.wikimedia.org/r/1034992 (https://phabricator.wikimedia.org/T364887) (owner: 10Jdlrobson) [20:33:29] !log jdrewniak@deploy1002 Started scap: Backport for [[gerrit:1034993|Revert "Add exclusion behaviour for "width" option in Appearance menu" (T364015)]], [[gerrit:1034992|Small font size is not applying to excluded pages (T364887 T365408)]] [20:33:38] T364015: Exception handling for appearance settings (width) - Vector - https://phabricator.wikimedia.org/T364015 [20:33:39] T364887: It 
should be possible to set a different default font size on different pages for Vector 2022 - https://phabricator.wikimedia.org/T364887 [20:33:39] T365408: [Subtask[Regression] Vector 2022 uses wrong font size on talk pages - https://phabricator.wikimedia.org/T365408 [20:34:12] (03CR) 10JHathaway: [C:03+2] jhathaway: update dotfiles [puppet] - 10https://gerrit.wikimedia.org/r/1035040 (owner: 10JHathaway) [20:34:57] 10ops-eqiad, 06SRE, 06DC-Ops, 06serviceops: Q4:rack/setup/install kafka-main100[6789] and kafka-main1010 - https://phabricator.wikimedia.org/T363212#9823336 (10VRiley-WMF) We have an open ticket with Dell for kafka-main1009. We have been through a flea power drain, NVRAM clear, trying to enter the server i... [20:36:22] !log jdrewniak@deploy1002 jdrewniak and jdlrobson: Backport for [[gerrit:1034993|Revert "Add exclusion behaviour for "width" option in Appearance menu" (T364015)]], [[gerrit:1034992|Small font size is not applying to excluded pages (T364887 T365408)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [20:36:46] FIRING: SystemdUnitFailed: logrotate.service on moss-be1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [20:37:48] !log jdrewniak@deploy1002 jdrewniak and jdlrobson: Continuing with sync [20:39:36] !log STOPPED lucaswerkmeister-wmde@mwmaint1002:~$ time mwscript extensions/DiscussionTools/maintenance/persistRevisionThreadItems.php --wiki enwiki --current --all --start '["76318767"]' 2>&1 | tee -a ~/T315510-enwiki-5; date # ca. 
1 hour and 20 minutes ago, after running for a bit over 6 days; some errors [20:39:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:54] 06SRE, 06collaboration-services, 06Infrastructure-Foundations, 10Mail: Postfix outbound rollout sequence, mx-out - https://phabricator.wikimedia.org/T365395#9823354 (10jhathaway) [20:50:16] !log jdrewniak@deploy1002 Finished scap: Backport for [[gerrit:1034993|Revert "Add exclusion behaviour for "width" option in Appearance menu" (T364015)]], [[gerrit:1034992|Small font size is not applying to excluded pages (T364887 T365408)]] (duration: 16m 46s) [20:50:23] T364015: Exception handling for appearance settings (width) - Vector - https://phabricator.wikimedia.org/T364015 [20:50:23] T364887: It should be possible to set a different default font size on different pages for Vector 2022 - https://phabricator.wikimedia.org/T364887 [20:50:23] T365408: [Subtask[Regression] Vector 2022 uses wrong font size on talk pages - https://phabricator.wikimedia.org/T365408 [20:55:39] FIRING: [3x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:00:02] (03PS2) 10RLazarus: deployment_server: Rework mwscript_k8s flags [puppet] - 10https://gerrit.wikimedia.org/r/1034632 (https://phabricator.wikimedia.org/T341553) [21:00:05] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240522T2100) [21:02:49] (03CR) 10CI reject: [V:04-1] deployment_server: Rework mwscript_k8s flags [puppet] - 10https://gerrit.wikimedia.org/r/1034632 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [21:03:37] (03PS3) 10RLazarus: deployment_server: Rework mwscript_k8s flags [puppet] - 
10https://gerrit.wikimedia.org/r/1034632 (https://phabricator.wikimedia.org/T341553) [21:07:12] (03PS1) 10JHathaway: postfix: pull certs directly from acme chief [puppet] - 10https://gerrit.wikimedia.org/r/1035046 (https://phabricator.wikimedia.org/T364589) [21:07:38] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035046 (https://phabricator.wikimedia.org/T364589) (owner: 10JHathaway) [21:08:19] (03CR) 10RLazarus: [C:03+2] deployment_server: Rework mwscript_k8s flags (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1034632 (https://phabricator.wikimedia.org/T341553) (owner: 10RLazarus) [21:09:05] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650 (10RobH) 03NEW [21:09:27] 10ops-eqiad, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti1039 to ganeti1052 - https://phabricator.wikimedia.org/T365650#9823428 (10RobH) [21:13:40] (03PS2) 10JHathaway: postfix: pull certs directly from acme chief [puppet] - 10https://gerrit.wikimedia.org/r/1035046 (https://phabricator.wikimedia.org/T364589) [21:15:27] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035046 (https://phabricator.wikimedia.org/T364589) (owner: 10JHathaway) [21:19:24] (03CR) 10Dzahn: [C:03+2] Add wikimedia.org as a connect-src withing doc.wikimedia.org's CSP [puppet] - 10https://gerrit.wikimedia.org/r/1034928 (https://phabricator.wikimedia.org/T365097) (owner: 10SBassett) [21:24:14] (03CR) 10JHathaway: [C:03+2] postfix: pull certs directly from acme chief [puppet] - 10https://gerrit.wikimedia.org/r/1035046 (https://phabricator.wikimedia.org/T364589) (owner: 10JHathaway) [21:24:30] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations: Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651 (10RobH) 03NEW [21:24:52] 10ops-codfw, 06DC-Ops, 06Infrastructure-Foundations: 
Q4:rack/setup/install ganeti2035 to ganeti2044 - https://phabricator.wikimedia.org/T365651#9823486 (10RobH) [21:30:39] FIRING: [3x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:34:45] PROBLEM - ElasticSearch health check for shards on 9200 on cloudelastic1005 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f81f189e280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wik [21:34:45] imedia.org/wiki/Search%23Administration [21:36:24] FIRING: [3x] HelmReleaseBadStatus: Helm release device-analytics/main on k8s@codfw in state failed - https://wikitech.wikimedia.org/wiki/Kubernetes/Deployments#Rolling_back_in_an_emergency - https://alerts.wikimedia.org/?q=alertname%3DHelmReleaseBadStatus [21:36:45] RECOVERY - ElasticSearch health check for shards on 9200 on cloudelastic1005 is OK: OK - elasticsearch status cloudelastic-chi-eqiad: cluster_name: cloudelastic-chi-eqiad, status: yellow, timed_out: False, number_of_nodes: 6, number_of_data_nodes: 6, active_primary_shards: 798, active_shards: 1429, relocating_shards: 0, initializing_shards: 26, unassigned_shards: 110, delayed_unassigned_shards: 0, number_of_pending_tasks: 7, number_of_in [21:36:45] etch: 0, task_max_waiting_in_queue_millis: 257, active_shards_percent_as_number: 91.30990415335464 https://wikitech.wikimedia.org/wiki/Search%23Administration [21:40:39] FIRING: [3x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1005-cloudelastic-chi-eqiad is running the 
old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:42:05] (03PS1) 10JHathaway: phabricator: Move outbound email to mx-out{1001,2001}.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1035050 (https://phabricator.wikimedia.org/T365395) [21:44:23] (03CR) 10JHathaway: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1035050 (https://phabricator.wikimedia.org/T365395) (owner: 10JHathaway) [21:45:13] FIRING: [2x] CertAlmostExpired: Certificate for service sessionstore:8081 is about to expire - https://wikitech.wikimedia.org/wiki/TLS/Runbook#sessionstore:8081 - TODO - https://alerts.wikimedia.org/?q=alertname%3DCertAlmostExpired [21:47:38] !log @deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [21:47:43] !log @deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:54:06] !log T363973 Finished manual rolling restart of hadoop masters `an-master100[3,4].eqiad.wmnet` [21:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:53] !log ryankemper@cumin2002 START - Cookbook sre.hadoop.roll-restart-masters restart masters for Hadoop test cluster: Restart of jvm daemons. 
[22:12:18] (03PS1) 10Ryan Kemper: ryankemper: add some bash config [puppet] - 10https://gerrit.wikimedia.org/r/1035052 [22:15:39] FIRING: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [22:23:01] PROBLEM - Postgres Replication Lag on puppetdb2003 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 29802536 and 2 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:24:01] RECOVERY - Postgres Replication Lag on puppetdb2003 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 44304 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [22:24:03] !log ryankemper@cumin2002 END (PASS) - Cookbook sre.hadoop.roll-restart-masters (exit_code=0) restart masters for Hadoop test cluster: Restart of jvm daemons. 
[22:25:39] RESOLVED: [2x] CirrusSearchHighOldGCFrequency: Elasticsearch instance cloudelastic1006-cloudelastic-chi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:03:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T364299)', diff saved to https://phabricator.wikimedia.org/P62932 and previous config saved to /var/cache/conftool/dbconfig/20240522-230350-marostegui.json [23:03:55] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [23:18:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P62933 and previous config saved to /var/cache/conftool/dbconfig/20240522-231858-marostegui.json [23:28:17] (03PS1) 10Ebernhardson: cirrus: Keep archive writes running through cirrus [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1035061 [23:34:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211', diff saved to https://phabricator.wikimedia.org/P62934 and previous config saved to /var/cache/conftool/dbconfig/20240522-233406-marostegui.json [23:38:07] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1034930 [23:38:07] (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1034930 (owner: 10TrainBranchBot) [23:49:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1211 (T364299)', diff saved to https://phabricator.wikimedia.org/P62935 and previous config saved to /var/cache/conftool/dbconfig/20240522-234914-marostegui.json [23:49:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime 
for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [23:49:19] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299 [23:49:30] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1214.eqiad.wmnet with reason: Maintenance [23:49:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1214 (T364299)', diff saved to https://phabricator.wikimedia.org/P62936 and previous config saved to /var/cache/conftool/dbconfig/20240522-234937-marostegui.json [23:59:01] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1034930 (owner: 10TrainBranchBot)