[00:37:52] (PS1) Eevans: sessionstore: updated list of Cassandra hosts [deployment-charts] - https://gerrit.wikimedia.org/r/994357 (https://phabricator.wikimedia.org/T353402)
[00:38:28] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/994370
[00:38:34] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/994370 (owner: TrainBranchBot)
[00:45:41] (ProbeDown) firing: (4) Service debmonitor2002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:47:06] PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:15] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/994370 (owner: TrainBranchBot)
[01:35:04] (CR) Ssingh: [C: +1] "Thanks for the patch and the cleanup!"
[dns] - https://gerrit.wikimedia.org/r/994333 (https://phabricator.wikimedia.org/T355833) (owner: JHathaway)
[01:38:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[02:26:17] (PS2) Andrea Denisse: grafana: Ensure user traffic goes to grafana2001 [puppet] - https://gerrit.wikimedia.org/r/992719 (https://phabricator.wikimedia.org/T352665)
[02:39:25] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:05:19] !log fab@deploy2002 Started deploy [airflow-dags/research@6a97a34]: (no justification provided)
[03:05:42] !log fab@deploy2002 Finished deploy [airflow-dags/research@6a97a34]: (no justification provided) (duration: 00m 23s)
[03:09:25] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:22:34] PROBLEM - Host fasw-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:22:58] PROBLEM - Host asw2-a-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:23:12] PROBLEM - Host asw2-b-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:23:14] PROBLEM - Host ps1-e6-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:23:28] PROBLEM - Host ps1-e2-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:23:28] PROBLEM - Host ps1-e3-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:23:28] PROBLEM - Host ps1-f5-eqiad is
DOWN: PING CRITICAL - Packet loss = 100%
[03:23:28] PROBLEM - Host ps1-e8-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:23:28] PROBLEM - Host ps1-f3-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:23:32] PROBLEM - Host ps1-f8-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:23:36] PROBLEM - Host asw2-d-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:23:42] PROBLEM - Host ps1-f4-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:23:48] PROBLEM - Host asw2-c-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:23:56] PROBLEM - Host ps1-e1-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:24:00] PROBLEM - Host ps1-f6-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:24:00] PROBLEM - Host ps1-f2-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:24:00] PROBLEM - Host ps1-e7-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:24:04] PROBLEM - Host ps1-f7-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:24:04] PROBLEM - Host ps1-e4-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:24:04] PROBLEM - Host ps1-e5-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:24:08] PROBLEM - Host ps1-f1-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[03:25:18] PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:26:00] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:27:42] PROBLEM - Host mr1-eqiad.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[03:28:26] PROBLEM - Host mr1-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[03:29:25] (JobUnavailable) firing: Reduced availability for job pdu_sentry4 in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:34:08]
RECOVERY - Host ps1-f6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.23 ms
[03:34:08] RECOVERY - Host ps1-f1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.30 ms
[03:34:08] RECOVERY - Host ps1-f5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms
[03:34:08] RECOVERY - Host ps1-f3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.22 ms
[03:34:08] RECOVERY - Host ps1-f7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.88 ms
[03:34:08] RECOVERY - Host ps1-e5-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.67 ms
[03:34:10] RECOVERY - Host ps1-e6-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.14 ms
[03:34:10] RECOVERY - Host ps1-e8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 4.59 ms
[03:34:10] RECOVERY - Host ps1-f8-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.05 ms
[03:34:10] RECOVERY - Host ps1-e2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.17 ms
[03:34:11] RECOVERY - Host ps1-f2-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.23 ms
[03:34:11] RECOVERY - Host ps1-e4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 3.74 ms
[03:34:12] RECOVERY - Host ps1-e3-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.08 ms
[03:34:12] RECOVERY - Host ps1-e7-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.20 ms
[03:34:13] RECOVERY - Host ps1-f4-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms
[03:34:13] RECOVERY - Host ps1-e1-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.30 ms
[03:34:14] RECOVERY - Host asw2-a-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.15 ms
[03:34:18] RECOVERY - Host asw2-c-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.80 ms
[03:34:24] RECOVERY - Host asw2-d-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.75 ms
[03:34:34] RECOVERY - Host asw2-b-eqiad is UP: PING OK - Packet loss = 0%, RTA = 1.04 ms
[03:34:40] RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:34:58] RECOVERY - Host fasw-c-eqiad is UP: PING OK - Packet loss = 0%,
RTA = 0.58 ms
[03:35:18] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:38:58] RECOVERY - Host mr1-eqiad.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.69 ms
[03:39:01] (JobUnavailable) resolved: Reduced availability for job pdu_sentry4 in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:39:42] RECOVERY - Host mr1-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 1.34 ms
[04:45:41] (ProbeDown) firing: (4) Service debmonitor2002:7443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[05:11:48] (PS1) DLynch: decodeURI fragments before sending them to discussiontoolsfindcomment [extensions/DiscussionTools] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/994234 (https://phabricator.wikimedia.org/T356199)
[05:12:12] (PS1) DLynch: decodeURI fragments before sending them to discussiontoolsfindcomment [extensions/DiscussionTools] (wmf/1.42.0-wmf.16) - https://gerrit.wikimedia.org/r/994235 (https://phabricator.wikimedia.org/T356199)
[05:30:16] !log fab@deploy2002 Started deploy [airflow-dags/research@97c6a4e]: (no justification provided)
[05:30:30] !log fab@deploy2002 Finished deploy [airflow-dags/research@97c6a4e]: (no justification provided) (duration: 00m 14s)
[05:33:38] SRE, ops-codfw, DC-Ops, Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350 (Marostegui) >>! In T355350#9500458, @Papaul wrote: > @Marostegui if those hosts have a 10G NIC you don't have a problem for those going into row A and B to connect them to...
[05:37:15] (PS1) Marostegui: Revert "db1224: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/994236
[05:38:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh
[05:50:20] (CR) Marostegui: [C: +2] Revert "db1224: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/994236 (owner: Marostegui)
[05:50:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 1%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P55904 and previous config saved to /var/cache/conftool/dbconfig/20240131-055057-root.json
[05:51:19] SRE, ops-eqiad, DBA, DC-Ops: db1224 crashed - hardware error - https://phabricator.wikimedia.org/T354591 (Marostegui) Open→Resolved I have started to repool this host. Thanks for your help John!
[05:53:32] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance
[05:53:46] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2097.codfw.wmnet with reason: Maintenance
[06:03:06] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2107.codfw.wmnet with reason: Maintenance
[06:03:31] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2107.codfw.wmnet with reason: Maintenance
[06:03:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2107 (T355609)', diff saved to https://phabricator.wikimedia.org/P55905 and previous config saved to /var/cache/conftool/dbconfig/20240131-060337-marostegui.json
[06:03:43] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[06:06:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 5%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P55906 and previous config saved to /var/cache/conftool/dbconfig/20240131-060602-root.json
[06:13:40] SRE, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870 (Marostegui) Once T355862 is done, es2021 needs to be switched back to be es4 slave (reverting all this T356064)
[06:13:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2107 (T355609)', diff saved to https://phabricator.wikimedia.org/P55907 and previous config saved to /var/cache/conftool/dbconfig/20240131-061340-marostegui.json
[06:13:46] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[06:18:16] SRE, DBA, Infrastructure-Foundations, netops: Migrate servers in codfw rack B3 from asw-b3-codfw to lsw1-b3-codfw - https://phabricator.wikimedia.org/T355870 (Marostegui)
[06:19:32] !log marostegui@cumin1002
dbctl commit (dc=all): 'Depool db2114 T354506', diff saved to https://phabricator.wikimedia.org/P55908 and previous config saved to /var/cache/conftool/dbconfig/20240131-061932-root.json
[06:19:38] T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506
[06:20:53] (PS1) Marostegui: db2114: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/994610 (https://phabricator.wikimedia.org/T354506)
[06:21:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 10%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P55909 and previous config saved to /var/cache/conftool/dbconfig/20240131-062109-root.json
[06:22:18] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2114.codfw.wmnet with OS bookworm
[06:23:18] (CR) Marostegui: "recheck" [puppet] - https://gerrit.wikimedia.org/r/994610 (https://phabricator.wikimedia.org/T354506) (owner: Marostegui)
[06:25:07] (CR) Marostegui: [C: +2] db2114: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/994610 (https://phabricator.wikimedia.org/T354506) (owner: Marostegui)
[06:28:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2107', diff saved to https://phabricator.wikimedia.org/P55910 and previous config saved to /var/cache/conftool/dbconfig/20240131-062846-marostegui.json
[06:29:14] (PS1) Marostegui: db-production.php: Disable writes on es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/994611 (https://phabricator.wikimedia.org/T356235)
[06:35:17] !log marostegui@cumin1002 START - Cookbook sre.hosts.reimage for host db2142.codfw.wmnet with OS bookworm
[06:36:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 25%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P55911 and previous config saved to /var/cache/conftool/dbconfig/20240131-063613-root.json
[06:36:21] !log marostegui@cumin1002 START -
Cookbook sre.hosts.downtime for 2:00:00 on db2114.codfw.wmnet with reason: host reimage
[06:39:36] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2114.codfw.wmnet with reason: host reimage
[06:43:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2107', diff saved to https://phabricator.wikimedia.org/P55912 and previous config saved to /var/cache/conftool/dbconfig/20240131-064353-marostegui.json
[06:47:14] !log installing glibc security updates on bookworm
[06:47:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:48:25] (PS1) Marostegui: Revert "db2114: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/994237
[06:51:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 50%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P55913 and previous config saved to /var/cache/conftool/dbconfig/20240131-065118-root.json
[06:53:16] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2142.codfw.wmnet with reason: host reimage
[06:54:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2114.codfw.wmnet with OS bookworm
[06:56:32] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2142.codfw.wmnet with reason: host reimage
[06:59:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2107 (T355609)', diff saved to https://phabricator.wikimedia.org/P55914 and previous config saved to /var/cache/conftool/dbconfig/20240131-065901-marostegui.json
[06:59:03] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2125.codfw.wmnet with reason: Maintenance
[06:59:12] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[06:59:16] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for
6:00:00 on db2125.codfw.wmnet with reason: Maintenance
[06:59:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2125 (T355609)', diff saved to https://phabricator.wikimedia.org/P55915 and previous config saved to /var/cache/conftool/dbconfig/20240131-065922-marostegui.json
[07:00:05] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240131T0700)
[07:00:52] (CR) Marostegui: [C: +2] Revert "db2114: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/994237 (owner: Marostegui)
[07:01:11] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 1%: After Bookworm upgrade T354506', diff saved to https://phabricator.wikimedia.org/P55916 and previous config saved to /var/cache/conftool/dbconfig/20240131-070111-root.json
[07:01:17] T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506
[07:06:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 75%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P55917 and previous config saved to /var/cache/conftool/dbconfig/20240131-070624-root.json
[07:08:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor2003.codfw.wmnet
[07:10:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T355609)', diff saved to https://phabricator.wikimedia.org/P55918 and previous config saved to /var/cache/conftool/dbconfig/20240131-071002-marostegui.json
[07:10:10] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[07:10:28] RECOVERY - Check systemd state on debmonitor2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:12:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2003.codfw.wmnet
[07:12:50] (CR) Ayounsi: [C: +2] DNS: add includes
for private1-virtual-codfw DNS PTRs [dns] - https://gerrit.wikimedia.org/r/994246 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[07:13:00] (PS2) Ayounsi: DNS: add includes for private1-virtual-codfw DNS PTRs [dns] - https://gerrit.wikimedia.org/r/994246 (https://phabricator.wikimedia.org/T300152)
[07:16:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 5%: After Bookworm upgrade T354506', diff saved to https://phabricator.wikimedia.org/P55919 and previous config saved to /var/cache/conftool/dbconfig/20240131-071616-root.json
[07:16:22] T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506
[07:16:33] (CR) Ayounsi: "For what I saw, it *should* be enough, but of course we're not safe from some special cases. That said, for special cases, if they happen," [puppet] - https://gerrit.wikimedia.org/r/994223 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[07:16:53] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db2142.codfw.wmnet with OS bookworm
[07:19:05] (CR) Arnaudb: [C: +1] db-production.php: Disable writes on es5 [mediawiki-config] - https://gerrit.wikimedia.org/r/994611 (https://phabricator.wikimedia.org/T356235) (owner: Marostegui)
[07:21:29] !log marostegui@cumin1002 dbctl commit (dc=all): 'db1224 (re)pooling @ 100%: After onsite maintenance', diff saved to https://phabricator.wikimedia.org/P55920 and previous config saved to /var/cache/conftool/dbconfig/20240131-072129-root.json
[07:21:41] (CR) Arnaudb: [C: +1] "it makes the CI fail to remove an entry apparently" [puppet] - https://gerrit.wikimedia.org/r/994252 (https://phabricator.wikimedia.org/T356203) (owner: Dzahn)
[07:24:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor1003.eqiad.wmnet
[07:25:09] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to
https://phabricator.wikimedia.org/P55921 and previous config saved to /var/cache/conftool/dbconfig/20240131-072509-marostegui.json
[07:30:05] RECOVERY - debmonitor.wikimedia.org:7443 CDN SSL Expiry on debmonitor1003 is OK: OK - Certificate debmonitor.discovery.wmnet will expire on Tue 27 Feb 2024 10:49:00 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Debmonitor
[07:30:05] RECOVERY - debmonitor.wikimedia.org:7443 CDN on debmonitor1003 is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 552 bytes in 0.050 second response time https://wikitech.wikimedia.org/wiki/Debmonitor
[07:30:05] RECOVERY - Check systemd state on debmonitor1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:30:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1003.eqiad.wmnet
[07:31:22] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 10%: After Bookworm upgrade T354506', diff saved to https://phabricator.wikimedia.org/P55922 and previous config saved to /var/cache/conftool/dbconfig/20240131-073121-root.json
[07:31:28] T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506
[07:33:34] (PS1) Muehlenhoff: Switch ganeti/routed PoC servers to nftables [puppet] - https://gerrit.wikimedia.org/r/994661 (https://phabricator.wikimedia.org/T300152)
[07:38:28] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[07:38:32] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.dns.netbox (exit_code=97)
[07:39:03] (CR) Ayounsi: [C: +1] Switch ganeti/routed PoC servers to nftables [puppet] - https://gerrit.wikimedia.org/r/994661 (https://phabricator.wikimedia.org/T300152) (owner: Muehlenhoff)
[07:39:49] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox
[07:40:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P55923 and
previous config saved to /var/cache/conftool/dbconfig/20240131-074015-marostegui.json
[07:41:37] (CR) Andrea Denisse: "Hi, I added failover process instructions in: https://phabricator.wikimedia.org/T352665#9501142" [puppet] - https://gerrit.wikimedia.org/r/992719 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[07:42:10] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: codfw routed cluster tap - ayounsi@cumin1002"
[07:42:34] (CR) Andrea Denisse: "Hi, I added failover process instructions in: https://phabricator.wikimedia.org/T352665#9501142" [puppet] - https://gerrit.wikimedia.org/r/992710 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[07:43:14] (CR) Muehlenhoff: [C: +2] Switch ganeti/routed PoC servers to nftables [puppet] - https://gerrit.wikimedia.org/r/994661 (https://phabricator.wikimedia.org/T300152) (owner: Muehlenhoff)
[07:43:18] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: codfw routed cluster tap - ayounsi@cumin1002"
[07:43:18] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[07:45:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-rw1001.wikimedia.org
[07:45:40] (CR) Slyngshede: [C: +2] P:debmonitor::server rework debmonitor http monitoring.
[puppet] - https://gerrit.wikimedia.org/r/988490 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede)
[07:46:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 25%: After Bookworm upgrade T354506', diff saved to https://phabricator.wikimedia.org/P55924 and previous config saved to /var/cache/conftool/dbconfig/20240131-074627-root.json
[07:46:33] T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506
[07:49:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw1001.wikimedia.org
[07:50:47] PROBLEM - Check systemd state on ganeti2033 is CRITICAL: CRITICAL - degraded: The following units failed: nftables.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:53] (CR) WMDE-Fisch: "Thanks, this was quite unfortunate." [extensions/Popups] (wmf/1.42.0-wmf.15) - https://gerrit.wikimedia.org/r/994028 (https://phabricator.wikimedia.org/T355933) (owner: WMDE-Fisch)
[07:54:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-rw2001.wikimedia.org
[07:55:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T355609)', diff saved to https://phabricator.wikimedia.org/P55925 and previous config saved to /var/cache/conftool/dbconfig/20240131-075522-marostegui.json
[07:55:25] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance
[07:55:28] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[07:55:39] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2126.codfw.wmnet with reason: Maintenance
[07:55:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2187.codfw.wmnet with reason: Maintenance
[07:55:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2187.codfw.wmnet
with reason: Maintenance
[07:56:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2126 (T355609)', diff saved to https://phabricator.wikimedia.org/P55926 and previous config saved to /var/cache/conftool/dbconfig/20240131-075600-marostegui.json
[07:57:45] PROBLEM - Host sretest2005 is DOWN: PING CRITICAL - Packet loss = 100%
[07:59:01] PROBLEM - Host ganeti2034 is DOWN: PING CRITICAL - Packet loss = 100%
[07:59:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-rw2001.wikimedia.org
[07:59:24] PROBLEM - Host ganeti2033 is DOWN: PING CRITICAL - Packet loss = 100%
[08:00:07] Amir1 and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240131T0800).
[08:00:07] No Gerrit patches in the queue for this window AFAICS.
[08:01:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T355609)', diff saved to https://phabricator.wikimedia.org/P55927 and previous config saved to /var/cache/conftool/dbconfig/20240131-080117-marostegui.json
[08:01:22] RECOVERY - Host ganeti2033 is UP: PING OK - Packet loss = 0%, RTA = 30.39 ms
[08:01:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 50%: After Bookworm upgrade T354506', diff saved to https://phabricator.wikimedia.org/P55928 and previous config saved to /var/cache/conftool/dbconfig/20240131-080132-root.json
[08:01:34] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609
[08:01:40] RECOVERY - Host ganeti2034 is UP: PING OK - Packet loss = 0%, RTA = 30.84 ms
[08:01:46] PROBLEM - Check unit status of netbox_ganeti_codfw02_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw02_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[08:01:50] T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506
[08:02:51] (PS2)
Filippo Giunchedi: oauth2-proxy: run as nobody or explicit uid [docker-images/production-images] - https://gerrit.wikimedia.org/r/994182 (https://phabricator.wikimedia.org/T320555)
[08:04:40] PROBLEM - Check systemd state on ganeti2034 is CRITICAL: CRITICAL - degraded: The following units failed: nftables.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:05:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt1002.wikimedia.org
[08:05:23] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idm-test1001.wikimedia.org
[08:06:12] (CR) Filippo Giunchedi: [V: +2 C: +2] oauth2-proxy: run as nobody or explicit uid [docker-images/production-images] - https://gerrit.wikimedia.org/r/994182 (https://phabricator.wikimedia.org/T320555) (owner: Filippo Giunchedi)
[08:06:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:06:38] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw02_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:09:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt1002.wikimedia.org
[08:09:14] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm-test1001.wikimedia.org
[08:09:39] (CR) Volans: [C: +1] "Ok, I guess it's ok for both opt82 and MAC matching in spicerack for physical and VMs respectively.
I'm not sure this will have any effect" [puppet] - https://gerrit.wikimedia.org/r/994223 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[08:09:41] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idm2001.wikimedia.org
[08:09:51] !log installing ca-certificates-java bugfix updates from bookworm 12.4 point release
[08:09:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:11:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:12:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[08:13:33] SRE, Infrastructure-Foundations: Integrate Bookworm 12.3/12.4 point update - https://phabricator.wikimedia.org/T353057 (MoritzMuehlenhoff)
[08:13:40] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[08:13:47] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm2001.wikimedia.org
[08:14:00] !log slyngshede@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM idm1001.wikimedia.org
[08:14:42] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:16:04] PROBLEM - Check systemd state on ganeti2033 is CRITICAL:
CRITICAL - degraded: The following units failed: nftables.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:16:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P55929 and previous config saved to /var/cache/conftool/dbconfig/20240131-081624-marostegui.json
[08:16:30] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2004.codfw.wmnet
[08:16:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 75%: After Bookworm upgrade T354506', diff saved to https://phabricator.wikimedia.org/P55930 and previous config saved to /var/cache/conftool/dbconfig/20240131-081637-root.json
[08:16:43] T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506
[08:17:32] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply
[08:18:04] !log slyngshede@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm1001.wikimedia.org
[08:20:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2004.codfw.wmnet
[08:20:41] (ProbeDown) firing: (8) Service debmonitor1002:443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:21:19] !log installing systemd bugfix updates from bookworm 12.4 point release
[08:21:20] RECOVERY - Check systemd state on ganeti2033 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:22:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) -
https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [08:22:18] RECOVERY - Check unit status of netbox_ganeti_codfw02_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw02_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [08:23:29] (03PS1) 10Ayounsi: Enable forwarding more broadly and fix nftables bug [puppet] - 10https://gerrit.wikimedia.org/r/994663 (https://phabricator.wikimedia.org/T300152) [08:24:38] (03CR) 10CI reject: [V: 04-1] Enable forwarding more broadly and fix nftables bug [puppet] - 10https://gerrit.wikimedia.org/r/994663 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [08:25:46] (03PS1) 10Filippo Giunchedi: WIP add jaeger tbd to idp [puppet] - 10https://gerrit.wikimedia.org/r/994664 [08:25:48] (03PS1) 10Filippo Giunchedi: hieradata: use trace.w.o for oidc jaeger [puppet] - 10https://gerrit.wikimedia.org/r/994665 (https://phabricator.wikimedia.org/T320555) [08:26:47] (03PS2) 10Filippo Giunchedi: hieradata: add jaeger config for SSO oidc [puppet] - 10https://gerrit.wikimedia.org/r/994664 (https://phabricator.wikimedia.org/T320555) [08:26:53] (03PS2) 10Ayounsi: Enable forwarding more broadly and fix nftables bug [puppet] - 10https://gerrit.wikimedia.org/r/994663 (https://phabricator.wikimedia.org/T300152) [08:27:42] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/994663 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [08:27:47] !log installing systemd bugfix updates from bookworm 12.4 point release [08:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:04] (03CR) 10CI reject: [V: 04-1] Enable forwarding more broadly and fix nftables bug [puppet] - 
10https://gerrit.wikimedia.org/r/994663 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [08:29:10] (03PS3) 10Ayounsi: Enable forwarding more broadly and fix nftables bug [puppet] - 10https://gerrit.wikimedia.org/r/994663 (https://phabricator.wikimedia.org/T300152) [08:29:15] (03Abandoned) 10Filippo Giunchedi: hieradata: use trace.w.o for oidc jaeger [puppet] - 10https://gerrit.wikimedia.org/r/994665 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [08:29:17] (03CR) 10Muehlenhoff: "nft syntax is fine, question inline for the other part." [puppet] - 10https://gerrit.wikimedia.org/r/994663 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [08:31:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P55931 and previous config saved to /var/cache/conftool/dbconfig/20240131-083130-marostegui.json [08:31:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2114 (re)pooling @ 100%: After Bookworm upgrade T354506', diff saved to https://phabricator.wikimedia.org/P55932 and previous config saved to /var/cache/conftool/dbconfig/20240131-083142-root.json [08:31:48] T354506: Upgrade s6 hosts to Bookworm - https://phabricator.wikimedia.org/T354506 [08:32:23] (03CR) 10Ayounsi: Enable forwarding more broadly and fix nftables bug (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/994663 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [08:33:13] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/994663 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [08:34:06] 10SRE, 10Infrastructure-Foundations: Integrate Bookworm 12.3/12.4 point update - https://phabricator.wikimedia.org/T353057 (10MoritzMuehlenhoff) [08:34:28] (03CR) 10Ayounsi: [C: 03+2] Enable forwarding more broadly and fix nftables bug [puppet] - 10https://gerrit.wikimedia.org/r/994663 
(https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [08:36:33] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host crm2001.codfw.wmnet [08:36:46] RECOVERY - Host sretest2005 is UP: PING OK - Packet loss = 0%, RTA = 30.95 ms [08:40:19] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host grafana2001.codfw.wmnet [08:40:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host crm2001.codfw.wmnet [08:44:29] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host grafana2001.codfw.wmnet [08:44:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1002.eqiad.wmnet [08:45:02] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host titan2001.codfw.wmnet [08:46:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T355609)', diff saved to https://phabricator.wikimedia.org/P55934 and previous config saved to /var/cache/conftool/dbconfig/20240131-084637-marostegui.json [08:46:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance [08:46:43] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [08:46:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2138.codfw.wmnet with reason: Maintenance [08:47:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2138:3312 (T355609)', diff saved to https://phabricator.wikimedia.org/P55935 and previous config saved to /var/cache/conftool/dbconfig/20240131-084700-marostegui.json [08:47:40] (03PS1) 10Ayounsi: Routed Ganeti: rollback global v6 forwarding [puppet] - 10https://gerrit.wikimedia.org/r/994666 (https://phabricator.wikimedia.org/T300152) [08:48:36] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - 
https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [08:49:22] (ProbeDown) firing: (16) Service debmonitor1002:443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:51:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host sretest1002.eqiad.wmnet [08:52:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host sretest1003.eqiad.wmnet [08:53:36] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [08:54:14] (03PS1) 10Ayounsi: Routed Ganeti: fix nftables bug [puppet] - 10https://gerrit.wikimedia.org/r/994667 (https://phabricator.wikimedia.org/T300152) [08:55:01] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan2001.codfw.wmnet [08:55:41] (ProbeDown) firing: (18) Service debmonitor1002:443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:57:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T355609)', diff saved to https://phabricator.wikimedia.org/P55936 and previous config saved to /var/cache/conftool/dbconfig/20240131-085719-marostegui.json [08:57:23] jouncebot: nowandnext [08:57:23] For 
the next 0 hour(s) and 2 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240131T0800) [08:57:24] In 0 hour(s) and 2 minute(s): MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240131T0900) [08:57:25] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [08:57:36] oh, it's almost train time [09:00:05] dancy and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240131T0900) [09:01:11] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host titan2002.codfw.wmnet [09:02:08] PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast, AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [09:05:40] (03CR) 10Hashar: gerrit: sync soy email template with version 3.7 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/993695 (https://phabricator.wikimedia.org/T355259) (owner: 10Hashar) [09:06:10] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/994667 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [09:06:12] (03PS3) 10Hashar: gerrit: sync soy email template with version 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/993695 (https://phabricator.wikimedia.org/T355259) [09:07:59] !log jmm@cumin2002 END (ERROR) - Cookbook sre.hosts.reboot-single (exit_code=97) for host sretest1003.eqiad.wmnet [09:08:15] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan2002.codfw.wmnet [09:09:22] (ProbeDown) firing: (16) Service debmonitor1002:443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[09:12:10] (03CR) 10Ayounsi: [C: 03+2] Routed Ganeti: rollback global v6 forwarding [puppet] - 10https://gerrit.wikimedia.org/r/994666 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [09:12:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P55937 and previous config saved to /var/cache/conftool/dbconfig/20240131-091226-marostegui.json [09:12:29] (03CR) 10Ayounsi: [C: 03+2] Routed Ganeti: fix nftables bug [puppet] - 10https://gerrit.wikimedia.org/r/994667 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [09:14:28] (ProbeDown) firing: Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:14:34] PROBLEM - SSH on vrts1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:18:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast4005.wikimedia.org [09:22:41] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4005.wikimedia.org [09:24:27] (ProbeDown) firing: (3) Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:25:14] (03PS1) 10Clément Goubert: kubernetes: make 3 api_appservers kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/994669 (https://phabricator.wikimedia.org/T351074) [09:25:48] (PuppetFailure) firing: Puppet has failed on ganeti2034:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - 
https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:26:36] RECOVERY - SSH on vrts1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:26:38] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: apache2.service,cadvisor.service,clamav-freshclam.service,exim4.service,rsync.service,stunnel4.service,ulogd2.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:27:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast5004.wikimedia.org [09:27:18] PROBLEM - freshclam running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name freshclam https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [09:27:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P55938 and previous config saved to /var/cache/conftool/dbconfig/20240131-092733-marostegui.json [09:28:10] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:28:52] RECOVERY - freshclam running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name freshclam https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [09:29:26] RECOVERY - Check systemd state on ganeti2034 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:29:27] (ProbeDown) resolved: (3) Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [09:33:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast5004.wikimedia.org [09:35:48] 
(PuppetFailure) resolved: Puppet has failed on ganeti2034:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [09:38:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast2003.wikimedia.org [09:39:02] 10SRE, 10MW-on-K8s, 10Quality-and-Test-Engineering-Team, 10serviceops: Move testwiki over to mw-on-k8s - https://phabricator.wikimedia.org/T355534 (10Clement_Goubert) p:05Triage→03Medium [09:39:33] 10SRE, 10Infrastructure-Foundations, 10serviceops, 10Patch-For-Review: httpbb needs to be setup on cumin1002 and removed from cumin1001 - https://phabricator.wikimedia.org/T356054 (10Clement_Goubert) p:05Triage→03Medium [09:41:10] (03CR) 10Clément Goubert: [V: 03+1] "Scott do you want to take a look at how we can remove these timers through puppet?" [puppet] - 10https://gerrit.wikimedia.org/r/993710 (https://phabricator.wikimedia.org/T356054) (owner: 10Clément Goubert) [09:42:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T355609)', diff saved to https://phabricator.wikimedia.org/P55939 and previous config saved to /var/cache/conftool/dbconfig/20240131-094239-marostegui.json [09:42:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance [09:42:45] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [09:42:55] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2148.codfw.wmnet with reason: Maintenance [09:43:02] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2148 (T355609)', diff saved to https://phabricator.wikimedia.org/P55940 and previous config saved to /var/cache/conftool/dbconfig/20240131-094301-marostegui.json [09:47:42] !log ayounsi@cumin1002 START - Cookbook sre.ganeti.makevm for new host testvm2006.codfw.wmnet [09:47:43] !log 
ayounsi@cumin1002 START - Cookbook sre.dns.netbox [09:49:40] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - ayounsi@cumin1002" [09:50:30] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM testvm2006.codfw.wmnet - ayounsi@cumin1002" [09:50:31] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [09:50:31] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [09:50:34] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [09:51:00] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - ayounsi@cumin1002" [09:51:51] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM testvm2006.codfw.wmnet - ayounsi@cumin1002" [09:52:46] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be2003.codfw.wmnet [09:52:48] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T355609)', diff saved to https://phabricator.wikimedia.org/P55941 and previous config saved to /var/cache/conftool/dbconfig/20240131-095247-marostegui.json [09:52:57] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [09:53:01] !log mvernon@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host moss-be2003.codfw.wmnet [09:53:48] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be2003.codfw.wmnet [10:00:00] We seem to have a bit of a situation with many 
(all?) quibble jobs failing in CI: [⚓ T356247 CI test failure due to Git error](https://phabricator.wikimedia.org/T356247). I'm not sure what is going on. [10:00:01] T356247: CI test failure due to Git error - https://phabricator.wikimedia.org/T356247 [10:00:58] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be2003.codfw.wmnet [10:01:50] (03CR) 10MVernon: [C: 04-1] "Hi," [deployment-charts] - 10https://gerrit.wikimedia.org/r/994357 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans) [10:02:24] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host testvm2006.codfw.wmnet with OS bookworm [10:03:35] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be1001.eqiad.wmnet [10:04:14] (03PS1) 10Slyngshede: P:debmonitor::server client HEAD check requires certificates. [puppet] - 10https://gerrit.wikimedia.org/r/994672 [10:05:46] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1250/console" [puppet] - 10https://gerrit.wikimedia.org/r/994672 (owner: 10Slyngshede) [10:07:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P55942 and previous config saved to /var/cache/conftool/dbconfig/20240131-100754-marostegui.json [10:10:49] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1001.eqiad.wmnet [10:13:57] (03PS2) 10Slyngshede: P:debmonitor::server client HEAD check is on port 443. [puppet] - 10https://gerrit.wikimedia.org/r/994672 [10:14:37] (03PS3) 10Slyngshede: P:debmonitor::server client HEAD check is on port 443. 
[puppet] - 10https://gerrit.wikimedia.org/r/994672 [10:20:08] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reboot-single for host testreduce1002.eqiad.wmnet [10:20:35] !log cgoubert@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host testreduce1002.eqiad.wmnet [10:21:18] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reboot-single for host testreduce1002.eqiad.wmnet [10:22:19] (03PS2) 10Muehlenhoff: admin: absent maurelio from ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/994252 (https://phabricator.wikimedia.org/T356203) (owner: 10Dzahn) [10:23:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P55943 and previous config saved to /var/cache/conftool/dbconfig/20240131-102300-marostegui.json [10:23:08] (03CR) 10Muehlenhoff: "The entry in absent_ldap_users was missing, I've amended the patch and will merge after I've removed Marco's access." [puppet] - 10https://gerrit.wikimedia.org/r/994252 (https://phabricator.wikimedia.org/T356203) (owner: 10Dzahn) [10:24:44] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be1002.eqiad.wmnet [10:25:08] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testreduce1002.eqiad.wmnet [10:25:24] (03CR) 10Muehlenhoff: [C: 03+2] admin: absent maurelio from ldap_only admins [puppet] - 10https://gerrit.wikimedia.org/r/994252 (https://phabricator.wikimedia.org/T356203) (owner: 10Dzahn) [10:26:47] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Remove `maurelio` from the `ldap/nda` group - https://phabricator.wikimedia.org/T356203 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff >>! In T356203#9499899, @Dzahn wrote: > @ABran-WMF I uploaded a patch. Wanna review and take on the re... 
[10:29:01] !log stevemunene@cumin1002 START - Cookbook sre.hosts.reboot-single for host an-worker1157.eqiad.wmnet [10:30:28] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1002.eqiad.wmnet [10:30:32] !log ayounsi@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2006.codfw.wmnet with reason: host reimage [10:30:51] !log btullis@deploy2002 Started deploy [analytics/refinery@13f7a06] (hadoop-test): Ad-hoc deploy of refinery TEST for T354703 [analytics/refinery@13f7a06c] [10:30:56] T354703: analytics/refinery scap deploy on test cluster fails with permission error - https://phabricator.wikimedia.org/T354703 [10:30:57] !log btullis@deploy2002 Finished deploy [analytics/refinery@13f7a06] (hadoop-test): Ad-hoc deploy of refinery TEST for T354703 [analytics/refinery@13f7a06c] (duration: 00m 05s) [10:33:54] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2006.codfw.wmnet with reason: host reimage [10:34:25] (03CR) 10Effie Mouzeli: [C: 03+1] kubernetes: make 3 api_appservers kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/994669 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [10:35:12] !log mvernon@cumin2002 START - Cookbook sre.hosts.reboot-single for host moss-be1003.eqiad.wmnet [10:35:40] (03CR) 10Slyngshede: "Removing the CDN check for now. I suspect that the Icinga check may not actually check that the certificate matches the host name. 
The bla" [puppet] - 10https://gerrit.wikimedia.org/r/994672 (owner: 10Slyngshede) [10:35:49] !log btullis@deploy2002 Started deploy [analytics/refinery@13f7a06] (hadoop-test): Ad-hoc deploy of refinery TEST for T354703 [analytics/refinery@13f7a06c] [10:35:57] !log btullis@deploy2002 Finished deploy [analytics/refinery@13f7a06] (hadoop-test): Ad-hoc deploy of refinery TEST for T354703 [analytics/refinery@13f7a06c] (duration: 00m 07s) [10:36:02] T354703: analytics/refinery scap deploy on test cluster fails with permission error - https://phabricator.wikimedia.org/T354703 [10:36:47] !log stevemunene@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-worker1157.eqiad.wmnet [10:37:04] (03CR) 10Hnowlan: [C: 03+1] mobileapps: Fix MW core request template name [deployment-charts] - 10https://gerrit.wikimedia.org/r/994215 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [10:38:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T355609)', diff saved to https://phabricator.wikimedia.org/P55944 and previous config saved to /var/cache/conftool/dbconfig/20240131-103807-marostegui.json [10:38:09] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [10:38:13] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:38:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2170.codfw.wmnet with reason: Maintenance [10:38:30] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2170:3312 (T355609)', diff saved to https://phabricator.wikimedia.org/P55945 and previous config saved to /var/cache/conftool/dbconfig/20240131-103830-marostegui.json [10:39:42] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Fix MW core request template name [deployment-charts] - 10https://gerrit.wikimedia.org/r/994215 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) 
[10:40:27] (03CR) 10Ayounsi: [C: 03+1] postgres backups: add hard link for latest [puppet] - 10https://gerrit.wikimedia.org/r/994184 (https://phabricator.wikimedia.org/T316655) (owner: 10Volans) [10:40:45] (03Merged) 10jenkins-bot: mobileapps: Fix MW core request template name [deployment-charts] - 10https://gerrit.wikimedia.org/r/994215 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [10:40:56] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moss-be1003.eqiad.wmnet [10:41:37] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:41:42] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:42:27] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Traffic, 10Wikimedia-Site-requests: Changing default image thumbnail size on English Wikipedia - https://phabricator.wikimedia.org/T355914 (10Joe) Given the chosen size is both non-standard (meaning it's not used on most large wikis) and not in the list o... 
[10:42:44] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [10:43:19] (03CR) 10Clément Goubert: [C: 03+2] kubernetes: make 3 api_appservers kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/994669 (https://phabricator.wikimedia.org/T351074) (owner: 10Clément Goubert) [10:43:21] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [10:43:36] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:46:39] !log ayounsi@cumin1002 START - Cookbook sre.dns.netbox [10:48:08] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1425.eqiad.wmnet with OS bullseye [10:48:32] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1423.eqiad.wmnet with OS bullseye [10:48:51] !log cgoubert@cumin2002 START - Cookbook sre.hosts.reimage for host mw1424.eqiad.wmnet with OS bullseye [10:49:36] !log ayounsi@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: codfw routed cluster tap - ayounsi@cumin1002" [10:49:36] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T355609)', diff saved to https://phabricator.wikimedia.org/P55946 and previous config saved to /var/cache/conftool/dbconfig/20240131-104936-marostegui.json [10:49:45] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [10:51:24] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: codfw routed cluster tap - ayounsi@cumin1002" [10:51:24] !log ayounsi@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:53:28] !log ayounsi@cumin1002 START - Cookbook sre.dns.wipe-cache testvm2006.codfw.wmnet on all recursors [10:53:32] !log ayounsi@cumin1002 END (PASS) 
- Cookbook sre.dns.wipe-cache (exit_code=0) testvm2006.codfw.wmnet on all recursors [10:53:50] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.6 - https://phabricator.wikimedia.org/T316421 (10Jelto) To build the new Debian package for etherpad 1.9.6 I need access to the `packaging` wmcs project. According to [openstack explorer](https://openstack-browser.... [10:55:09] (03PS1) 10Jgiannelos: mobileapps: Disable trace logs on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/994676 [10:56:28] (03PS1) 10Clément Goubert: mw-web, mw-api-ext: Raise replicas for 35% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/994677 (https://phabricator.wikimedia.org/T355532) [10:57:56] (03PS1) 10Clément Goubert: trafficserver: move 35% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/994679 (https://phabricator.wikimedia.org/T355532) [10:58:50] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-base-images.service,docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240131T1100) [11:00:43] (03CR) 10Clément Goubert: [C: 03+1] mobileapps: Disable trace logs on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/994676 (owner: 10Jgiannelos) [11:01:02] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Disable trace logs on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/994676 (owner: 10Jgiannelos) [11:01:52] (03Merged) 10jenkins-bot: mobileapps: Disable trace logs on staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/994676 (owner: 10Jgiannelos) [11:01:57] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1425.eqiad.wmnet with reason: host reimage [11:02:29] !log cgoubert@cumin2002 START - 
Cookbook sre.hosts.downtime for 2:00:00 on mw1423.eqiad.wmnet with reason: host reimage [11:02:37] !log cgoubert@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1424.eqiad.wmnet with reason: host reimage [11:04:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P55947 and previous config saved to /var/cache/conftool/dbconfig/20240131-110442-marostegui.json [11:05:09] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1425.eqiad.wmnet with reason: host reimage [11:05:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:07:59] ^on it [11:08:00] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1423.eqiad.wmnet with reason: host reimage [11:10:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [11:11:10] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1424.eqiad.wmnet with reason: host reimage [11:11:16] PROBLEM - Check systemd state on mwdebug1001 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:12:22] RECOVERY - Check systemd state on 
mwdebug1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:15:54] PROBLEM - Check systemd state on kubemaster1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:19:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P55948 and previous config saved to /var/cache/conftool/dbconfig/20240131-111949-marostegui.json [11:23:06] moritzm: ping for T354855 on kubemaster1002 ^ [11:23:06] T354855: ferm sometimes fails to restart on Kubernetes workers via xtables lock held by kube-proxy - https://phabricator.wikimedia.org/T354855 [11:24:15] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1425.eqiad.wmnet with OS bullseye [11:25:21] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 2 others: Changing default image thumbnail size on English Wikipedia - https://phabricator.wikimedia.org/T355914 (10MatthewVernon) [11:26:07] claime: ack, thanks. 
I've taken some notes on how to detect this and restarted ferm there [11:26:08] RECOVERY - Check systemd state on kubemaster1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:19] ty <3 [11:26:52] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1423.eqiad.wmnet with OS bullseye [11:26:59] (03CR) 10Muehlenhoff: [C: 03+2] Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/993065 (https://phabricator.wikimedia.org/T349936) (owner: 10Muehlenhoff) [11:27:05] As far as I can tell it's the first time it's happened on a control plane node [11:27:07] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, seem sensible" [puppet] - 10https://gerrit.wikimedia.org/r/994672 (owner: 10Slyngshede) [11:27:19] s/tell/remember/ would be more accurate [11:27:22] I haven't checked [11:27:24] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host titan1001.eqiad.wmnet [11:27:56] !log ayounsi@cumin1002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host testvm2006.codfw.wmnet with OS bookworm [11:27:56] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=97) for new host testvm2006.codfw.wmnet [11:28:05] (03CR) 10Slyngshede: [C: 03+2] P:debmonitor::server client HEAD check is on port 443. 
[puppet] - 10https://gerrit.wikimedia.org/r/994672 (owner: 10Slyngshede) [11:29:47] !log cgoubert@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1424.eqiad.wmnet with OS bullseye [11:30:38] (03PS1) 10Muehlenhoff: Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/994683 (https://phabricator.wikimedia.org/T354959) [11:34:13] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan1001.eqiad.wmnet [11:34:22] (ProbeDown) firing: (14) Service debmonitor1002:443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:34:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T355609)', diff saved to https://phabricator.wikimedia.org/P55949 and previous config saved to /var/cache/conftool/dbconfig/20240131-113456-marostegui.json [11:34:58] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2175.codfw.wmnet with reason: Maintenance [11:35:03] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [11:35:12] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2175.codfw.wmnet with reason: Maintenance [11:35:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2175 (T355609)', diff saved to https://phabricator.wikimedia.org/P55950 and previous config saved to /var/cache/conftool/dbconfig/20240131-113518-marostegui.json [11:35:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard2003.codfw.wmnet [11:37:39] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker[1157-1175].eqiad.wmnet [11:38:23] !log stevemunene@cumin1002 START - Cookbook sre.hadoop.init-hadoop-workers for hosts 
an-worker[1157-1175].eqiad.wmnet [11:38:34] (03CR) 10Hnowlan: [C: 03+1] "🚀" [puppet] - 10https://gerrit.wikimedia.org/r/994679 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert) [11:38:54] !log stevemunene@cumin1002 END (FAIL) - Cookbook sre.hadoop.init-hadoop-workers (exit_code=99) for hosts an-worker[1157-1175].eqiad.wmnet [11:39:25] (03CR) 10Muehlenhoff: [C: 03+2] hadoop:httpd: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/993075 (owner: 10Muehlenhoff) [11:39:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard2003.codfw.wmnet [11:40:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host puppetboard1003.eqiad.wmnet [11:41:26] (03CR) 10Slyngshede: "It's not particularly well documented, I had this to go by: https://github.com/prometheus/node_exporter/blob/master/docs/TIME.md#timex-col" [puppet] - 10https://gerrit.wikimedia.org/r/994172 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:42:01] (03CR) 10Slyngshede: [C: 03+2] P:url_downloader absent Icinga check. [puppet] - 10https://gerrit.wikimedia.org/r/994170 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:42:10] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 2 others: Changing default image thumbnail size on English Wikipedia - https://phabricator.wikimedia.org/T355914 (10MatthewVernon) I think the main issue is likely that we'll melt Thumbor if we just switch enwiki to 250, because 250 isn't a pre... [11:44:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host puppetboard1003.eqiad.wmnet [11:46:06] (03CR) 10Alexandros Kosiaris: "Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/994196 (https://phabricator.wikimedia.org/T266216) (owner: 10Alexandros Kosiaris) [11:46:09] (03CR) 10Alexandros Kosiaris: [C: 03+2] linkrecommendation: Bump memory limit by 200Mi [deployment-charts] - 10https://gerrit.wikimedia.org/r/994196 (https://phabricator.wikimedia.org/T266216) (owner: 10Alexandros Kosiaris) [11:46:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T355609)', diff saved to https://phabricator.wikimedia.org/P55951 and previous config saved to /var/cache/conftool/dbconfig/20240131-114643-marostegui.json [11:46:49] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [11:48:08] (03Merged) 10jenkins-bot: linkrecommendation: Bump memory limit by 200Mi [deployment-charts] - 10https://gerrit.wikimedia.org/r/994196 (https://phabricator.wikimedia.org/T266216) (owner: 10Alexandros Kosiaris) [11:50:42] (ProbeDown) firing: (18) Service debmonitor1002:443 has failed probes (http_debmonitor_discovery_wmnet_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:51:11] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host people1004.eqiad.wmnet [11:52:35] (03PS1) 10Hashar: ci: prune tags when updating git mirrors [puppet] - 10https://gerrit.wikimedia.org/r/994685 (https://phabricator.wikimedia.org/T252310) [11:54:22] (ProbeDown) firing: (16) Service debmonitor1002:443 has failed probes (http_debmonitor_discovery_wmnet_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:57:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1005.wikimedia.org [11:57:11] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single 
(exit_code=0) for host people1004.eqiad.wmnet [11:59:22] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host planet2003.codfw.wmnet [12:00:01] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dbstore1008.eqiad.wmnet [12:00:34] (03PS1) 10Slyngshede: P::installserver::proxy Blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/994686 (https://phabricator.wikimedia.org/T350694) [12:00:51] (03CR) 10Hnowlan: [C: 03+1] mw-web, mw-api-ext: Raise replicas for 35% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/994677 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert) [12:00:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1005.wikimedia.org [12:01:11] (03PS2) 10Hashar: ci: fetch tags for git mirrors [puppet] - 10https://gerrit.wikimedia.org/r/994685 (https://phabricator.wikimedia.org/T252310) [12:01:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P55952 and previous config saved to /var/cache/conftool/dbconfig/20240131-120150-marostegui.json [12:02:25] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica1006.wikimedia.org [12:03:21] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host planet2003.codfw.wmnet [12:03:30] !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/linkrecommendation: apply [12:04:15] !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/linkrecommendation: apply [12:04:41] !log akosiaris@deploy2002 helmfile [staging] START helmfile.d/services/linkrecommendation: apply [12:04:51] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host stewards1001.eqiad.wmnet [12:04:54] !log akosiaris@deploy2002 helmfile [staging] DONE helmfile.d/services/linkrecommendation: apply [12:04:56] (03CR) 10Clément Goubert: [C: 03+2] 
mw-web, mw-api-ext: Raise replicas for 35% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/994677 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert) [12:05:28] (03CR) 10CI reject: [V: 04-1] P::installserver::proxy Blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/994686 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [12:05:48] (03Merged) 10jenkins-bot: mw-web, mw-api-ext: Raise replicas for 35% traffic [deployment-charts] - 10https://gerrit.wikimedia.org/r/994677 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert) [12:05:51] !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/linkrecommendation: apply [12:06:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica1006.wikimedia.org [12:06:40] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-ext: apply [12:06:42] !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/linkrecommendation: apply [12:06:54] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-ext: apply [12:06:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica2007.wikimedia.org [12:07:03] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-ext: apply [12:07:11] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-ext: apply [12:08:27] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-web: apply [12:08:36] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stewards1001.eqiad.wmnet [12:09:05] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host stewards2001.codfw.wmnet [12:10:30] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-web: apply [12:10:36] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-web: apply [12:10:43] !log cgoubert@deploy2002 
helmfile [codfw] DONE helmfile.d/services/mw-web: apply [12:10:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2007.wikimedia.org [12:10:59] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ldap-replica2008.wikimedia.org [12:11:45] !log btullis@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dbstore1008.eqiad.wmnet [12:11:59] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 2 others: Changing default image thumbnail size on English Wikipedia - https://phabricator.wikimedia.org/T355914 (10TheDJ) We could have MW detect thumbnail transclusions with the default size, have a configuration setting for a 1/x ratio, then... [12:12:10] !log btullis@cumin1002 START - Cookbook sre.hosts.reboot-single for host dbstore1009.eqiad.wmnet [12:13:05] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stewards2001.codfw.wmnet [12:13:57] !log Raising external traffic to mw-on-k8s to 35% - T355532 [12:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:03] T355532: Move 40% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T355532 [12:14:16] slyngs: fabfur ^^ head's up [12:14:53] (03CR) 10Clément Goubert: [C: 03+2] trafficserver: move 35% of traffic to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/994679 (https://phabricator.wikimedia.org/T355532) (owner: 10Clément Goubert) [12:14:59] Thanks, and good luck. 
It's going to be great :-) [12:15:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ldap-replica2008.wikimedia.org [12:15:28] ty :) [12:16:22] (03PS1) 10Alexandros Kosiaris: KubeletOperationalLatency: Bump Operational Latencies for to 15m [alerts] - 10https://gerrit.wikimedia.org/r/994690 [12:16:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host apt-staging2001.codfw.wmnet [12:16:57] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P55953 and previous config saved to /var/cache/conftool/dbconfig/20240131-121656-marostegui.json [12:17:18] (03PS1) 10Slyngshede: P:debmonitor::server Do not embed the fqdn of an instance in rules. [puppet] - 10https://gerrit.wikimedia.org/r/994691 [12:17:54] (03CR) 10CI reject: [V: 04-1] KubeletOperationalLatency: Bump Operational Latencies for to 15m [alerts] - 10https://gerrit.wikimedia.org/r/994690 (owner: 10Alexandros Kosiaris) [12:19:26] `analytics-mysql wikidatawiki` on stat1007 isn’t working for me? 
[12:19:26] ERROR 2002 (HY000): Can't connect to MySQL server on 'dbstore1009.eqiad.wmnet' (115) [12:20:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:20:28] ok, now it’s working again [12:20:42] (ProbeDown) firing: (4) Service debmonitor2002:443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:20:55] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host apt-staging2001.codfw.wmnet [12:22:45] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon2002.wikimedia.org [12:23:29] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Move 40% of mediawiki external requests to mw on k8s - https://phabricator.wikimedia.org/T355532 (10Clement_Goubert) [12:23:55] (03PS2) 10Alexandros Kosiaris: KubeletOperationalLatency: Bump Operational Latencies for to 15m [alerts] - 10https://gerrit.wikimedia.org/r/994690 [12:24:22] (ProbeDown) resolved: (4) Service debmonitor2002:443 has failed probes (http_debmonitor_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Debmonitor - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:24:42] !log btullis@cumin1002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host dbstore1009.eqiad.wmnet [12:24:51] PROBLEM - Check systemd state on dbstore1009 is CRITICAL: CRITICAL - degraded: The 
following units failed: mariadb.service,prometheus-mysqld-exporter.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:25:02] (03CR) 10CI reject: [V: 04-1] KubeletOperationalLatency: Bump Operational Latencies for to 15m [alerts] - 10https://gerrit.wikimedia.org/r/994690 (owner: 10Alexandros Kosiaris) [12:25:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [12:25:54] (03PS1) 10Clément Goubert: mw-api-int: Raise replicas to 160 [deployment-charts] - 10https://gerrit.wikimedia.org/r/994694 [12:28:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netmon2002.wikimedia.org [12:28:28] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on netmon2002:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:30:33] (03Abandoned) 10Clément Goubert: mw-api-int: bump replicas before moving wikifeeds [deployment-charts] - 10https://gerrit.wikimedia.org/r/961046 (owner: 10Giuseppe Lavagetto) [12:31:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netmon1003.wikimedia.org [12:32:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T355609)', diff saved to https://phabricator.wikimedia.org/P55954 and previous config saved to /var/cache/conftool/dbconfig/20240131-123203-marostegui.json [12:32:05] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db2189.codfw.wmnet with reason: Maintenance [12:32:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db2189.codfw.wmnet 
with reason: Maintenance [12:32:19] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [12:32:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2189 (T355609)', diff saved to https://phabricator.wikimedia.org/P55955 and previous config saved to /var/cache/conftool/dbconfig/20240131-123224-marostegui.json [12:33:28] (03PS3) 10Alexandros Kosiaris: KubeletOperationalLatency: Bump Operational Latencies for to 15m [alerts] - 10https://gerrit.wikimedia.org/r/994690 [12:34:52] (03CR) 10Alexandros Kosiaris: [C: 03+2] KubeletOperationalLatency: Bump Operational Latencies for to 15m [alerts] - 10https://gerrit.wikimedia.org/r/994690 (owner: 10Alexandros Kosiaris) [12:35:29] Hmm I'm getting 500s on sal.toolforge.org :( [12:35:48] (PuppetZeroResources) firing: Puppet has failed generate resources on testvm2006:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources [12:35:59] (03Merged) 10jenkins-bot: KubeletOperationalLatency: Bump Operational Latencies for to 15m [alerts] - 10https://gerrit.wikimedia.org/r/994690 (owner: 10Alexandros Kosiaris) [12:37:34] (03CR) 10Hnowlan: [C: 03+1] mw-api-int: Raise replicas to 160 [deployment-charts] - 10https://gerrit.wikimedia.org/r/994694 (owner: 10Clément Goubert) [12:42:47] (03CR) 10Clément Goubert: [C: 03+2] mw-api-int: Raise replicas to 160 [deployment-charts] - 10https://gerrit.wikimedia.org/r/994694 (owner: 10Clément Goubert) [12:42:50] !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host netmon1003.wikimedia.org [12:43:06] PROBLEM - Check systemd state on netmon1003 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:43:28] (KeyholderUnarmed) firing: (2) 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:43:38] (03Merged) 10jenkins-bot: mw-api-int: Raise replicas to 160 [deployment-charts] - 10https://gerrit.wikimedia.org/r/994694 (owner: 10Clément Goubert) [12:44:10] !log cgoubert@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-api-int: apply [12:44:22] !log cgoubert@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-api-int: apply [12:44:31] !log cgoubert@deploy2002 helmfile [codfw] START helmfile.d/services/mw-api-int: apply [12:44:43] !log cgoubert@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-api-int: apply [12:45:23] (03PS1) 10Slyngshede: C:samplicator Icinga monitoring is not required. [puppet] - 10https://gerrit.wikimedia.org/r/994698 (https://phabricator.wikimedia.org/T350694) [12:45:25] (03CR) 10Ladsgroup: [C: 03+1] db-production.php: Disable writes on es5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994611 (https://phabricator.wikimedia.org/T356235) (owner: 10Marostegui) [12:46:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T355609)', diff saved to https://phabricator.wikimedia.org/P55956 and previous config saved to /var/cache/conftool/dbconfig/20240131-124623-marostegui.json [12:46:30] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [12:48:29] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow1002.eqiad.wmnet [12:53:28] (KeyholderUnarmed) resolved: (2) 1 unarmed Keyholder key(s) on netmon1003:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:53:46] (03PS1) 10Bartosz Dziewoński: Add an exception for ConvenientDiscussions-style permalinks [extensions/DiscussionTools] 
(wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994708 (https://phabricator.wikimedia.org/T349653) [12:53:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [12:54:10] (03PS1) 10Bartosz Dziewoński: Add an exception for ConvenientDiscussions-style permalinks [extensions/DiscussionTools] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994709 (https://phabricator.wikimedia.org/T349653) [12:54:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow1002.eqiad.wmnet [12:54:39] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/994691 (owner: 10Slyngshede) [12:55:57] (03CR) 10Slyngshede: [C: 03+2] P:debmonitor::server Do not embed the fqdn of an instance in rules. [puppet] - 10https://gerrit.wikimedia.org/r/994691 (owner: 10Slyngshede) [12:57:49] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow2003.codfw.wmnet [12:58:09] !log filippo@cumin1002 START - Cookbook sre.hosts.reboot-single for host titan1002.eqiad.wmnet [12:59:12] RECOVERY - Check systemd state on netmon1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:01:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P55957 and previous config saved to /var/cache/conftool/dbconfig/20240131-130130-marostegui.json [13:01:54] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow2003.codfw.wmnet [13:03:08] (03PS1) 10Bartosz Dziewoński: index.php: Restore support for forcesafemode option. 
[core] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994710 (https://phabricator.wikimedia.org/T355314) [13:03:13] (03PS1) 10Bartosz Dziewoński: index.php: Restore support for forcesafemode option. [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994711 (https://phabricator.wikimedia.org/T355314) [13:04:40] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow3003.esams.wmnet [13:04:58] !log filippo@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host titan1002.eqiad.wmnet [13:08:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow3003.esams.wmnet [13:09:01] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow4002.ulsfo.wmnet [13:14:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow4002.ulsfo.wmnet [13:14:46] RECOVERY - Check systemd state on dbstore1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow5002.eqsin.wmnet [13:16:37] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P55959 and previous config saved to /var/cache/conftool/dbconfig/20240131-131637-marostegui.json [13:19:55] (03PS1) 10Daimona Eaytoy: [metawiki] Rename the campaignevents-beta-tester group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994701 (https://phabricator.wikimedia.org/T356070) [13:19:56] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ayounsi) Current status, ignoring IPv6 for now. The cluster VIP is dynamically announced from the primary cluster node. Limitation from `isc-dhcp... 
[13:22:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow5002.eqsin.wmnet [13:23:32] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host netflow6001.drmrs.wmnet [13:24:51] (03PS1) 10Daimona Eaytoy: [metawiki] Let admins add/remove the event-organizer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994702 (https://phabricator.wikimedia.org/T356070) [13:27:01] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/991806 (owner: 10PipelineBot) [13:27:05] (03Abandoned) 10Jgiannelos: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/992426 (owner: 10PipelineBot) [13:27:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host netflow6001.drmrs.wmnet [13:30:28] (03PS1) 10Daimona Eaytoy: beta: Update for campaignevents-beta-tester group rename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994705 (https://phabricator.wikimedia.org/T356070) [13:31:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T355609)', diff saved to https://phabricator.wikimedia.org/P55960 and previous config saved to /var/cache/conftool/dbconfig/20240131-133143-marostegui.json [13:31:49] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [13:36:51] hi MatmaRex! Looking at the window's schedule, and it looks like we're back. 
Just checking in advance whether you'll be around, as it makes sense to +2 now to save a bit of CI time :) [13:38:30] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (if SSH key has been validated out-of-band)" [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [13:39:11] (03CR) 10Arnaudb: [C: 03+2] admin: add amastilovic to analytics-privatedata [puppet] - 10https://gerrit.wikimedia.org/r/993170 (https://phabricator.wikimedia.org/T355606) (owner: 10AOkoth) [13:40:33] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994701 (https://phabricator.wikimedia.org/T356070) (owner: 10Daimona Eaytoy) [13:40:49] (03CR) 10Urbanecm: [C: 03+1] "note: needs migrateUserGroup.php to run" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994701 (https://phabricator.wikimedia.org/T356070) (owner: 10Daimona Eaytoy) [13:41:07] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994702 (https://phabricator.wikimedia.org/T356070) (owner: 10Daimona Eaytoy) [13:41:16] (03PS1) 10Marostegui: wmnet: Update CNAME for es5 [dns] - 10https://gerrit.wikimedia.org/r/994730 (https://phabricator.wikimedia.org/T356235) [13:41:18] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994705 (https://phabricator.wikimedia.org/T356070) (owner: 10Daimona Eaytoy) [13:41:28] (03CR) 10Marostegui: [C: 04-2] "Not yet" [dns] - 10https://gerrit.wikimedia.org/r/994730 (https://phabricator.wikimedia.org/T356235) (owner: 10Marostegui) [13:43:02] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting analytics-privatedata-users access for amastilovic - https://phabricator.wikimedia.org/T355606 (10ABran-WMF) 05Open→03Resolved Just [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/993170 | merged ]] :-) For the general information, it... 
[13:43:41] (03PS1) 10Marostegui: mariadb: Switchover es5 master [puppet] - 10https://gerrit.wikimedia.org/r/994731 (https://phabricator.wikimedia.org/T356235) [13:44:09] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host planet1003.eqiad.wmnet [13:44:33] (03CR) 10Marostegui: [C: 04-2] "Not yet" [puppet] - 10https://gerrit.wikimedia.org/r/994731 (https://phabricator.wikimedia.org/T356235) (owner: 10Marostegui) [13:45:00] (03PS1) 10Urbanecm: testwiki: Temporarily change default value for 4 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994732 (https://phabricator.wikimedia.org/T353225) [13:45:52] (03CR) 10Urbanecm: [C: 03+2] testwiki: Temporarily change default value for 4 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994732 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm) [13:46:06] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994732 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm) [13:46:32] (03Merged) 10jenkins-bot: testwiki: Temporarily change default value for 4 Echo properties [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994732 (https://phabricator.wikimedia.org/T353225) (owner: 10Urbanecm) [13:48:03] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:994732|testwiki: Temporarily change default value for 4 Echo properties (T353225)]] [13:48:08] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host planet1003.eqiad.wmnet [13:48:09] T353225: Echo: Make use of conditional user defaults - https://phabricator.wikimedia.org/T353225 [13:48:37] urbanecm: hi, yes. 
that's a good idea [13:48:48] !log jelto@cumin1002 START - Cookbook sre.hosts.reboot-single for host people2003.codfw.wmnet [13:48:54] (03CR) 10Urbanecm: [C: 03+2] decodeURI fragments before sending them to discussiontoolsfindcomment [extensions/DiscussionTools] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994234 (https://phabricator.wikimedia.org/T356199) (owner: 10DLynch) [13:48:56] (03CR) 10Urbanecm: [C: 03+2] decodeURI fragments before sending them to discussiontoolsfindcomment [extensions/DiscussionTools] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994235 (https://phabricator.wikimedia.org/T356199) (owner: 10DLynch) [13:48:58] (03CR) 10Urbanecm: [C: 03+2] Add an exception for ConvenientDiscussions-style permalinks [extensions/DiscussionTools] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994708 (https://phabricator.wikimedia.org/T349653) (owner: 10Bartosz Dziewoński) [13:49:00] (03CR) 10Urbanecm: [C: 03+2] Add an exception for ConvenientDiscussions-style permalinks [extensions/DiscussionTools] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994709 (https://phabricator.wikimedia.org/T349653) (owner: 10Bartosz Dziewoński) [13:49:02] (03CR) 10Urbanecm: [C: 03+2] index.php: Restore support for forcesafemode option. [core] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994710 (https://phabricator.wikimedia.org/T355314) (owner: 10Bartosz Dziewoński) [13:49:05] (03CR) 10Urbanecm: [C: 03+2] index.php: Restore support for forcesafemode option. 
[core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994711 (https://phabricator.wikimedia.org/T355314) (owner: 10Bartosz Dziewoński) [13:49:10] started the jobs then :) [13:49:33] CI has been flaky this morning, i hope that's resolved now [13:49:56] i think it was, but not sure [13:49:59] we'll see [13:50:09] thanks for the heads-up though [13:51:53] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:994732|testwiki: Temporarily change default value for 4 Echo properties (T353225)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:54:03] (03CR) 10Daniel Kinzler: Configure parser cache filters for parsoid-pcache (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994212 (https://phabricator.wikimedia.org/T346765) (owner: 10Daniel Kinzler) [13:54:13] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350 (10Papaul) @Marostegui yes we will put some in row C and D as well. Just the once in row A and B will be connected to 10G is has 10G NIC. Thanks [13:54:33] 10SRE, 10ops-codfw, 10DC-Ops, 10Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350 (10Marostegui) Then no problem at all! 
Thanks [13:54:51] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host people2003.codfw.wmnet [13:55:23] (03Merged) 10jenkins-bot: decodeURI fragments before sending them to discussiontoolsfindcomment [extensions/DiscussionTools] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994234 (https://phabricator.wikimedia.org/T356199) (owner: 10DLynch) [13:55:54] (03PS2) 10Slyngshede: P::installserver::proxy Blackbox monitoring [puppet] - 10https://gerrit.wikimedia.org/r/994686 (https://phabricator.wikimedia.org/T350694) [13:56:32] (03Merged) 10jenkins-bot: Add an exception for ConvenientDiscussions-style permalinks [extensions/DiscussionTools] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994708 (https://phabricator.wikimedia.org/T349653) (owner: 10Bartosz Dziewoński) [13:56:46] (03Merged) 10jenkins-bot: decodeURI fragments before sending them to discussiontoolsfindcomment [extensions/DiscussionTools] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994235 (https://phabricator.wikimedia.org/T356199) (owner: 10DLynch) [13:56:49] (03Merged) 10jenkins-bot: Add an exception for ConvenientDiscussions-style permalinks [extensions/DiscussionTools] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994709 (https://phabricator.wikimedia.org/T349653) (owner: 10Bartosz Dziewoński) [13:56:50] that was quick [13:57:21] 10SRE-Sprint-Week-Sustainability-March2023, 10ChangeProp, 10Prod-Kubernetes, 10serviceops, and 2 others: Raise an alarm on container restarts/OOMs in kubernetes - https://phabricator.wikimedia.org/T256256 (10akosiaris) 05Open→03Resolved Patches reviewed and merged, I had some followup patches in T26621... [13:58:23] (03PS1) 10Arnaudb: admin: update approvers for analytics-priv... 
[puppet] - 10https://gerrit.wikimedia.org/r/994379 (https://phabricator.wikimedia.org/T356132) [13:59:56] jesus fucking christ https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240131T1400 [14:00:02] when did we remove the “max 6 patches” anyway? [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: It is that lovely time of the day again! You are hereby commanded to deploy UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240131T1400). [14:00:05] Daimona and MatmaRex: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:12] o/ [14:00:16] i'll deploy [14:00:20] :D [14:00:48] !log urbanecm@deploy2002 urbanecm: Continuing with sync [14:00:59] i think someone removed "max 6 patches" because no one ever respected that. neither the requesters nor the deployers [14:01:17] oh. we _literally_ removed it. :D [14:01:23] We're trying to set a new record here. [14:01:28] I mean, I took it into account [14:01:41] and six patches was still pushing it, a window with six patches would tend to overrun in my experience [14:01:55] I did a while back, since no-one paid any attention into that and `scap backport` makes a sync the time unit instead of a patch [14:01:58] removing the limit doesn’t magically make deployments go faster [14:01:59] But yeah, I don't expect all patches to make it. [14:02:01] o/ [14:02:08] i tend to do patches in parallel, which helps a lot [14:02:19] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/994379 (https://phabricator.wikimedia.org/T356132) (owner: 10Arnaudb) [14:02:50] (03CR) 10Arnaudb: [C: 03+2] admin: update approvers for analytics-priv... 
[puppet] - 10https://gerrit.wikimedia.org/r/994379 (https://phabricator.wikimedia.org/T356132) (owner: 10Arnaudb) [14:03:04] (03PS1) 10Slyngshede: D:uwsgi::app Allow disabling of monitoring. [puppet] - 10https://gerrit.wikimedia.org/r/994735 (https://phabricator.wikimedia.org/T350694) [14:03:07] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Additional approvers for analytics-privatedata-users - https://phabricator.wikimedia.org/T356132 (10ABran-WMF) p:05Triage→03Medium a:03ABran-WMF [14:03:25] (03PS2) 10Urbanecm: [metawiki] Rename the campaignevents-beta-tester group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994701 (https://phabricator.wikimedia.org/T356070) (owner: 10Daimona Eaytoy) [14:03:30] (03CR) 10Urbanecm: [C: 03+2] [metawiki] Rename the campaignevents-beta-tester group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994701 (https://phabricator.wikimedia.org/T356070) (owner: 10Daimona Eaytoy) [14:03:41] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Additional approvers for analytics-privatedata-users - https://phabricator.wikimedia.org/T356132 (10Reedy) [14:04:08] Out of curiosity, what are the current main bottlenecks? CI? Scap? [14:04:39] it really was a silly rule IMO. what if there's more than 6 bugs that need to be fixed? should i sit around for 7 hours waiting for the next deploy window? should i ask someone from WMF to schedule a separate deploy window? (they never like that.) (what if i'm not a WMF employee?) 
[14:04:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [14:04:51] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Additional approvers for analytics-privatedata-users - https://phabricator.wikimedia.org/T356132 (10ABran-WMF) 05Open→03Resolved Patch was just [[ https://gerrit.wikimedia.org/r/c/operations/puppet/+/994379 | merged ]] [14:04:54] (03PS1) 10Jgiannelos: mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/994736 [14:04:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1139.eqiad.wmnet with reason: Maintenance [14:04:57] https://wikitech.wikimedia.org/wiki/Backport_windows still has the limit of 6 patches [14:04:58] now it's just "may or may not be deployed at the sole discretion of the deployer", which is how it was in practice anyway [14:05:05] (though without any of the RFC 2119 magic words) [14:05:25] I guess I can make my own rule that I’ll refuse to deploy windows that have been so ludicrously filled [14:05:33] because I seriously do not appreciate it and would like the limit back [14:06:09] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1251/co" [puppet] - 10https://gerrit.wikimedia.org/r/994735 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:06:16] Daimona: scap can take about 10 minutes, CI can take 5-15 minutes (in my experience as a client :) not a deployer) [14:06:22] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/994736 (owner: 10Jgiannelos) [14:06:51] Lucas_WMDE: i think that's completely fair [14:07:13] (03Merged) 10jenkins-bot: index.php: Restore support for forcesafemode option.
[core] (wmf/1.42.0-wmf.15) - 10https://gerrit.wikimedia.org/r/994710 (https://phabricator.wikimedia.org/T355314) (owner: 10Bartosz Dziewoński) [14:07:26] (03Merged) 10jenkins-bot: mobileapps: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/994736 (owner: 10Jgiannelos) [14:07:45] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:994732|testwiki: Temporarily change default value for 4 Echo properties (T353225)]] (duration: 19m 37s) [14:07:46] T353225: Echo: Make use of conditional user defaults - https://phabricator.wikimedia.org/T353225 [14:07:47] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994701 (https://phabricator.wikimedia.org/T356070) (owner: 10Daimona Eaytoy) [14:07:55] MatmaRex: thanks. Then it's both. I was trying to understand the impact that faster CI would have on deployments, which TBH is something I've never thought of before. [14:07:57] (03CR) 10Slyngshede: D:uwsgi::app Allow disabling of monitoring. [puppet] - 10https://gerrit.wikimedia.org/r/994735 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:07:59] (03CR) 10Arnaudb: [C: 03+1] mariadb: Switchover es5 master [puppet] - 10https://gerrit.wikimedia.org/r/994731 (https://phabricator.wikimedia.org/T356235) (owner: 10Marostegui) [14:08:15] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10Volans) Indeed the `dhcrelay` not working as expected is a bit annoying also because if we run a dhcrelay for each VM, we'd need to hook also at VM... 
[14:08:18] PROBLEM - Docker registry HTTPS interface on registry1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Docker [14:08:28] (03Merged) 10jenkins-bot: [metawiki] Rename the campaignevents-beta-tester group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994701 (https://phabricator.wikimedia.org/T356070) (owner: 10Daimona Eaytoy) [14:08:36] (03CR) 10Arnaudb: [C: 03+1] wmnet: Update CNAME for es5 [dns] - 10https://gerrit.wikimedia.org/r/994730 (https://phabricator.wikimedia.org/T356235) (owner: 10Marostegui) [14:08:36] Daimona: CI time varies a lot by repo; config changes go a lot faster than mw core backports [14:08:41] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [14:08:44] (03CR) 10Volans: [C: 03+2] postgres backups: add hard link for latest [puppet] - 10https://gerrit.wikimedia.org/r/994184 (https://phabricator.wikimedia.org/T316655) (owner: 10Volans) [14:08:46] (03PS2) 10Slyngshede: D:uwsgi::app Allow disabling of monitoring. [puppet] - 10https://gerrit.wikimedia.org/r/994735 (https://phabricator.wikimedia.org/T350694) [14:08:55] Daimona: for me, it's easier to work with slower CI, as i can control when CI starts, and it can run in parallel. scap backport is one at a time. [14:08:56] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:994234|decodeURI fragments before sending them to discussiontoolsfindcomment (T356199)]], [[gerrit:994235|decodeURI fragments before sending them to discussiontoolsfindcomment (T356199)]], [[gerrit:994708|Add an exception for ConvenientDiscussions-style permalinks (T349653)]], [[gerrit:994709|Add an exception for ConvenientDiscussions-style permalinks (T349653)] [14:08:56] ], [[gerrit:994710|index.php: Restore support for forcesafemode option. 
(T355314)]], [[gerrit:994701|[metawiki] Rename the campaignevents-beta-tester group (T356070)]] [14:08:57] some extensions also have blessedly short CI times, while Wikibase takes a long time too [14:09:03] T356199: Diacritics in talk pages permalinks cause comments to not being found - https://phabricator.wikimedia.org/T356199 [14:09:03] T349653: Permalink redirecting: Support plain section links - https://phabricator.wikimedia.org/T349653 [14:09:04] T355314: [1.42.0-wmf.13] safemode is not propagated on pages - https://phabricator.wikimedia.org/T355314 [14:09:04] T356070: [EPIC] Change Management of Event Organizer Right - https://phabricator.wikimedia.org/T356070 [14:09:21] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [14:09:31] urbanecm: you're giving me https://xkcd.com/1172/ vibes :P [14:09:33] Lucas_WMDE: what would you say if i edited https://wikitech.wikimedia.org/wiki/Backport_windows to say something like "patches take about 15 minutes each to deploy; consider choosing a window that doesn't have too many patches schedule already" (instead of the "6 patches" thing) [14:09:36] RECOVERY - Docker registry HTTPS interface on registry1004 is OK: HTTP OK: HTTP/1.1 200 OK - 3746 bytes in 0.155 second response time https://wikitech.wikimedia.org/wiki/Docker [14:10:06] (03PS1) 10Alexandros Kosiaris: jaeger-ui: Allow reaching out to idp [deployment-charts] - 10https://gerrit.wikimedia.org/r/994737 [14:10:28] !log urbanecm@deploy2002 urbanecm and kemayo and matmarex and daimona: Backport for [[gerrit:994234|decodeURI fragments before sending them to discussiontoolsfindcomment (T356199)]], [[gerrit:994235|decodeURI fragments before sending them to discussiontoolsfindcomment (T356199)]], [[gerrit:994708|Add an exception for ConvenientDiscussions-style permalinks (T349653)]], [[gerrit:994709|Add an exception for ConvenientDiscuss [14:10:29] ions-style permalinks (T349653)]], [[gerrit:994710|index.php: Restore support for 
forcesafemode option. (T355314)]], [[gerrit:994701|[metawiki] Rename the campaignevents-beta-tester group (T356070)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:10:42] MatmaRex: please test most your backports at mwdebug :) [14:10:46] i think the last point on that page also isn't true: "Cherry-picking a patch to both release branches counts as 2 as they will be separate deployments." since scap-backport can kind of do them both at once? [14:10:48] looking [14:10:52] Yeah, right. My impression is that scap is the only bottleneck for config, whereas for backports it can be scap and/or CI depending on the repo. [14:11:08] But then I don't really know. [14:11:13] (https://gerrit.wikimedia.org/r/c/mediawiki/core/+/994711 isn't there yet, in process of merging) [14:11:21] urbanecm: that’s so many patches that the SAL starts getting truncated :S https://sal.toolforge.org/log/uCrbX40BxE1_1c7sYJwB [14:11:24] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:11:40] (though https://sal.toolforge.org/log/B9PZX40BhuQtenzv9yu7 was still complete even though it was already split up on my IRC client) [14:12:48] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:56] Daimona: group appears to be renamed from my end. feel free to test too. 
[14:12:56] urbanecm: changes look good [14:12:56] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:12:57] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [14:12:59] thanks [14:13:07] !log urbanecm@deploy2002 urbanecm and kemayo and matmarex and daimona: Continuing with sync [14:13:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1146.eqiad.wmnet with reason: Maintenance [14:13:18] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1146:3312 (T355609)', diff saved to https://phabricator.wikimedia.org/P55962 and previous config saved to /var/cache/conftool/dbconfig/20240131-141316-marostegui.json [14:13:30] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [14:13:45] MatmaRex: yeah, it’s kind of outdated… a lot comes down to how many patches the deployer is comfortable deploying together, I guess [14:14:17] urbanecm: I'm seeing the new name as well (CC cmelo, HouseOfM) [14:14:30] (03Merged) 10jenkins-bot: index.php: Restore support for forcesafemode option. [core] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994711 (https://phabricator.wikimedia.org/T355314) (owner: 10Bartosz Dziewoński) [14:14:52] yeah, and whether the patches are independent or not [14:14:55] great! i renamed the MW messages at meta, but they likely need to be reworded. [14:14:56] thank you Daimona [14:15:09] (that is, how annoying it will be if you have to revert the deployment) [14:15:32] you can revert a single patch out of multiple if needed :) [14:15:36] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 2 others: Changing default image thumbnail size on English Wikipedia - https://phabricator.wikimedia.org/T355914 (10Ladsgroup) This is much messier and much hairier than it looks. 
And yes, without proper preparation this will bring down everyth... [14:15:50] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:16:06] (03PS2) 10Urbanecm: [metawiki] Let admins add/remove the event-organizer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994702 (https://phabricator.wikimedia.org/T356070) (owner: 10Daimona Eaytoy) [14:16:10] (03CR) 10Urbanecm: [C: 03+2] [metawiki] Let admins add/remove the event-organizer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994702 (https://phabricator.wikimedia.org/T356070) (owner: 10Daimona Eaytoy) [14:16:55] (03Merged) 10jenkins-bot: [metawiki] Let admins add/remove the event-organizer group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994702 (https://phabricator.wikimedia.org/T356070) (owner: 10Daimona Eaytoy) [14:17:21] urbanecm: I'm updating the messages on meta. Can I delete the redirects? [14:17:29] Daimona: feel free to [14:18:02] !log [urbanecm@mwmaint2002 ~]$ mwscript migrateUserGroup.php --wiki=metawiki campaignevents-beta-tester event-organizer # T356070 [14:18:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:08] T356070: [EPIC] Change Management of Event Organizer Right - https://phabricator.wikimedia.org/T356070 [14:19:26] The redirects are gone. [14:19:27] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:994234|decodeURI fragments before sending them to discussiontoolsfindcomment (T356199)]], [[gerrit:994235|decodeURI fragments before sending them to discussiontoolsfindcomment (T356199)]], [[gerrit:994708|Add an exception for ConvenientDiscussions-style permalinks (T349653)]], [[gerrit:994709|Add an exception for ConvenientDiscussions-style permalinks (T349653) [14:19:28] ]], [[gerrit:994710|index.php: Restore support for forcesafemode option. 
(T355314)]], [[gerrit:994701|[metawiki] Rename the campaignevents-beta-tester group (T356070)]] (duration: 10m 31s) [14:19:34] T356199: Diacritics in talk pages permalinks cause comments to not being found - https://phabricator.wikimedia.org/T356199 [14:19:35] T349653: Permalink redirecting: Support plain section links - https://phabricator.wikimedia.org/T349653 [14:19:35] T355314: [1.42.0-wmf.13] safemode is not propagated on pages - https://phabricator.wikimedia.org/T355314 [14:19:40] so, first bunch of patches is done [14:20:20] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:994702|[metawiki] Let admins add/remove the event-organizer group (T356070)]], [[gerrit:994711|index.php: Restore support for forcesafemode option. (T355314)]] [14:20:50] !log eevans@cumin1002 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on restbase2020.codfw.wmnet with reason: Decommissioning — T352469 [14:20:56] T352469: Decommission restbase20[13-20]) - https://phabricator.wikimedia.org/T352469 [14:21:05] !log eevans@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on restbase2020.codfw.wmnet with reason: Decommissioning — T352469 [14:21:49] !log urbanecm@deploy2002 daimona and matmarex and urbanecm: Backport for [[gerrit:994702|[metawiki] Let admins add/remove the event-organizer group (T356070)]], [[gerrit:994711|index.php: Restore support for forcesafemode option. (T355314)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:22:30] MatmaRex: Daimona: your patches are at mwdebug. can you take a look please? 
[14:22:38] (the remaining core backport + allowing admins to grant) [14:23:28] (03PS2) 10Urbanecm: beta: Update for campaignevents-beta-tester group rename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994705 (https://phabricator.wikimedia.org/T356070) (owner: 10Daimona Eaytoy) [14:23:35] (03CR) 10Urbanecm: [C: 03+2] beta: Update for campaignevents-beta-tester group rename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994705 (https://phabricator.wikimedia.org/T356070) (owner: 10Daimona Eaytoy) [14:23:35] urbanecm: looks good, safe mode works on wmf.16 too [14:23:40] awesome [14:23:46] urbanecm: looks good -- checkbox on special:userrights is active. [14:23:52] wonderful [14:23:53] !log urbanecm@deploy2002 daimona and matmarex and urbanecm: Continuing with sync [14:23:56] shipping [14:24:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T355609)', diff saved to https://phabricator.wikimedia.org/P55963 and previous config saved to /var/cache/conftool/dbconfig/20240131-142413-marostegui.json [14:24:17] (03Merged) 10jenkins-bot: beta: Update for campaignevents-beta-tester group rename [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994705 (https://phabricator.wikimedia.org/T356070) (owner: 10Daimona Eaytoy) [14:24:20] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [14:25:09] Daimona: can you clarify "Add secret WikimediaCampaignEvents config to beta's PrivateSettings" please? [14:25:32] 10SRE, 10Infrastructure-Foundations: Connection errors to some hosts from cumin1002 - https://phabricator.wikimedia.org/T356174 (10MoritzMuehlenhoff) I've looked into this some more and the previously setup iptables seem to be not fully flushed with the iptables -> nftables handover. If I manually install ipta... [14:25:37] (feel free to commit what you need to the private repo at beta, it'll get synced) [14:25:55] urbanecm: Is the migration script still running? 
Some users haven't been migrated AFAICS. Also, it looks like the group name is not updated in log entries, right? [14:26:03] yep, it is [14:26:08] and yes, log entries are unaffected. [14:26:18] (03CR) 10Filippo Giunchedi: [C: 03+1] jaeger-ui: Allow reaching out to idp [deployment-charts] - 10https://gerrit.wikimedia.org/r/994737 (owner: 10Alexandros Kosiaris) [14:26:21] (03CR) 10Filippo Giunchedi: [C: 03+2] jaeger-ui: Allow reaching out to idp [deployment-charts] - 10https://gerrit.wikimedia.org/r/994737 (owner: 10Alexandros Kosiaris) [14:26:24] we can restore the old messages if desired. [14:26:56] Re secret: we need to configure a couple API secrets/keys. Those should be in PS for obvious reasons. Just lmk where you'd like to find them. [14:27:28] Daimona: you do have shell beta access, right? [14:27:36] I'm not sure if restoring the old messages really makes much of a difference, though. The name would still be different, wouldn't it? [14:27:38] (03PS1) 10Slyngshede: D:prometheus::blackbox::check::tcp allow specifying runbook. [puppet] - 10https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694) [14:27:41] And yes, I do. [14:28:19] Daimona: in that case, can you commit them to PS at `deployment-deploy03:/srv/mediawiki-staging/private`? [14:28:45] Maybe? I've never done that :) [14:28:54] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [14:29:17] lmk if i can help in some way. [14:29:24] (03CR) 10Alexandros Kosiaris: [C: 03+1] hieradata: add jaeger config for SSO oidc [puppet] - 10https://gerrit.wikimedia.org/r/994664 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [14:30:01] (03PS1) 10Hnowlan: thumbor: increase memory limit, namespace limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/994740 [14:30:05] Looks like I do have access to that repo. Any guidelines or policies on how to use it?
[14:30:20] (03PS2) 10Urbanecm: Add WikimediaCampaignEvents to extension list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994176 (https://phabricator.wikimedia.org/T347894) (owner: 10Cmelo) [14:30:26] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:994702|[metawiki] Let admins add/remove the event-organizer group (T356070)]], [[gerrit:994711|index.php: Restore support for forcesafemode option. (T355314)]] (duration: 10m 05s) [14:30:33] T356070: [EPIC] Change Management of Event Organizer Right - https://phabricator.wikimedia.org/T356070 [14:30:33] T355314: [1.42.0-wmf.13] safemode is not propagated on pages - https://phabricator.wikimedia.org/T355314 [14:31:07] 10SRE, 10Infrastructure-Foundations: Connection errors to some hosts from cumin1002 - https://phabricator.wikimedia.org/T356174 (10Volans) Thanks for the deep dive, the plan LGTM! I guess we could try to see if there is a better way to flush iptables+logging rules during the migration, but the risk of still le... [14:31:20] Daimona: basically, just commit there whenever you need a secret added to beta, and follow the standard commit message guidelines. once you commit, it'll get deployed whenever the deployment job runs (every 10 mins, iirc). [14:31:47] it is a good idea to keep https://github.com/wikimedia/operations-mediawiki-config/blob/master/private/readme.php in sync, which is a standard operations/mediawiki-config patch. [14:32:41] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1252/console" [puppet] - 10https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:32:56] and the last remaining thing should be the extension stuff :) [14:33:20] 9 patches in 33 mins!
[14:34:30] (03CR) 10Urbanecm: [C: 03+2] "security review (https://phabricator.wikimedia.org/T350900) determined this to be a low risk => risk automatically accepted per WMF risk m" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994176 (https://phabricator.wikimedia.org/T347894) (owner: 10Cmelo) [14:34:39] OK, let me find the secrets first :) [14:34:55] cmelo, HouseOfM: do you have the beta secrets available? [14:35:13] (03Merged) 10jenkins-bot: Add WikimediaCampaignEvents to extension list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994176 (https://phabricator.wikimedia.org/T347894) (owner: 10Cmelo) [14:35:19] Daimona: sure. i'll continue with adding it to extension-json in the meantime, as that doesn't do anything apart from i18n cache build. [14:35:20] I'm sure I have them somewhere but need to find them first, and then make sure they're the beta ones (not prod). [14:35:26] Yup, sure. [14:36:01] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:994176|Add WikimediaCampaignEvents to extension list (T347894)]] [14:36:07] T347894: Deploy the WikimediaCampaignEvents extension to the Beta Cluster - https://phabricator.wikimedia.org/T347894 [14:36:26] (03CR) 10Eevans: "It exists in hiera at least, yes; I don't know enough about these deployment charts to know how/whether hiera could be readily used to sou" [deployment-charts] - 10https://gerrit.wikimedia.org/r/994357 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans) [14:37:10] 14:36:03 0 languages rebuilt out of 503 [14:37:15] hmm... 
i have a feeling something's not right [14:37:31] !log urbanecm@deploy2002 cmelo and urbanecm: Backport for [[gerrit:994176|Add WikimediaCampaignEvents to extension list (T347894)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:37:32] I will, check if I have them Daimona [14:37:35] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1253/console" [puppet] - 10https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:39:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P55964 and previous config saved to /var/cache/conftool/dbconfig/20240131-143921-marostegui.json [14:39:26] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:40:16] !log urbanecm@deploy2002 cmelo and urbanecm: Continuing with sync [14:40:34] (03CR) 10Slyngshede: "I can break this down into two separate patches if you prefer." [puppet] - 10https://gerrit.wikimedia.org/r/994739 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [14:41:31] 10SRE, 10Infrastructure-Foundations: Connection errors to some hosts from cumin1002 - https://phabricator.wikimedia.org/T356174 (10MoritzMuehlenhoff) >>! In T356174#9502363, @Volans wrote: > Thanks for the deep dive, the plan LGTM! > I guess we could try to see if there is a better way to flush iptables+loggin... 
[14:41:55] Daimona I sent the secrets to you on Slack [14:43:35] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [14:44:15] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:44:52] urbanecm: We're currently trying to figure out which secret is the right one for beta, and I've also just realized that it'll need one more (public) config patch for the API endpoint. This is not critical, and can be postponed to another window if needed. [14:45:04] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host lists2001.codfw.wmnet [14:46:18] Daimona: postponing the extra patch is a good idea. we've 15 mins, and that's getting too close [14:46:25] i assume https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/994179 should not get deployed then? [14:46:43] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:994176|Add WikimediaCampaignEvents to extension list (T347894)]] (duration: 10m 41s) [14:46:52] Yeah, we can hold off [14:47:01] T347894: Deploy the WikimediaCampaignEvents extension to the Beta Cluster - https://phabricator.wikimedia.org/T347894 [14:47:05] in that case, we're done :). [14:47:23] I mean, it wouldn't fail too hard and it wouldn't make beta more broken than it already is, but still. [14:47:36] up2you.
i can enable the extension if you want [14:47:42] that'll fit [14:47:52] No need to, it wouldn't work anyway :) [14:47:56] PROBLEM - Check systemd state on cloudweb1003 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:47:59] okay, fine with me :) [14:48:13] I'll reschedule for another time, including the PS private patch and a public patch for the private readme. [14:48:26] sounds good [14:49:12] The one you found on phab is the right one Daimona [14:49:15] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [14:49:26] RECOVERY - Check systemd state on cloudweb1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:49:32] Yup, let's do that another time just to be sure. [14:49:52] (03CR) 10Filippo Giunchedi: "LGTM overall!" [puppet] - 10https://gerrit.wikimedia.org/r/993476 (https://phabricator.wikimedia.org/T355836) (owner: 10Cwhite) [14:51:11] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host lists2001.codfw.wmnet [14:51:35] thanks for deploying it all urbanecm :) [14:51:44] no problem :) [14:52:17] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] START helmfile.d/aus-k8s-eqiad-services/jaeger: apply [14:52:35] (03CR) 10Alexandros Kosiaris: [C: 03+1] "We might have to also eventually bump the resourcequote, but it's already 1TB of RAM, it should be ok for this round." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/994740 (owner: 10Hnowlan) [14:52:45] Thanks urbanecm :) [14:53:39] !log I'm going to apply kafka log compaction for {eqiad,codfw}.mediawiki.currussearch.page_rerender.v1 on kafka-main-eqiad only (current replica) - T354794 [14:53:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:53:45] T354794: Requesting permission to enable kafka log compaction for page_rerender on kafka-main - https://phabricator.wikimedia.org/T354794 [14:54:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P55965 and previous config saved to /var/cache/conftool/dbconfig/20240131-145427-marostegui.json [14:57:03] (03CR) 10Hnowlan: [C: 03+2] thumbor: increase memory limit, namespace limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/994740 (owner: 10Hnowlan) [14:58:14] !log btullis@cumin1002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling reboot on A:schema [14:59:26] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:59:54] (03Merged) 10jenkins-bot: thumbor: increase memory limit, namespace limit [deployment-charts] - 10https://gerrit.wikimedia.org/r/994740 (owner: 10Hnowlan) [14:59:58] urbanecm: Just one last thing: do you have an ETA for the migration script? [15:00:05] Deploy window Wikifunction Services UTC Afternoon (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240131T1500) [15:01:03] Daimona: not really. but it's running. [15:01:25] Well, that's something. Thank you anyway :) [15:03:07] Heads up, i am deploying some changes on mobileapps that switchover the the outgoing traffic from RESTBase/parsoid to MW. 
[15:04:48] (03CR) 10Jgiannelos: [C: 03+2] mobileapps: Switchover PCS to core page HTML [deployment-charts] - 10https://gerrit.wikimedia.org/r/994177 (https://phabricator.wikimedia.org/T339865) (owner: 10Jgiannelos) [15:05:16] !log filippo@deploy2002 helmfile [aux-k8s-eqiad] DONE helmfile.d/aus-k8s-eqiad-services/jaeger: apply [15:05:18] (03PS2) 10Jgiannelos: mobileapps: Switchover PCS to core page HTML [deployment-charts] - 10https://gerrit.wikimedia.org/r/994177 (https://phabricator.wikimedia.org/T339865) [15:06:07] nemo-yiannis: it'll make it call mw-api-int instead of parsoid? [15:06:21] yes [15:06:24] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:06:26] (03PS1) 10Volans: netboxdb: increase local backup retention [puppet] - 10https://gerrit.wikimedia.org/r/994742 (https://phabricator.wikimedia.org/T316655) [15:06:28] (03PS1) 10Volans: netboxdb: change bacula settings for DB backup [puppet] - 10https://gerrit.wikimedia.org/r/994743 (https://phabricator.wikimedia.org/T316655) [15:07:04] ok I'll keep an eye on it then, do you have an idea how much that'll increase rps? [15:07:12] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:07:38] (03CR) 10CI reject: [V: 04-1] netboxdb: increase local backup retention [puppet] - 10https://gerrit.wikimedia.org/r/994742 (https://phabricator.wikimedia.org/T316655) (owner: 10Volans) [15:07:49] we keep the storage level on restbase for now, so the expected traffic is only caused by pregeneration [15:08:26] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:08:50] claime: not sure about the exact numbers [15:08:54] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[15:09:15] ack, we'll just keep an eye on it and scale if needed then [15:09:15] (03CR) 10Volans: netboxdb: change bacula settings for DB backup (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/994743 (https://phabricator.wikimedia.org/T316655) (owner: 10Volans) [15:09:27] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:09:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T355609)', diff saved to https://phabricator.wikimedia.org/P55966 and previous config saved to /var/cache/conftool/dbconfig/20240131-150934-marostegui.json [15:09:38] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [15:09:44] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [15:09:52] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1156.eqiad.wmnet with reason: Maintenance [15:09:53] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:10:10] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [15:10:17] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1156 (T355609)', diff saved to https://phabricator.wikimedia.org/P55967 and previous config saved to /var/cache/conftool/dbconfig/20240131-151016-marostegui.json [15:10:42] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:14] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:12:30] (03PS1) 10Hnowlan: 
admin_ng: increase container limit for thumbor [deployment-charts] - 10https://gerrit.wikimedia.org/r/994745 [15:12:53] (03PS3) 10Dbrant: Add labs config to test Contact page for account vanishing. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993718 (https://phabricator.wikimedia.org/T343536) [15:14:07] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:14:13] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:14:18] (03PS4) 10Dbrant: Add labs config to test Contact page for account vanishing. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993718 (https://phabricator.wikimedia.org/T343536) [15:14:49] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:14:51] !log btullis@cumin1002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling reboot on A:schema [15:16:28] !log jgiannelos@deploy2002 helmfile [staging] START helmfile.d/services/mobileapps: apply [15:16:32] !log jgiannelos@deploy2002 helmfile [staging] DONE helmfile.d/services/mobileapps: apply [15:16:34] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [15:16:36] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:16:54] (03PS2) 10Volans: netboxdb: increase local backup retention [puppet] - 10https://gerrit.wikimedia.org/r/994742 (https://phabricator.wikimedia.org/T316655) [15:16:56] (03PS2) 10Volans: netboxdb: change bacula settings for DB backup [puppet] - 10https://gerrit.wikimedia.org/r/994743 (https://phabricator.wikimedia.org/T316655) [15:17:39] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:17:45] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:18:34] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:18:53] (03PS1) 10Arnaudb: 
mariadb: migrate core multi-instances nodes [puppet] - 10https://gerrit.wikimedia.org/r/994769 (https://phabricator.wikimedia.org/T344036) [15:20:38] (03PS5) 10Dbrant: Add labs config to test Contact page for account vanishing. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/993718 (https://phabricator.wikimedia.org/T343536) [15:20:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T355609)', diff saved to https://phabricator.wikimedia.org/P55968 and previous config saved to /var/cache/conftool/dbconfig/20240131-152042-marostegui.json [15:20:49] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [15:20:55] claime: done [15:21:42] nemo-yiannis: ack thx [15:23:53] Instant 2krps spike in codfw, going down a bit we'll see where it stabilizes [15:24:32] (03CR) 10Ayounsi: [C: 03+2] Move git search related classes to __init__ [cookbooks] - 10https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [15:26:30] (03CR) 10Hnowlan: [C: 03+2] conftool: restore maps primary servers to kartotherian pool [puppet] - 10https://gerrit.wikimedia.org/r/993702 (https://phabricator.wikimedia.org/T355892) (owner: 10Hnowlan) [15:26:59] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 2 others: Changing default image thumbnail size on English Wikipedia - https://phabricator.wikimedia.org/T355914 (10TheDJ) > Once that's done, I suggest getting rid of thumbsizes user preferences altogether and making the thumbsize wiki-wide in... 
[15:27:07] (03CR) 10Jcrespo: "Please read below" [puppet] - 10https://gerrit.wikimedia.org/r/994743 (https://phabricator.wikimedia.org/T316655) (owner: 10Volans) [15:29:32] (03Merged) 10jenkins-bot: Move git search related classes to __init__ [cookbooks] - 10https://gerrit.wikimedia.org/r/981349 (https://phabricator.wikimedia.org/T350152) (owner: 10Ayounsi) [15:29:46] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=maps1009.eqiad.wmnet [15:32:47] !log ayounsi@cumin2002 START - Cookbook sre.hosts.decommission for hosts testvm2006.codfw.wmnet [15:34:31] !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: name=maps1009.eqiad.wmnet [15:35:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P55969 and previous config saved to /var/cache/conftool/dbconfig/20240131-153549-marostegui.json [15:36:22] (03CR) 10Marostegui: "The ones with the dbstore role are backup sources, so you might want to coordinate with Jaime before shutting those down for the transfers" [puppet] - 10https://gerrit.wikimedia.org/r/994769 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [15:36:31] !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=10; selector: name=maps2009.codfw.wmnet [15:36:46] (03CR) 10Marostegui: [C: 03+1] mariadb: migrate core multi-instances nodes [puppet] - 10https://gerrit.wikimedia.org/r/994769 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [15:36:51] (ATSBackendErrorsHigh) firing: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=esams%20prometheus/ops&var-cluster=text&var-origin=restbase.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:36:59] 
!log ayounsi@cumin2002 START - Cookbook sre.dns.netbox [15:37:28] uh :) [15:37:45] someone's working on that? [15:38:01] nemo-yiannis: ^ [15:38:22] ATS is reporting an increase on 500 (not 503s or 504s) [15:38:52] looking at it [15:39:27] !log ayounsi@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2006.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin2002" [15:39:31] error looks like TypeError [ERR_HTTP_INVALID_HEADER_VALUE]: Invalid value "undefined" for header "content-language" [15:39:47] ok reverting [15:39:52] which is coming from changeprop [15:40:10] it looks like pcs is not setting the content-language header [15:40:26] tnx [15:40:50] (03PS1) 10Jgiannelos: Revert "mobileapps: Switchover PCS to core page HTML" [deployment-charts] - 10https://gerrit.wikimedia.org/r/994712 [15:40:57] (03CR) 10Hnowlan: [C: 03+2] admin_ng: increase container limit for thumbor [deployment-charts] - 10https://gerrit.wikimedia.org/r/994745 (owner: 10Hnowlan) [15:41:10] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: testvm2006.codfw.wmnet decommissioned, removing all IPs except the asset tag one - ayounsi@cumin2002" [15:41:10] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:41:11] !log ayounsi@cumin2002 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts testvm2006.codfw.wmnet [15:41:15] (03CR) 10Hnowlan: [C: 03+1] Revert "mobileapps: Switchover PCS to core page HTML" [deployment-charts] - 10https://gerrit.wikimedia.org/r/994712 (owner: 10Jgiannelos) [15:41:17] (03CR) 10Jgiannelos: [C: 03+2] Revert "mobileapps: Switchover PCS to core page HTML" [deployment-charts] - 10https://gerrit.wikimedia.org/r/994712 (owner: 10Jgiannelos) [15:41:22] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: 
Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by ayounsi@cumin2002 for hosts: `testvm2006.codfw.wmnet` - testvm2006.codfw.wmnet (**... [15:41:51] (ATSBackendErrorsHigh) firing: (4) ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:43:30] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Rmaung) What are our next steps here? Is there something else we need to do on our end or is it up to the Clinic SRE now? [15:43:48] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:43:49] (03Merged) 10jenkins-bot: admin_ng: increase container limit for thumbor [deployment-charts] - 10https://gerrit.wikimedia.org/r/994745 (owner: 10Hnowlan) [15:43:54] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. [15:43:57] (03Merged) 10jenkins-bot: Revert "mobileapps: Switchover PCS to core page HTML" [deployment-charts] - 10https://gerrit.wikimedia.org/r/994712 (owner: 10Jgiannelos) [15:45:06] !log jgiannelos@deploy2002 helmfile [eqiad] START helmfile.d/services/mobileapps: apply [15:45:34] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'. [15:45:39] !log jgiannelos@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mobileapps: apply [15:46:35] !log jgiannelos@deploy2002 helmfile [codfw] START helmfile.d/services/mobileapps: apply [15:46:44] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[15:46:51] (ATSBackendErrorsHigh) firing: (5) ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [15:47:02] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'. [15:47:05] !log jgiannelos@deploy2002 helmfile [codfw] DONE helmfile.d/services/mobileapps: apply [15:47:35] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [15:50:44] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [15:50:56] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P55970 and previous config saved to /var/cache/conftool/dbconfig/20240131-155055-marostegui.json [15:50:58] seems like error rate going back to normal now [15:51:30] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10ABran-WMF) I've asked for off band validation of the SSH Key then will be proceeding with the patch and the next steps [15:51:54] it caused some 5xx on mw-api-int as well, but my computer completely crashed on me at the worst possible moment... 
[15:52:20] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host moscovium.eqiad.wmnet [15:52:22] 10SRE, 10Infrastructure-Foundations: Connection errors to some hosts from cumin1002 - https://phabricator.wikimedia.org/T356174 (10MoritzMuehlenhoff) [15:56:14] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host moscovium.eqiad.wmnet [15:57:04] 10SRE, 10Infrastructure-Foundations: Connection errors to some hosts from cumin1002 - https://phabricator.wikimedia.org/T356174 (10MoritzMuehlenhoff) [15:57:49] nemo-yiannis: for reference, the mw-api-int errors it generated https://logstash.wikimedia.org/goto/f635d69f4c8c27ce8eb33ddab752cf21 [15:58:20] thanks claime [15:58:24] !log installing openssh security updates [15:58:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:00] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [16:01:04] (03PS1) 10Hnowlan: admin_ng: bump overall limit for thumbor memory [deployment-charts] - 10https://gerrit.wikimedia.org/r/994751 [16:01:51] (ATSBackendErrorsHigh) firing: (5) ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [16:02:54] (03PS4) 10Ayounsi: DHCP: set "use-host-decl-names on" [puppet] - 10https://gerrit.wikimedia.org/r/994223 (https://phabricator.wikimedia.org/T300152) [16:04:41] (03CR) 10CI reject: [V: 04-1] DHCP: set "use-host-decl-names on" [puppet] - 10https://gerrit.wikimedia.org/r/994223 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [16:06:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T355609)', diff saved to https://phabricator.wikimedia.org/P55972 and previous config saved to /var/cache/conftool/dbconfig/20240131-160602-marostegui.json [16:06:04] !log marostegui@cumin1002 START - Cookbook 
sre.hosts.downtime for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [16:06:18] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1170.eqiad.wmnet with reason: Maintenance [16:06:25] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1170:3312 (T355609)', diff saved to https://phabricator.wikimedia.org/P55973 and previous config saved to /var/cache/conftool/dbconfig/20240131-160624-marostegui.json [16:06:27] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [16:06:51] (ATSBackendErrorsHigh) firing: (5) ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [16:08:34] Restbase is still serving some 5xx with the same value, does something need to be purged or do we let the long tail resolve? [16:10:01] seems that the error rate is now stable, definitely higher than normal [16:10:02] (03PS3) 10Jcrespo: netboxdb: change bacula settings for DB backup [puppet] - 10https://gerrit.wikimedia.org/r/994743 (https://phabricator.wikimedia.org/T316655) (owner: 10Volans) [16:10:11] Yeah, same content-language errors showing up in restbase logs [16:10:15] it should resolve [16:10:27] coming from changeprop and... wikifeeds?! [16:10:57] i didn't mess with that :) [16:11:06] Hello everyone. I'm going to deploy changes to analytics/refinery on the analytics cluster. AFAICT there are no deployments ongoing right now so... cool? [16:11:24] nemo-yiannis: same error at least https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.11.0-6-2024.05?id=Bn9IYI0BRtLP5wy6ZFmd [16:13:19] issue is still there [16:13:39] https://grafana.wikimedia.org/goto/9HFRCxpSk?orgId=1 [16:13:58] fabfur, urandom, fabfur: could you open an incident report and assume IC role please? [16:14:18] oops.. 
one of fabfur mentions should've been bblack :) [16:14:34] considering you mentioned me 2 times, I'll do that :) [16:14:35] I'm also seeing errors from direct clients rather than changeprop backlogs - is this an issue in restbase? https://logstash.wikimedia.org/app/discover#/doc/0fade920-6712-11eb-8327-370b46f9e7a5/ecs-default-1-1.11.0-6-2024.05?id=BlBKYI0Bk70gu8Gm9qeq [16:14:55] fabfur: that was a bug from my side :) [16:15:03] and/or some mobileapps instances that didn't redeploy? [16:16:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T355609)', diff saved to https://phabricator.wikimedia.org/P55974 and previous config saved to /var/cache/conftool/dbconfig/20240131-161600-marostegui.json [16:16:20] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [16:20:47] hnowlan: i think its wikifeeds calling page/summary that triggers the error [16:21:12] but the root cause is the same (mobileapps not setting the right content-language after the switchover) [16:21:45] looks like that example above is getting it from a client hitting page summary also [16:23:20] maybe we still have some cached version of the PCS responses that don't have the header that's why the long tail of errors [16:24:23] (03CR) 10Jcrespo: [C: 03+1] netboxdb: change bacula settings for DB backup [puppet] - 10https://gerrit.wikimedia.org/r/994743 (https://phabricator.wikimedia.org/T316655) (owner: 10Volans) [16:24:58] there's two errors at work here afaict: "HTTPError: Cannot read property '0' of null" which comes from Changeprop [16:24:59] nemo-yiannis: cached where? 
[16:25:03] restbase [16:25:23] hnowlan: filtering on these shows a baseline of these errors going back further [16:25:33] claime: ah, okay [16:25:40] that's about 200 errors per second [16:25:53] (03PS1) 10Bking: rdf-streaming-updater: Change notification from email to task [alerts] - 10https://gerrit.wikimedia.org/r/994758 (https://phabricator.wikimedia.org/T348685) [16:25:59] regarding the header issue: RESTBase should fail before trying to store the response so i don't see why we still have these errors [16:26:04] (03CR) 10Ayounsi: [C: 03+1] netboxdb: change bacula settings for DB backup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/994743 (https://phabricator.wikimedia.org/T316655) (owner: 10Volans) [16:26:21] the content-language error is coming from real clients afaict, not changeprop [16:26:27] (03CR) 10Ayounsi: [C: 03+1] C:samplicator Icinga monitoring is not required. [puppet] - 10https://gerrit.wikimedia.org/r/994698 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [16:26:57] is there something we can/should drop in restbase or is this going to be too widespread?
[16:27:02] (03CR) 10CI reject: [V: 04-1] rdf-streaming-updater: Change notification from email to task [alerts] - 10https://gerrit.wikimedia.org/r/994758 (https://phabricator.wikimedia.org/T348685) (owner: 10Bking) [16:27:35] (03CR) 10Ayounsi: [C: 03+1] netboxdb: increase local backup retention [puppet] - 10https://gerrit.wikimedia.org/r/994742 (https://phabricator.wikimedia.org/T316655) (owner: 10Volans) [16:27:54] i am trying to think of alternatives [16:28:05] it's either dropping the header on restbase level [16:28:14] or trying to find a way to purge things [16:29:34] (03CR) 10Volans: [C: 03+2] netboxdb: increase local backup retention [puppet] - 10https://gerrit.wikimedia.org/r/994742 (https://phabricator.wikimedia.org/T316655) (owner: 10Volans) [16:31:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P55976 and previous config saved to /var/cache/conftool/dbconfig/20240131-163106-marostegui.json [16:31:26] (03CR) 10Volans: [C: 03+2] "All good" [puppet] - 10https://gerrit.wikimedia.org/r/994743 (https://phabricator.wikimedia.org/T316655) (owner: 10Volans) [16:31:33] (03CR) 10Bking: [C: 03+2] cloudelastic: use acme-chief/letsencrypt with canary [puppet] - 10https://gerrit.wikimedia.org/r/994338 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [16:32:34] (03CR) 10Brouberol: rdf-streaming-updater: Change notification from email to task (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/994758 (https://phabricator.wikimedia.org/T348685) (owner: 10Bking) [16:36:28] PROBLEM - Check correctness of the icinga configuration on alert1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [16:38:27] (03CR) 10Jcrespo: "Personally, I would not shutdown backup sources at all- except for eventual decommission, ofc.
Not that it could not be done like the othe" [puppet] - 10https://gerrit.wikimedia.org/r/994769 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [16:39:22] (03CR) 10Jcrespo: "(also setting them up requires moving the backup config, although that is a very simple patch)" [puppet] - 10https://gerrit.wikimedia.org/r/994769 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [16:39:52] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Jdforrester-WMF) [16:40:18] (03PS1) 10Bking: cloudelastic: allow wmnet hosts to request certs from acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/994763 (https://phabricator.wikimedia.org/T355617) [16:40:32] (03PS1) 10Effie Mouzeli: php: add env[MCROUTER_SERVER] variable [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/994764 (https://phabricator.wikimedia.org/T346690) [16:40:47] nemo-yiannis: do you have an idea of what needs to be purged? and/or how many entries might need to be purged [16:41:30] not really [16:41:36] i am thinking about it [16:43:01] !log phuedx@deploy2002 Started deploy [analytics/refinery@2c00cad]: Regular analytics weekly train [analytics/refinery@2c00cad1] [16:43:02] (03PS2) 10Bking: rdf-streaming-updater: Change notification from email to task [alerts] - 10https://gerrit.wikimedia.org/r/994758 (https://phabricator.wikimedia.org/T348685) [16:44:01] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994763 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [16:45:37] urandom, nemo-yiannis: is it worth doing a query for stuff in restbase that is missing that header? 
seems like it would be awful messy but we could bound it to the last two hours [16:45:42] (03Abandoned) 10Ryan Kemper: wdqs: make wdqs2025 puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/993193 (https://phabricator.wikimedia.org/T354959) (owner: 10Ryan Kemper) [16:46:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P55977 and previous config saved to /var/cache/conftool/dbconfig/20240131-164613-marostegui.json [16:46:16] if its possible on the cassandra side of things yeah [16:46:26] (03CR) 10Ryan Kemper: [C: 03+2] Remove obsolete Hiera entries [puppet] - 10https://gerrit.wikimedia.org/r/994683 (https://phabricator.wikimedia.org/T354959) (owner: 10Muehlenhoff) [16:47:37] (03PS1) 10Ryan Kemper: wdqs graph-split: enable cross-federation [puppet] - 10https://gerrit.wikimedia.org/r/994765 (https://phabricator.wikimedia.org/T355888) [16:48:02] (03PS1) 10Andrea Denisse: grafana: Ensure Loki data is synchronized across instances [puppet] - 10https://gerrit.wikimedia.org/r/994786 (https://phabricator.wikimedia.org/T352665) [16:48:59] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10Gehel) [16:50:55] hnowlan: it's not indexed by anything but 'key' (page title) [16:51:16] so querying by timestamp or a matching header value is a full table scan [16:51:56] 10SRE, 10serviceops: Migrate MW appservers to bullseye - https://phabricator.wikimedia.org/T356293 (10Jdforrester-WMF) [16:52:09] (03CR) 10Ryan Kemper: [V: 03+2 C: 03+2] wdqs graph-split: enable cross-federation [puppet] - 10https://gerrit.wikimedia.org/r/994765 (https://phabricator.wikimedia.org/T355888) (owner: 10Ryan Kemper) [16:52:15] and sadly, it was decided that headers here should be a text field instead of a map, so we couldn't query for a particular key of that map anyway [16:52:36] 
(03CR) 10Scott French: "Sure, I'd be happy to!" [puppet] - 10https://gerrit.wikimedia.org/r/993710 (https://phabricator.wikimedia.org/T356054) (owner: 10Clément Goubert) [16:52:44] (03CR) 10D3r1ck01: "nitpick!" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/994764 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [16:52:53] !log phuedx@deploy2002 Finished deploy [analytics/refinery@2c00cad]: Regular analytics weekly train [analytics/refinery@2c00cad1] (duration: 09m 52s) [16:52:57] right, in that case gathering a list of URLs to purge is the only option? [16:53:00] 10SRE, 10serviceops: Migrate MW appservers to bullseye - https://phabricator.wikimedia.org/T356293 (10Jdforrester-WMF) [16:53:09] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, 10Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Jdforrester-WMF) [16:53:14] the error rate isn't dropping very quickly [16:53:17] 10SRE, 10serviceops: Migrate MW appservers to bullseye - https://phabricator.wikimedia.org/T356293 (10Jdforrester-WMF) 05Open→03Stalled [16:53:23] 10SRE, 10Infrastructure-Foundations, 10Epic: Tracking task for Bullseye migrations in production - https://phabricator.wikimedia.org/T291916 (10Jdforrester-WMF) [16:53:37] was kafka used to set the aberrant content? [16:53:49] could that be used, bounded by the time period? 
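The indexing constraint raised in the channel (rows keyed only by page title, headers stored as a text field rather than a map) can be illustrated with a small sketch. Everything here is hypothetical: the table is an in-memory dict, and JSON as the serialization of the headers field is an assumption made only for illustration, not a description of the real schema. It shows why a lookup by title touches one row while "find everything missing a header" has to read and parse every row.

```python
import json

# Hypothetical in-memory stand-in for the storage table: rows are keyed by
# page title only, and headers are kept as a serialized text blob.
# (JSON is an assumption for illustration, not the real storage format.)
TABLE = {
    "Earth": {"headers": json.dumps({"content-type": "text/html", "content-language": "en"})},
    "Mars": {"headers": json.dumps({"content-type": "text/html"})},
    "Venus": {"headers": json.dumps({"content-type": "text/html", "content-language": "en"})},
}


def lookup_by_title(table, title):
    """Keyed lookup: touches exactly one row."""
    return table.get(title)


def titles_missing_header(table, header):
    """There is no index on header contents, so every row is read and parsed."""
    missing = []
    for title, row in table.items():
        headers = json.loads(row["headers"])
        if header not in headers:
            missing.append(title)
    return sorted(missing)
```

With a list of affected titles in hand (for example from logs), only the keyed path is needed, which is why the discussion turns to collecting URLs.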
[16:53:58] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [16:54:03] !log ayounsi@cumin1002 START - Cookbook sre.hosts.reimage for host sretest2005.codfw.wmnet with OS bookworm [16:54:05] (03PS2) 10Effie Mouzeli: php: add env[MCROUTER_SERVER] variable [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/994764 (https://phabricator.wikimedia.org/T346690) [16:54:12] basically, re-queue it for a forced update? [16:54:20] (03CR) 10Effie Mouzeli: php: add env[MCROUTER_SERVER] variable (032 comments) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/994764 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [16:54:36] urandom: It was an actual request that was served with a response with missing `content-language` [16:54:42] could have been by pregeneration [16:54:49] or actual user request [16:55:01] the first case would have been kafka [16:56:11] !log phuedx@deploy2002 Started deploy [analytics/refinery@2c00cad] (thin): Regular analytics weekly train THIN [analytics/refinery@2c00cad1] [16:56:18] !log phuedx@deploy2002 Finished deploy [analytics/refinery@2c00cad] (thin): Regular analytics weekly train THIN [analytics/refinery@2c00cad1] (duration: 00m 06s) [16:56:52] this definitely seems like a restbase bug, if it'll persist headers that are missing a key that will cause it to 500 on request [16:57:12] (not that that helps this current situation, just don't want that to go unmentioned...) 
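The point made above, that a response missing a required header should be rejected before it is persisted rather than failing later when it is served, can be sketched as follows. This is a minimal illustration in Python, not RESTBase code: the store is a plain dict, and `REQUIRED_HEADERS`, `InvalidResponseError`, `validate_headers` and `store_response` are hypothetical names introduced for this example.

```python
# Hypothetical list of headers that must be present before a response is stored.
REQUIRED_HEADERS = ("content-type", "content-language")


class InvalidResponseError(ValueError):
    """Raised when a response lacks a required header."""


def validate_headers(headers, required=REQUIRED_HEADERS):
    """Return lower-cased headers, or raise if a required one is missing or empty."""
    normalized = {name.lower(): value for name, value in headers.items()}
    problems = [name for name in required if normalized.get(name) in (None, "")]
    if problems:
        raise InvalidResponseError("missing or empty headers: " + ", ".join(problems))
    return normalized


def store_response(store, key, headers, body):
    """Persist only responses that pass validation; `store` is a plain dict here."""
    store[key] = {"headers": validate_headers(headers), "body": body}
```

Failing at write time means a bad upstream response surfaces as one rejected write instead of a stored entry that errors on every later read.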
[16:57:41] !log phuedx@deploy2002 Started deploy [analytics/refinery@2c00cad] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2c00cad1] [16:57:42] I just tried to CSV export my logstash query and it just crashed on me [16:57:44] awesome [16:57:55] 🎉 [16:57:56] (it crashed the whole browser) [16:58:07] hnowlan: we can try deploying a patch that would invalidate all requests between times [16:58:23] *between 2 timestamps [16:58:38] I think there is also a restbase(ish) way of purging, no? a request sent with the apropos headers? [16:58:54] yes but i don't know what we should purge at the moment [16:58:58] we can definitely purge from Cassandra if we can get a list of page titles [16:59:10] maybe the requests from logstash ? [16:59:23] nemo-yiannis: I was trying to export to CSV the last 20 minutes, to then crunch it into unique urls [16:59:32] logstash had other ideas [16:59:34] RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1010 is OK: SSL OK - Certificate cloudelastic1010.eqiad.wmnet valid until 2024-02-27 15:28:00 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/Search [16:59:40] yeah [16:59:50] RECOVERY - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1010 is OK: SSL OK - Certificate cloudelastic1010.eqiad.wmnet valid until 2024-02-27 15:28:00 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/Search [16:59:54] RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1010 is OK: SSL OK - Certificate cloudelastic1010.eqiad.wmnet valid until 2024-02-27 15:28:00 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/Search [17:00:04] yeah it just won't let me [17:00:16] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, tested in pontoon" [puppet] - 10https://gerrit.wikimedia.org/r/994786 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse) [17:00:24] https://github.com/wikimedia/restbase/pull/1341 we can try this but its far from 
tested [17:00:46] godog: do you know of a way that won't crash my browser to export a saved search from logstash to CSV? [17:01:16] !log phuedx@deploy2002 Finished deploy [analytics/refinery@2c00cad] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@2c00cad1] (duration: 03m 35s) [17:01:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T355609)', diff saved to https://phabricator.wikimedia.org/P55978 and previous config saved to /var/cache/conftool/dbconfig/20240131-170120-marostegui.json [17:01:22] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [17:01:24] or anyone in o11y actually [17:01:35] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1182.eqiad.wmnet with reason: Maintenance [17:01:42] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1182 (T355609)', diff saved to https://phabricator.wikimedia.org/P55979 and previous config saved to /var/cache/conftool/dbconfig/20240131-170141-marostegui.json [17:01:45] oh I think I got a CSV [17:01:46] claime: wild, haven't seen that happening although a bit cheeky you could 'copy as curl' the request, first thing that came to mind [17:01:58] not even a CSV, just a list [17:02:03] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:02:20] hnowlan: yay [17:02:43] deploy2002:/home/hnowlan/restbase_urls.csv [17:02:47] (03CR) 10CDanis: [C: 03+1] hieradata: add jaeger config for SSO oidc [puppet] - 10https://gerrit.wikimedia.org/r/994664 (https://phabricator.wikimedia.org/T320555) (owner: 10Filippo Giunchedi) [17:03:25] (03CR) 10CDanis: [C: 03+1] C:samplicator Icinga monitoring is not required. [puppet] - 10https://gerrit.wikimedia.org/r/994698 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [17:04:02] so is it easiest to get RB to purge those urls via curl? 
what's the operation there again, cache-control or something? [17:04:21] ~1k unique urls [17:04:38] (out of the max 10k lines we can export from logstash) [17:04:47] (03PS2) 10Eevans: sessionstore: updated list of Cassandra hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/994357 (https://phabricator.wikimedia.org/T353402) [17:05:01] 10SRE, 10serviceops: confd setup left without configuration doesn't stop confd - https://phabricator.wikimedia.org/T356296 (10Volans) [17:05:12] i can do the script for the restbase purging [17:06:13] (03PS1) 10Effie Mouzeli: mw-debug: set MCROUTER_SERVER variable [deployment-charts] - 10https://gerrit.wikimedia.org/r/994789 (https://phabricator.wikimedia.org/T346690) [17:06:25] or urandom can take that url list and purge from cassandra directly [17:06:52] (ATSBackendErrorsHigh) firing: (2) ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:07:02] or that [17:07:34] RECOVERY - Check correctness of the icinga configuration on alert1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [17:09:20] urandom: can you please do that? [17:09:26] is it safer to do it via RB? [17:09:33] I can, but I don't think there is any advantage to it [17:10:02] it'll look roughly the same, a script to iterate over invocations of curl, versus one that iterates over cassandra queries [17:10:52] hnowlan: as a rule of thumb, it's probably always safer to use the services interface, rather than back-end the database, but in this case it's pretty straightforward [17:11:58] nemo-yiannis: go ahead with the script in that case [17:12:08] on it [17:12:23] claime: how did you create those saved reports? 
when creating one in reporting I just get the option to create images [17:12:48] hnowlan: Share -> Snapshot -> short URL [17:12:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T355609)', diff saved to https://phabricator.wikimedia.org/P55980 and previous config saved to /var/cache/conftool/dbconfig/20240131-171252-marostegui.json [17:12:58] ok purging [17:13:01] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:13:27] claime: I mean the ones in https://logstash.wikimedia.org/app/reports-dashboards#/ [17:14:03] Ah there are the CSVs that crashed my browser [17:14:24] I saved a query, then clicked reporting, generate CSV [17:14:29] But a query from Discover [17:14:37] ahh [17:14:45] https://logstash.wikimedia.org/goto/47cf83822ad386393d4533f39c83581c [17:14:48] PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:14:48] This one [17:14:49] (03PS1) 10Ssingh: admin: add Chris Dobbins to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/994791 [17:16:06] RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:16:43] i think errors are going down [17:16:48] (or i am too optimistic) [17:16:51] (ATSBackendErrorsHigh) resolved: ATS: elevated 5xx errors from restbase.discovery.wmnet #page - https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server#Debugging - https://grafana.wikimedia.org/d/1T_4O08Wk/ats-backends-origin-servers-overview?orgId=1&viewPanel=12&var-datasource=esams%20prometheus/ops&var-cluster=text&var-origin=restbase.discovery.wmnet&editPanel=12 - https://alerts.wikimedia.org/?q=alertname%3DATSBackendErrorsHigh [17:17:10] nemo-yiannis: looks that way [17:17:13] (03CR) 10Ssingh: "To be merged later." 
[puppet] - 10https://gerrit.wikimedia.org/r/994791 (owner: 10Ssingh) [17:17:32] looks like it, we may need to repeat the export -> unique -> purge again because of the 10k limit of the CSV [17:17:52] nemo-yiannis: could you document the method you used on wikitech when things calm down/tomorrow? [17:17:58] RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1010 is OK: SSL OK - Certificate cloudelastic1010.eqiad.wmnet valid until 2024-02-27 15:28:00 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/Search [17:18:03] hnowlan: sure [17:18:29] hnowlan: how did you manage to get the url list btw? [17:18:34] (03PS5) 10JHathaway: DHCP: set "use-host-decl-names on" [puppet] - 10https://gerrit.wikimedia.org/r/994223 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi) [17:18:48] yeah thanks for the list hnowlan [17:18:49] claime: I was able to download it okay for whatever reason [17:18:57] dang [17:18:59] I'm going to deploy another change to analytics/refinery on the analytics cluster [17:19:15] client side issue then [17:19:28] !log phuedx@deploy2002 Started deploy [analytics/refinery@bef134c]: Regular analytics weekly train [analytics/refinery@bef134c2] [17:19:49] hnowlan: did you get it from the reports url you gave earlier? 
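(editor's note) The "export -> unique -> purge" loop mentioned above depends on reducing a Logstash export, capped at 10k lines, to a list of distinct URLs. A minimal sketch of that reduction step follows. It assumes the export is a plain list with one URL per line, as hnowlan described ("not even a CSV, just a list"). The function name and the sample URLs are hypothetical, and this is not the script that was actually run during the incident.

```python
def unique_urls(lines):
    """Return the distinct, non-empty URLs from an iterable of lines,
    keeping the order in which each URL was first seen."""
    seen = set()
    ordered = []
    for line in lines:
        url = line.strip()
        if url and url not in seen:
            seen.add(url)
            ordered.append(url)
    return ordered


if __name__ == "__main__":
    # Hypothetical sample standing in for an exported URL list.
    sample = [
        "/en.wikipedia.org/v1/page/html/Foo\n",
        "/en.wikipedia.org/v1/page/html/Bar\n",
        "/en.wikipedia.org/v1/page/html/Foo\n",
        "\n",
    ]
    for url in unique_urls(sample):
        print(url)
```

Keeping first-seen order is a convenience only; it makes the resulting list easier to compare by eye against the original export.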
[17:19:55] claime: yeah [17:20:05] I crashed on generation, didn't try the download afterwards [17:20:11] ahh [17:20:18] RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1010 is OK: SSL OK - Certificate cloudelastic1010.eqiad.wmnet valid until 2024-02-27 15:28:00 +0000 (expires in 26 days) https://wikitech.wikimedia.org/wiki/Search [17:20:22] I'll redo an export for the last 5 minutes [17:21:07] seeing a sharp decline in 500s in logstash [17:21:27] still purging URLs on my end [17:21:40] nope, download also crashes chromium x) [17:21:51] and just logging in crashes firefox for some reason [17:22:00] web tools are fun [17:22:16] nemo-yiannis: ack, so we may not need to repeat, we'll see [17:22:17] I'll upload another [17:22:32] hnowlan: last one I generated is the last 10 minutes [17:22:44] out of curiosity are the 500s logged on webrequest logs ? [17:22:50] that would be another source [17:24:10] (03CR) 10JHathaway: [C: 03+2] Add DKIM & SPF records for wikimediafoundation.myshopify.com [dns] - 10https://gerrit.wikimedia.org/r/994333 (https://phabricator.wikimedia.org/T355833) (owner: 10JHathaway) [17:24:16] (03PS3) 10JHathaway: Add DKIM & SPF records for wikimediafoundation.myshopify.com [dns] - 10https://gerrit.wikimedia.org/r/994333 (https://phabricator.wikimedia.org/T355833) [17:24:47] new URLs are in restbase_urls_2.csv [17:25:05] of the 1329 uniques in the file, 680 are in common with the first [17:25:28] but just to warn nemo-yiannis there are duplicates [17:25:39] its ok thanks [17:25:47] i will run only new ones [17:26:11] https://w.wiki/92g2 they are, but sampled 1/128th [17:27:16] 👍 [17:27:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P55981 and previous config saved to /var/cache/conftool/dbconfig/20240131-172758-marostegui.json [17:28:28] good call on webrequests nemo-yiannis [17:28:35] that at least doesn't crash my browser [17:28:38] 
:p [17:29:06] (03CR) 10MVernon: [C: 03+1] sessionstore: updated list of Cassandra hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/994357 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans) [17:29:27] (ProbeDown) firing: Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:29:58] PROBLEM - SSH on vrts1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:30:33] !log phuedx@deploy2002 Finished deploy [analytics/refinery@bef134c]: Regular analytics weekly train [analytics/refinery@bef134c2] (duration: 11m 05s) [17:30:54] !log phuedx@deploy2002 Started deploy [analytics/refinery@bef134c] (thin): Regular analytics weekly train THIN [analytics/refinery@bef134c2] [17:31:03] !log phuedx@deploy2002 Finished deploy [analytics/refinery@bef134c] (thin): Regular analytics weekly train THIN [analytics/refinery@bef134c2] (duration: 00m 08s) [17:31:24] RECOVERY - SSH on vrts1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:31:41] !log phuedx@deploy2002 Started deploy [analytics/refinery@bef134c] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@bef134c2] [17:32:16] (03CR) 10Subramanya Sastry: [C: 03+1] Turn on DT visual enhancements on wikitech [mediawiki-config] - 10https://gerrit.wikimedia.org/r/991039 (https://phabricator.wikimedia.org/T355374) (owner: 10C. 
Scott Ananian) [17:33:16] running purges again [17:33:19] 10SRE, 10Ganeti, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Investigate Ganeti in routed mode - https://phabricator.wikimedia.org/T300152 (10ayounsi) For the latter, some more debug: I added ` shared-network "test" { subnet 10.192.24.1 netmask 255.255.255.255 { opti... [17:33:57] 10SRE, 10DNS, 10Foundational Technology Requests, 10Traffic, 10Patch-For-Review: Ensure that wikimediafoundation.myshopify.com complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355833 (10jhathaway) @bcampbell patch is merged, if you want to give it a whirl! [17:35:01] (03PS1) 10Ryan Kemper: wdqs: allow further federation to freiburg [puppet] - 10https://gerrit.wikimedia.org/r/994793 (https://phabricator.wikimedia.org/T339347) [17:35:11] !log phuedx@deploy2002 Finished deploy [analytics/refinery@bef134c] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@bef134c2] (duration: 03m 29s) [17:36:00] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Arinaigu) The SSH public key I provided on the ticket is newly created and has not been used for WMCS or anything else. What else is needed to do the off band validation? 
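(editor's note) For the second round, only URLs that were not already in the first list needed purging: of the 1329 unique URLs in restbase_urls_2.csv, 680 were shared with the first file, which leaves 1329 - 680 = 649 new ones. A minimal sketch of that "only new ones" filter follows, assuming both lists are already loaded into memory. The function name and sample values are hypothetical, and the real purge script may have worked differently.

```python
def new_only(first_round, second_round):
    """Return URLs from second_round that were not in first_round,
    without duplicates, in the order they first appear."""
    already_purged = set(first_round)
    seen = set()
    fresh = []
    for url in second_round:
        if url not in already_purged and url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


if __name__ == "__main__":
    # Hypothetical sample values standing in for the two exported lists.
    round_one = ["/page/html/A", "/page/html/B"]
    round_two = ["/page/html/B", "/page/html/C", "/page/html/C"]
    print(new_only(round_one, round_two))
```

Using a set for the first round keeps each membership check constant-time, which matters little at ~1k URLs but avoids quadratic behaviour if the lists grow.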
[17:36:02] PROBLEM - SSH on vrts1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:37:08] PROBLEM - clamd running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [17:37:32] RECOVERY - SSH on vrts1001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:37:44] PROBLEM - Check systemd state on vrts1001 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service,clamav-freshclam.service,exim4.service,puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:38:38] RECOVERY - clamd running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name clamd https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [17:38:38] PROBLEM - freshclam running on vrts1001 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 114 (clamav), command name freshclam https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [17:39:27] (ProbeDown) resolved: Service vrts1001:1443 has failed probes (http_ticket_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#vrts1001:1443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:43:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P55982 and previous config saved to /var/cache/conftool/dbconfig/20240131-174305-marostegui.json [17:43:52] hnowlan: claime done with the second round of URLs [17:44:12] (03CR) 10Bking: rdf-streaming-updater: Change notification from email to task (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/994758 (https://phabricator.wikimedia.org/T348685) (owner: 10Bking) [17:44:14] Also for future 
reference: https://wikitech.wikimedia.org/wiki/RESTBase#Notes_on_purging I added some notes here [17:44:22] great, thank you [17:44:58] we've evened out at like .4 rps returning 500- shall we wait for that to clear naturally or do one last sweep? it's getting late for most of us [17:45:26] !log aokoth@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM vrts1001.eqiad.wmnet [17:45:31] 10SRE, 10Infrastructure-Foundations: prometheus-node-exporter errors on firewall-running.prom content - https://phabricator.wikimedia.org/T356305 (10Volans) p:05Triage→03High [17:45:42] !log aokoth@cumin1002 END (FAIL) - Cookbook sre.ganeti.reboot-vm (exit_code=99) for VM vrts1001.eqiad.wmnet [17:45:57] (03PS1) 10Volans: firewall: fix nftables metric exporter [puppet] - 10https://gerrit.wikimedia.org/r/994794 (https://phabricator.wikimedia.org/T356305) [17:46:21] !log aokoth@cumin1002 START - Cookbook sre.ganeti.reboot-vm for VM vrts1001.eqiad.wmnet [17:47:42] RECOVERY - freshclam running on vrts1001 is OK: PROCS OK: 1 process with UID = 114 (clamav), command name freshclam https://wikitech.wikimedia.org/wiki/VRT_System%23ClamAV [17:47:52] 10SRE, 10DNS, 10Foundational Technology Requests, 10Traffic, 10Patch-For-Review: Ensure that wikimediafoundation.myshopify.com complies with Google's new email sender guidelines - https://phabricator.wikimedia.org/T355833 (10bcampbell) Thanks @jhathaway . I just clicked the button on the Shopify admin co... 
[17:48:08] (03PS1) 10JHathaway: phabricator: verify domain for Google's postmaster tools [dns] - 10https://gerrit.wikimedia.org/r/994795 (https://phabricator.wikimedia.org/T355691) [17:48:20] RECOVERY - Check systemd state on vrts1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:48:23] hnowlan: i am fine either way [17:48:32] (03PS1) 10Majavah: admin: remove non-yubikey key from taavi [puppet] - 10https://gerrit.wikimedia.org/r/994796 [17:48:49] (03CR) 10Dzahn: [C: 03+1] phabricator: verify domain for Google's postmaster tools [dns] - 10https://gerrit.wikimedia.org/r/994795 (https://phabricator.wikimedia.org/T355691) (owner: 10JHathaway) [17:48:59] (03CR) 10CI reject: [V: 04-1] phabricator: verify domain for Google's postmaster tools [dns] - 10https://gerrit.wikimedia.org/r/994795 (https://phabricator.wikimedia.org/T355691) (owner: 10JHathaway) [17:49:12] hnowlan: the only problem might be that we might have content that won't get purged unless there is edits [17:49:18] but we can do another round of purging tomorrow [17:50:11] (03CR) 10Dzahn: [C: 03+1] "" All TTLs for type TXT should match" is what it doesnt like here" [dns] - 10https://gerrit.wikimedia.org/r/994795 (https://phabricator.wikimedia.org/T355691) (owner: 10JHathaway) [17:50:19] (03PS2) 10JHathaway: phabricator: verify domain for Google's postmaster tools [dns] - 10https://gerrit.wikimedia.org/r/994795 (https://phabricator.wikimedia.org/T355691) [17:50:38] (03CR) 10Dzahn: [C: 03+1] phabricator: verify domain for Google's postmaster tools [dns] - 10https://gerrit.wikimedia.org/r/994795 (https://phabricator.wikimedia.org/T355691) (owner: 10JHathaway) [17:50:48] (03CR) 10JHathaway: "fixed, thanks!" 
[dns] - 10https://gerrit.wikimedia.org/r/994795 (https://phabricator.wikimedia.org/T355691) (owner: 10JHathaway) [17:50:52] !log aokoth@cumin1002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM vrts1001.eqiad.wmnet [17:51:35] !log ayounsi@cumin1002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host sretest2005.codfw.wmnet with OS bookworm [17:51:52] (03CR) 10JHathaway: [C: 03+2] phabricator: verify domain for Google's postmaster tools [dns] - 10https://gerrit.wikimedia.org/r/994795 (https://phabricator.wikimedia.org/T355691) (owner: 10JHathaway) [17:51:54] (03CR) 10Majavah: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1254/co" [puppet] - 10https://gerrit.wikimedia.org/r/994167 (https://phabricator.wikimedia.org/T356171) (owner: 10Majavah) [17:52:32] (03PS2) 10Bking: cloudelastic: allow wmnet hosts to request certs from acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/994763 (https://phabricator.wikimedia.org/T355617) [17:52:44] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994763 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [17:55:04] (03CR) 10Majavah: [C: 03+1] cloudelastic: allow wmnet hosts to request certs from acme-chief (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/994763 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [17:55:58] (03CR) 10Bking: [C: 03+2] cloudelastic: allow wmnet hosts to request certs from acme-chief [puppet] - 10https://gerrit.wikimedia.org/r/994763 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [17:56:01] 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor, and 2 others: Changing default image thumbnail size on English Wikipedia - https://phabricator.wikimedia.org/T355914 (10Ganesha811) Is there a pre-generated size around 250px that would be easier to switch to? We can attempt to develop a new consen... 
[17:56:27] (03CR) 10Dzahn: "the port and the statuscode look good per:" [puppet] - 10https://gerrit.wikimedia.org/r/994686 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [17:56:55] (03CR) 10David Caro: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/994167 (https://phabricator.wikimedia.org/T356171) (owner: 10Majavah) [17:57:05] (03CR) 10Hashar: [C: 03+1] "I have cherry picked the patch on the integration Puppet master and ran it on integration-agent-docker-1041:" [puppet] - 10https://gerrit.wikimedia.org/r/994685 (https://phabricator.wikimedia.org/T252310) (owner: 10Hashar) [17:57:18] (03CR) 10Majavah: [V: 03+1 C: 03+2] P:toolforge::mailrelay: ARC sign outbound mail [puppet] - 10https://gerrit.wikimedia.org/r/994167 (https://phabricator.wikimedia.org/T356171) (owner: 10Majavah) [17:58:12] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T355609)', diff saved to https://phabricator.wikimedia.org/P55983 and previous config saved to /var/cache/conftool/dbconfig/20240131-175811-marostegui.json [17:58:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance [17:58:17] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [17:58:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1188.eqiad.wmnet with reason: Maintenance [17:58:34] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1188 (T355609)', diff saved to https://phabricator.wikimedia.org/P55984 and previous config saved to /var/cache/conftool/dbconfig/20240131-175833-marostegui.json [17:58:42] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11): Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10BTullis) a:03BTullis [18:00:05] Deploy window MediaWiki infrastructure (UTC late) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240131T1800) [18:00:36] (03CR) 10Dzahn: [C: 03+1] "well, or I am wrong, since:" [puppet] - 10https://gerrit.wikimedia.org/r/994686 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [18:02:15] (03PS1) 10Btullis: Revoke production shell access from goransm [puppet] - 10https://gerrit.wikimedia.org/r/994799 (https://phabricator.wikimedia.org/T356279) [18:02:16] PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1010 is CRITICAL: SSL CRITICAL - failed to verify cloudelastic1010.eqiad.wmnet against cloudelastic.wikimedia.org, cloudelastic1001.wikimedia.org, cloudelastic1002.wikimedia.org, cloudelastic1003.wikimedia.org, cloudelastic1004.wikimedia.org, cloudelastic1005.wikimedia.org, cloudelastic1006.wikimedia.org, cloudelastic1007.wikimedia.org, cloudelastic1008.wikimedia.o [18:02:16] delastic1009.wikimedia.org https://wikitech.wikimedia.org/wiki/Search [18:02:18] PROBLEM - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1010 is CRITICAL: SSL CRITICAL - failed to verify cloudelastic1010.eqiad.wmnet against cloudelastic.wikimedia.org, cloudelastic1001.wikimedia.org, cloudelastic1002.wikimedia.org, cloudelastic1003.wikimedia.org, cloudelastic1004.wikimedia.org, cloudelastic1005.wikimedia.org, cloudelastic1006.wikimedia.org, cloudelastic1007.wikimedia.org, cloudelastic1008.wikimedia.o [18:02:18] delastic1009.wikimedia.org https://wikitech.wikimedia.org/wiki/Search [18:02:24] PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1010 is CRITICAL: SSL CRITICAL - failed to verify cloudelastic1010.eqiad.wmnet against cloudelastic.wikimedia.org, cloudelastic1001.wikimedia.org, cloudelastic1002.wikimedia.org, cloudelastic1003.wikimedia.org, cloudelastic1004.wikimedia.org, cloudelastic1005.wikimedia.org, cloudelastic1006.wikimedia.org, cloudelastic1007.wikimedia.org, cloudelastic1008.wikimedia [18:02:24] oudelastic1009.wikimedia.org 
https://wikitech.wikimedia.org/wiki/Search [18:02:34] PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1010 is CRITICAL: SSL CRITICAL - failed to verify cloudelastic1010.eqiad.wmnet against cloudelastic.wikimedia.org, cloudelastic1001.wikimedia.org, cloudelastic1002.wikimedia.org, cloudelastic1003.wikimedia.org, cloudelastic1004.wikimedia.org, cloudelastic1005.wikimedia.org, cloudelastic1006.wikimedia.org, cloudelastic1007.wikimedia.org, cloudelastic1008.wikimedia.or [18:02:34] elastic1009.wikimedia.org https://wikitech.wikimedia.org/wiki/Search [18:02:40] agghh [18:02:51] Not a big deal, just annoying. Will fix shortly [18:03:01] ^^ those SSL alerts above, that is [18:03:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T355609)', diff saved to https://phabricator.wikimedia.org/P55985 and previous config saved to /var/cache/conftool/dbconfig/20240131-180319-marostegui.json [18:03:24] PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1010 is CRITICAL: SSL CRITICAL - failed to verify cloudelastic1010.eqiad.wmnet against cloudelastic.wikimedia.org, cloudelastic1001.wikimedia.org, cloudelastic1002.wikimedia.org, cloudelastic1003.wikimedia.org, cloudelastic1004.wikimedia.org, cloudelastic1005.wikimedia.org, cloudelastic1006.wikimedia.org, cloudelastic1007.wikimedia.org, cloudelastic1008.wikimedia.org, [18:03:24] astic1009.wikimedia.org https://wikitech.wikimedia.org/wiki/Search [18:03:29] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [18:03:33] (03CR) 10Marostegui: [C: 03+1] "up to you two, I don't have any strong opinions" [puppet] - 10https://gerrit.wikimedia.org/r/994769 (https://phabricator.wikimedia.org/T344036) (owner: 10Arnaudb) [18:03:44] PROBLEM - Elasticsearch HTTPS for cloudelastic-chi-eqiad on cloudelastic1010 is CRITICAL: SSL CRITICAL - failed to verify cloudelastic1010.eqiad.wmnet against cloudelastic.wikimedia.org, 
cloudelastic1001.wikimedia.org, cloudelastic1002.wikimedia.org, cloudelastic1003.wikimedia.org, cloudelastic1004.wikimedia.org, cloudelastic1005.wikimedia.org, cloudelastic1006.wikimedia.org, cloudelastic1007.wikimedia.org, cloudelastic1008.wikimedia.org, [18:03:44] astic1009.wikimedia.org https://wikitech.wikimedia.org/wiki/Search [18:04:20] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on cloudelastic1010.eqiad.wmnet with reason: T355617 [18:04:25] T355617: Migrate cloudelastic from public to private IPs - https://phabricator.wikimedia.org/T355617 [18:04:36] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on cloudelastic1010.eqiad.wmnet with reason: T355617 [18:07:25] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10BTullis) It isn't 100% clear from the description whether the user should still have production shell acc... [18:09:48] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10BTullis) Adding @MoritzMuehlenhoff for visibility - Do we need to do anything else regarding an off-board... 
[18:17:53] (03CR) 10Brouberol: [C: 03+1] rdf-streaming-updater: Change notification from email to task [alerts] - 10https://gerrit.wikimedia.org/r/994758 (https://phabricator.wikimedia.org/T348685) (owner: 10Bking) [18:18:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P55986 and previous config saved to /var/cache/conftool/dbconfig/20240131-181825-marostegui.json [18:21:07] (03PS1) 10Bking: cloudelastic: Don't validate certs against FQDN [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) [18:21:34] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [18:22:10] (03PS2) 10Bking: cloudelastic: Don't validate certs against FQDN [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) [18:22:24] (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: Change notification from email to task [alerts] - 10https://gerrit.wikimedia.org/r/994758 (https://phabricator.wikimedia.org/T348685) (owner: 10Bking) [18:23:17] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [18:23:36] (03Merged) 10jenkins-bot: rdf-streaming-updater: Change notification from email to task [alerts] - 10https://gerrit.wikimedia.org/r/994758 (https://phabricator.wikimedia.org/T348685) (owner: 10Bking) [18:25:20] (03CR) 10Bking: [C: 03+1] Revoke production shell access from goransm [puppet] - 10https://gerrit.wikimedia.org/r/994799 (https://phabricator.wikimedia.org/T356279) (owner: 10Btullis) [18:26:33] (03CR) 10Btullis: [C: 03+2] Revoke production shell access from goransm [puppet] - 10https://gerrit.wikimedia.org/r/994799 (https://phabricator.wikimedia.org/T356279) (owner: 10Btullis) [18:27:23] (03PS3) 10Bking: cloudelastic: Don't validate certs 
against FQDN [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) [18:29:36] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [18:33:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P55988 and previous config saved to /var/cache/conftool/dbconfig/20240131-183332-marostegui.json [18:33:36] (RdfStreamingUpdaterSpaceUsageTooHigh) firing: (2) The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [18:35:30] (03CR) 10Dzahn: [C: 03+2] gerrit: sync soy email template with version 3.7 [puppet] - 10https://gerrit.wikimedia.org/r/993695 (https://phabricator.wikimedia.org/T355259) (owner: 10Hashar) [18:35:44] (03CR) 10Dzahn: [C: 03+2] gerrit: move soy templates files to unique namespaces [puppet] - 10https://gerrit.wikimedia.org/r/993694 (owner: 10Hashar) [18:37:23] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10Manuel) I looked into this: Goran, upon quitting his contract with WMDE, requested continued private data... 
[18:40:01] !log phuedx@deploy2002 Started deploy [airflow-dags/analytics@5078a6b]: (no justification provided) [18:40:29] !log phuedx@deploy2002 Finished deploy [airflow-dags/analytics@5078a6b]: (no justification provided) (duration: 00m 28s) [18:41:22] (03PS4) 10Bking: cloudelastic: Don't validate certs against FQDN [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) [18:41:59] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [18:43:36] (RdfStreamingUpdaterSpaceUsageTooHigh) resolved: The RDF Streaming Updater is using more than 50GiB of storage - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterSpaceUsageTooHigh [18:46:36] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [18:48:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T355609)', diff saved to https://phabricator.wikimedia.org/P55989 and previous config saved to /var/cache/conftool/dbconfig/20240131-184838-marostegui.json [18:48:41] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance [18:48:45] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [18:48:54] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1197.eqiad.wmnet with reason: Maintenance [18:49:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1197 (T355609)', diff saved to https://phabricator.wikimedia.org/P55990 and previous config saved to /var/cache/conftool/dbconfig/20240131-184900-marostegui.json [18:53:35] 10SRE-Access-Requests, 10Data-Platform-SRE 
(2024.01.22 - 2024.02.11), 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10BTullis) Ah, thanks @Manuel. It seems that I have acted with too much haste. I can revert the change then... [18:53:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T355609)', diff saved to https://phabricator.wikimedia.org/P55991 and previous config saved to /var/cache/conftool/dbconfig/20240131-185345-marostegui.json [18:53:54] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [18:53:58] (03PS1) 10Btullis: Revert "Revoke production shell access from goransm" [puppet] - 10https://gerrit.wikimedia.org/r/994715 [18:54:19] (03PS2) 10Btullis: Revert "Revoke production shell access from goransm" [puppet] - 10https://gerrit.wikimedia.org/r/994715 (https://phabricator.wikimedia.org/T356279) [18:58:19] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10mpopov) > We should add `expiry_date` and `expiry_contact` fields to reflect the NDA @KFrancis are you s... 
[19:00:04] dancy and hashar: Deploy window Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240131T1900) [19:00:05] dancy and hashar: Deploy window MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240131T1900) [19:00:50] (03CR) 10Btullis: [C: 03+2] Revert "Revoke production shell access from goransm" [puppet] - 10https://gerrit.wikimedia.org/r/994715 (https://phabricator.wikimedia.org/T356279) (owner: 10Btullis) [19:01:33] o/ [19:02:56] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994802 (https://phabricator.wikimedia.org/T354434) [19:02:59] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994802 (https://phabricator.wikimedia.org/T354434) (owner: 10TrainBranchBot) [19:03:51] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994802 (https://phabricator.wikimedia.org/T354434) (owner: 10TrainBranchBot) [19:04:42] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10BTullis) I have just deployed the revert, so the changes should be undone and the user should still have... 
[19:08:08] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops: Disk (sdl) failed in ms-be1068 - https://phabricator.wikimedia.org/T356033 (10Jclark-ctr) 05Open→03Resolved received new drive and returned failed drive [19:08:53] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P55992 and previous config saved to /var/cache/conftool/dbconfig/20240131-190852-marostegui.json [19:08:59] (03PS1) 10TrainBranchBot: group0 wikis to 1.42.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994807 (https://phabricator.wikimedia.org/T354434) [19:09:01] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.42.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994807 (https://phabricator.wikimedia.org/T354434) (owner: 10TrainBranchBot) [19:09:06] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10Manuel) > I'm not quite sure that I understand what this means, I just wanted to emphasize that WMDE ha... [19:09:16] I'm rolling the train back to group0 due to new errors. I'll file a ticket shortly. [19:09:48] (03Merged) 10jenkins-bot: group0 wikis to 1.42.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994807 (https://phabricator.wikimedia.org/T354434) (owner: 10TrainBranchBot) [19:11:37] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10Manuel) > Apologies for jumping the gun and revoking access before having thoroughly checked. All good... 
[19:11:59] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10MoritzMuehlenhoff) >>! In T356279#9503507, @BTullis wrote: > We should add `expiry_date` and `expiry_cont... [19:12:09] (03CR) 10Muehlenhoff: [C: 03+1] admin: remove non-yubikey key from taavi [puppet] - 10https://gerrit.wikimedia.org/r/994796 (owner: 10Majavah) [19:17:08] Train blocker: https://phabricator.wikimedia.org/T356322 [19:17:20] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group0 wikis to 1.42.0-wmf.16 refs T354434 [19:17:27] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10MoritzMuehlenhoff) >>! In T356279#9503576, @Manuel wrote: >> from a technical perpective. > > I don't kn... [19:17:36] T354434: 1.42.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T354434 [19:22:01] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10Dzahn) Shouldn't we add the affected user to this ticket and ask them about all this? [19:23:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P55993 and previous config saved to /var/cache/conftool/dbconfig/20240131-192359-marostegui.json [19:25:46] 10SRE, 10SRE-Access-Requests, 10LDAP-Access-Requests: Requesting access to wmf for arinaigum - https://phabricator.wikimedia.org/T355591 (10Dzahn) >>! In T355591#9503106, @Arinaigu wrote: > The SSH public key I provided on the ticket is newly created and has not been used for WMCS or anything else. What else... 
[19:28:16] (03PS1) 10Volans: requestctl-generator: adapt for superset 3 API [puppet] - 10https://gerrit.wikimedia.org/r/994811 (https://phabricator.wikimedia.org/T335356) [19:28:18] (03PS1) 10Volans: requestctl-generator: fix bug for URI filters [puppet] - 10https://gerrit.wikimedia.org/r/994812 [19:28:57] (03CR) 10Volans: [C: 04-1] "To be merged only after the upgrade of Superset to v3" [puppet] - 10https://gerrit.wikimedia.org/r/994811 (https://phabricator.wikimedia.org/T335356) (owner: 10Volans) [19:29:39] (03CR) 10Volans: "Bug found while testing superset-next but unrelated" [puppet] - 10https://gerrit.wikimedia.org/r/994812 (owner: 10Volans) [19:31:09] (03CR) 10Eevans: [C: 03+2] cassandra: cassandra roles for druid-based aqs endpoints [puppet] - 10https://gerrit.wikimedia.org/r/994225 (https://phabricator.wikimedia.org/T352948) (owner: 10Eevans) [19:32:08] (03PS5) 10Bking: cloudelastic: Don't validate certs against FQDN [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) [19:35:01] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [19:38:51] (03PS1) 10Eevans: add (faux) creds for {edit,editor}_analytics roles [labs/private] - 10https://gerrit.wikimedia.org/r/994814 (https://phabricator.wikimedia.org/T352948) [19:39:06] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T355609)', diff saved to https://phabricator.wikimedia.org/P55994 and previous config saved to /var/cache/conftool/dbconfig/20240131-193905-marostegui.json [19:39:08] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1222.eqiad.wmnet with reason: Maintenance [19:39:19] (03CR) 10Eevans: [V: 03+2 C: 03+2] add (faux) creds for {edit,editor}_analytics roles [labs/private] - 10https://gerrit.wikimedia.org/r/994814 (https://phabricator.wikimedia.org/T352948) (owner: 10Eevans) [19:39:20] T355609: 
Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [19:39:21] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1222.eqiad.wmnet with reason: Maintenance [19:39:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1222 (T355609)', diff saved to https://phabricator.wikimedia.org/P55996 and previous config saved to /var/cache/conftool/dbconfig/20240131-193927-marostegui.json [19:41:31] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10Manuel) > it grants running commands under the analytics-wmde user on the stat* hosts Reading T310055#80... [19:42:37] (03PS6) 10Bking: cloudelastic: Don't validate certs against FQDN [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) [19:42:57] (03PS1) 10Eevans: aqs: apply {edit,editor}_analytics roles (users) [puppet] - 10https://gerrit.wikimedia.org/r/994815 (https://phabricator.wikimedia.org/T352948) [19:43:15] (03PS2) 10Eevans: aqs: apply {edit,editor}_analytics roles (users) [puppet] - 10https://gerrit.wikimedia.org/r/994815 (https://phabricator.wikimedia.org/T352948) [19:44:33] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [19:45:19] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994815 (https://phabricator.wikimedia.org/T352948) (owner: 10Eevans) [19:48:35] (03CR) 10Eevans: [C: 03+2] aqs: apply {edit,editor}_analytics roles (users) [puppet] - 10https://gerrit.wikimedia.org/r/994815 (https://phabricator.wikimedia.org/T352948) (owner: 10Eevans) [19:51:46] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T355609)', diff saved to https://phabricator.wikimedia.org/P55997 and previous config 
saved to /var/cache/conftool/dbconfig/20240131-195145-marostegui.json [19:52:03] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [19:52:18] (03PS7) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) [19:53:31] (03PS1) 10Reedy: Gadget: Bump GADGET_CLASS_VERSION [extensions/Gadgets] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994716 (https://phabricator.wikimedia.org/T356322) [19:53:48] dancy: ^^ [19:54:01] master patch has just been +2'd so should be working its way through CI now [19:57:08] Thanks for jump on it Reedy! [19:57:11] *jumping [19:58:22] 10SRE, 10Traffic-Icebox: Lower geodns TTLs from 600 (10min) to 300 (5min) - https://phabricator.wikimedia.org/T140365 (10ssingh) a:03ssingh [19:59:09] !log joal@deploy2002 Started deploy [analytics/refinery@b738b3f]: HOTFIX analytics weekly train [analytics/refinery@b738b3fd] [20:02:29] (03CR) 10Eevans: [C: 03+2] cassandra: create template for aqsloader role & grants [puppet] - 10https://gerrit.wikimedia.org/r/993102 (https://phabricator.wikimedia.org/T355917) (owner: 10Eevans) [20:03:05] Reedy: Once you are done deploying that can you ping me. I want to do a security deploy for T356226 [20:05:40] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10mpopov) I'm also sorry for making the request without knowing about the prior request. >>! In T356279#95... 
[20:06:17] (03PS8) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) [20:06:32] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [20:06:52] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P55998 and previous config saved to /var/cache/conftool/dbconfig/20240131-200652-marostegui.json [20:07:26] (03CR) 10JHathaway: [C: 03+2] postgresql: add explicit Augeas lens [puppet] - 10https://gerrit.wikimedia.org/r/993191 (owner: 10JHathaway) [20:09:47] 10SRE, 10serviceops: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293 (10Jdforrester-WMF) [20:10:00] !log joal@deploy2002 Finished deploy [analytics/refinery@b738b3f]: HOTFIX analytics weekly train [analytics/refinery@b738b3fd] (duration: 10m 51s) [20:11:29] (03PS1) 10Eevans: cassandra: fix erroneous DDL syntax for GRANTs [puppet] - 10https://gerrit.wikimedia.org/r/994820 (https://phabricator.wikimedia.org/T355917) [20:12:44] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - https://phabricator.wikimedia.org/T356279 (10AndrewTavis_WMDE) I apologize for my part of bringing this up with the assumption that such access isn't... 
[20:13:00] (03CR) 10Eevans: [C: 03+2] cassandra: fix erroneous DDL syntax for GRANTs [puppet] - 10https://gerrit.wikimedia.org/r/994820 (https://phabricator.wikimedia.org/T355917) (owner: 10Eevans) [20:14:27] (03PS9) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) [20:15:34] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [20:15:38] I can also wait for my security deploy until the backport window if preferred by the train conductors. [20:15:53] That is preferable to me. [20:16:31] 👍 [20:17:37] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994823 (https://phabricator.wikimedia.org/T128546) [20:21:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P55999 and previous config saved to /var/cache/conftool/dbconfig/20240131-202158-marostegui.json [20:23:54] (03CR) 10Eevans: [C: 03+2] sessionstore: updated list of Cassandra hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/994357 (https://phabricator.wikimedia.org/T353402) (owner: 10Eevans) [20:24:35] !log joal@deploy2002 Started deploy [analytics/refinery@b738b3f] (thin): HOTFIX analytics weekly train -THIN [analytics/refinery@b738b3fd] [20:24:41] !log joal@deploy2002 Finished deploy [analytics/refinery@b738b3f] (thin): HOTFIX analytics weekly train -THIN [analytics/refinery@b738b3fd] (duration: 00m 05s) [20:25:08] !log joal@deploy2002 Started deploy [analytics/refinery@b738b3f] (hadoop-test): HOTFIX analytics weekly train - Test [analytics/refinery@b738b3fd] [20:26:57] (03PS1) 10Kosta Harlan: [beta] tempaccounts: Use same reservedPattern whether enabled/disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994826 
(https://phabricator.wikimedia.org/T342475) [20:27:22] !log eevans@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: apply [20:27:56] (03PS2) 10Jforrester: wikifunctions: Upgrade orchestrator from 2024-01-09-190638 to 2024-01-18-182456 [deployment-charts] - 10https://gerrit.wikimedia.org/r/992756 (https://phabricator.wikimedia.org/T278596) [20:28:08] !log eevans@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply [20:28:44] !log joal@deploy2002 Finished deploy [analytics/refinery@b738b3f] (hadoop-test): HOTFIX analytics weekly train - Test [analytics/refinery@b738b3fd] (duration: 03m 35s) [20:29:22] (03CR) 10Dreamy Jazz: [C: 03+1] [beta] tempaccounts: Use same reservedPattern whether enabled/disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994826 (https://phabricator.wikimedia.org/T342475) (owner: 10Kosta Harlan) [20:31:57] !log eevans@deploy2002 helmfile [codfw] START helmfile.d/services/sessionstore: apply [20:32:09] !log eevans@deploy2002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply [20:32:39] (03PS10) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) [20:33:40] 10SRE, 10Wikimedia-Etherpad, 10collaboration-services: Upgrade etherpad.wikimedia.org to v1.9.6 - https://phabricator.wikimedia.org/T316421 (10Dzahn) >>! In T316421#9501760, @Jelto wrote: > According to [openstack explorer](https://openstack-browser.toolforge.org/project/packaging) @akosiaris you are one of... 
[20:33:44] !log [urbanecm@mwmaint2002 ~]$ mwscript userOptions.php --wiki=testwiki --old-is-default --old=2 --new 1 --nowarn 'echo-subscriptions-web-reverted' # T353225 [20:33:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:33:50] T353225: Echo: Make use of conditional user defaults - https://phabricator.wikimedia.org/T353225 [20:34:33] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [20:35:05] !log eevans@deploy2002 helmfile [codfw] START helmfile.d/services/sessionstore: sync [20:35:15] !log eevans@deploy2002 helmfile [codfw] DONE helmfile.d/services/sessionstore: sync [20:35:53] !log eevans@deploy2002 helmfile [staging] START helmfile.d/services/sessionstore: sync [20:36:02] !log eevans@deploy2002 helmfile [staging] DONE helmfile.d/services/sessionstore: sync [20:36:54] !log eevans@deploy2002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply [20:37:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T355609)', diff saved to https://phabricator.wikimedia.org/P56000 and previous config saved to /var/cache/conftool/dbconfig/20240131-203704-marostegui.json [20:37:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [20:37:11] !log eevans@deploy2002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply [20:37:12] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [20:37:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1225.eqiad.wmnet with reason: Maintenance [20:40:29] (03PS11) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) [20:40:55] (03CR) 10Bking: "check experimental" [puppet] - 
10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [20:42:56] (03PS2) 10Jforrester: Gadget: Bump GADGET_CLASS_VERSION [extensions/Gadgets] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994716 (https://phabricator.wikimedia.org/T356322) (owner: 10Reedy) [20:43:32] jouncebot: nowandnext [20:43:32] For the next 0 hour(s) and 16 minute(s): MediaWiki train - Utc-7+Utc-0 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240131T1900) [20:43:32] In 0 hour(s) and 16 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240131T2100) [20:44:00] dancy: Should I sling out the UBN fix to try to unblock the train? [20:44:08] yes please! [20:44:14] Kk. [20:44:19] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1229.eqiad.wmnet with reason: Maintenance [20:44:27] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [extensions/Gadgets] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994716 (https://phabricator.wikimedia.org/T356322) (owner: 10Reedy) [20:44:33] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1229.eqiad.wmnet with reason: Maintenance [20:44:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1229 (T355609)', diff saved to https://phabricator.wikimedia.org/P56001 and previous config saved to /var/cache/conftool/dbconfig/20240131-204439-marostegui.json [20:44:45] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [20:46:35] (03PS1) 10Eevans: sessionstore: remove EOL hosts [deployment-charts] - 10https://gerrit.wikimedia.org/r/994830 (https://phabricator.wikimedia.org/T353402) [20:47:43] 10SRE-Access-Requests, 10Data-Platform-SRE (2024.01.22 - 2024.02.11), 10Patch-For-Review: Remove production data access for former WMDE staff member goransm - 
https://phabricator.wikimedia.org/T356279 (10KFrancis) >>! In T356279#9503531, @mpopov wrote: >> We should add `expiry_date` and `expiry_contact` fiel... [20:49:13] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T355609)', diff saved to https://phabricator.wikimedia.org/P56002 and previous config saved to /var/cache/conftool/dbconfig/20240131-204913-marostegui.json [20:50:28] (03PS12) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) [20:51:42] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [20:53:14] Reedy, James_F, Dreamy_Jazz: I'm running out of time to roll the train forward as soon as the fix lands so I propose that Dreamy_Jazz, you do your thing after James_F is done. I can roll forward when I get back (~40 mins) or ask hash.ar to roll forward during the next train window. [20:53:45] Okay. Thanks. [20:54:03] I've also got a config deploy to do (but that isn't urgent). [20:54:13] WFM. [20:54:31] Sorry, the CI merge is taking longer than I hoped. [20:54:49] OK.. See y'all in a bit [20:57:59] (03PS1) 10Houseblaster: InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994831 (https://phabricator.wikimedia.org/T355462) [20:59:49] (03PS2) 10Houseblaster: InitialiseSettings: Set wgSignatureValidation to disallow [enwiki] [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994831 (https://phabricator.wikimedia.org/T355462) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240131T2100). 
nyaa~ [21:00:05] jan_drewniak and Dreamy_Jazz: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:26] \o [21:00:38] Sorry for the slow. [21:00:41] jan_drewniak: Do you mind if I do my security deploy first? [21:00:49] * James_F impatiently waits for https://integration.wikimedia.org/ci/job/wmf-quibble-vendor-mysql-php74-docker/49865/console [21:00:52] (after the current backport is complete) [21:01:14] Dreamy_Jazz: yeah, do yours first, I'll do mine later (it usually takes some time) [21:01:19] (03CR) 10Houseblaster: [C: 04-1] "Waiting for MassMessages to be sent out and a month to pass" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994831 (https://phabricator.wikimedia.org/T355462) (owner: 10Houseblaster) [21:01:26] Okay. Thanks. [21:03:05] (03Merged) 10jenkins-bot: Gadget: Bump GADGET_CLASS_VERSION [extensions/Gadgets] (wmf/1.42.0-wmf.16) - 10https://gerrit.wikimedia.org/r/994716 (https://phabricator.wikimedia.org/T356322) (owner: 10Reedy) [21:03:25] Finally. 
[21:03:34] !log jforrester@deploy2002 Started scap: Backport for [[gerrit:994716|Gadget: Bump GADGET_CLASS_VERSION (T356322)]] [21:03:46] T356322: InvalidArgumentException: Unrecognized 'targets' parameter - https://phabricator.wikimedia.org/T356322 [21:04:20] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P56003 and previous config saved to /var/cache/conftool/dbconfig/20240131-210419-marostegui.json [21:05:01] !log jforrester@deploy2002 jforrester and reedy: Backport for [[gerrit:994716|Gadget: Bump GADGET_CLASS_VERSION (T356322)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:05:32] !log jforrester@deploy2002 jforrester and reedy: Continuing with sync [21:07:13] (03CR) 10Ryan Kemper: [C: 03+1] cloudelastic: stop issuing certs for soon-to-be defunct FQDNs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [21:10:38] (03PS13) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) [21:12:05] !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:994716|Gadget: Bump GADGET_CLASS_VERSION (T356322)]] (duration: 08m 31s) [21:12:12] Finally. [21:12:15] Dreamy_Jazz: Over to you. [21:12:22] T356322: InvalidArgumentException: Unrecognized 'targets' parameter - https://phabricator.wikimedia.org/T356322 [21:14:18] Thanks. [21:14:29] 10SRE, 10serviceops: Migrate MW appservers' base images to bullseye - https://phabricator.wikimedia.org/T356293 (10MoritzMuehlenhoff) FWIF, the PHP packages for this are already available in the component/php74 (and used on some initial snapshot* hosts( . They are the same version as the component/php74 we use... 
[21:15:05] (03CR) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [21:16:38] !log Doing security deploy for T356226 [21:16:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229', diff saved to https://phabricator.wikimedia.org/P56004 and previous config saved to /var/cache/conftool/dbconfig/20240131-211926-marostegui.json [21:19:44] (03CR) 10Bking: [C: 03+2] cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/994800 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [21:23:20] !log dreamyjazz Deployed security patch for T356226 [21:28:06] (03PS1) 10Bking: Revert "cloudelastic: stop issuing certs for soon-to-be defunct FQDNs" [puppet] - 10https://gerrit.wikimedia.org/r/994717 [21:30:02] (03CR) 10Bking: [V: 03+2 C: 03+1] Revert "cloudelastic: stop issuing certs for soon-to-be defunct FQDNs" [puppet] - 10https://gerrit.wikimedia.org/r/994717 (owner: 10Bking) [21:30:07] (03CR) 10Bking: [V: 03+2 C: 03+2] Revert "cloudelastic: stop issuing certs for soon-to-be defunct FQDNs" [puppet] - 10https://gerrit.wikimedia.org/r/994717 (owner: 10Bking) [21:30:34] !log dreamyjazz Deployed security patch for T356226 [21:31:46] !log Security deploy done [21:31:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:01] thx Dreamy_Jazz [21:32:13] Logstash looks okay, so shouldn't need to revert it. [21:32:21] Fingers crossed. [21:32:56] Will you proceed with the train now or can I do my config change? 
[21:33:02] (03CR) 10Effie Mouzeli: [C: 03+1] ipoid: Fix chart default ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/992899 (https://phabricator.wikimedia.org/T355167) (owner: 10Alexandros Kosiaris) [21:33:13] I'll do the train now. [21:33:18] 👍 [21:33:33] (03PS1) 10TrainBranchBot: group1 wikis to 1.42.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994837 (https://phabricator.wikimedia.org/T354434) [21:33:35] (03CR) 10TrainBranchBot: [C: 03+2] group1 wikis to 1.42.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994837 (https://phabricator.wikimedia.org/T354434) (owner: 10TrainBranchBot) [21:34:20] (03Merged) 10jenkins-bot: group1 wikis to 1.42.0-wmf.16 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994837 (https://phabricator.wikimedia.org/T354434) (owner: 10TrainBranchBot) [21:34:33] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1229 (T355609)', diff saved to https://phabricator.wikimedia.org/P56005 and previous config saved to /var/cache/conftool/dbconfig/20240131-213432-marostegui.json [21:34:35] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on db1233.eqiad.wmnet with reason: Maintenance [21:34:48] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1233.eqiad.wmnet with reason: Maintenance [21:34:54] (03PS1) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/994838 (https://phabricator.wikimedia.org/T355617) [21:34:55] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1233 (T355609)', diff saved to https://phabricator.wikimedia.org/P56006 and previous config saved to /var/cache/conftool/dbconfig/20240131-213454-marostegui.json [21:34:56] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [21:35:38] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994838 
(https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [21:42:03] !log dancy@deploy2002 rebuilt and synchronized wikiversions files: group1 wikis to 1.42.0-wmf.16 refs T354434 [21:42:16] T354434: 1.42.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T354434 [21:43:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T355609)', diff saved to https://phabricator.wikimedia.org/P56007 and previous config saved to /var/cache/conftool/dbconfig/20240131-214334-marostegui.json [21:43:41] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [21:47:21] Will there be enough time for the rest of the backport window to be done? [21:47:42] Go ahead with your deployment [21:47:48] Thanks! [21:47:55] (03PS2) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/994838 (https://phabricator.wikimedia.org/T355617) [21:47:57] jouncebot: next [21:47:58] In 0 hour(s) and 12 minute(s): Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240131T2200) [21:48:12] Dreamy_Jazz: We're not using our WF Services window, so you can use that too. [21:48:21] Thanks. 
[21:48:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by dreamyjazz@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994826 (https://phabricator.wikimedia.org/T342475) (owner: 10Kosta Harlan) [21:48:51] !log dancy@deploy2002 Synchronized php: group1 wikis to 1.42.0-wmf.16 refs T354434 (duration: 06m 47s) [21:48:57] T354434: 1.42.0-wmf.16 deployment blockers - https://phabricator.wikimedia.org/T354434 [21:49:05] (03CR) 10CI reject: [V: 04-1] cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/994838 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [21:49:13] (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/994838 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [21:50:13] (03PS3) 10Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/994838 (https://phabricator.wikimedia.org/T355617) [21:51:21] (03CR) 10CI reject: [V: 04-1] cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - 10https://gerrit.wikimedia.org/r/994838 (https://phabricator.wikimedia.org/T355617) (owner: 10Bking) [21:51:29] (03Merged) 10jenkins-bot: [beta] tempaccounts: Use same reservedPattern whether enabled/disabled [mediawiki-config] - 10https://gerrit.wikimedia.org/r/994826 (https://phabricator.wikimedia.org/T342475) (owner: 10Kosta Harlan) [21:52:38] jan_drewniak: My deploys are done (the last one was just for betawikis, so no sync was needed). 
[21:53:56] Dreamy_Jazz: thanks, I'll go ahead with mine now [21:54:52] !log Removed already applied patches for T347708 from /srv/patches [21:54:53] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/994823 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [21:54:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:57] T347708: CVE-2024-23172: Several not properly escaped messages in the CheckUser extension - https://phabricator.wikimedia.org/T347708 [21:55:39] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/994823 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [21:56:36] (PS4) Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - https://gerrit.wikimedia.org/r/994838 (https://phabricator.wikimedia.org/T355617) [21:57:47] (CR) CI reject: [V: -1] cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - https://gerrit.wikimedia.org/r/994838 (https://phabricator.wikimedia.org/T355617) (owner: Bking) [21:58:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P56008 and previous config saved to /var/cache/conftool/dbconfig/20240131-215840-marostegui.json [21:59:51] (PS5) Bking: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - https://gerrit.wikimedia.org/r/994838 (https://phabricator.wikimedia.org/T355617) [22:00:04] Deploy window Wikifunction Services UTC Late (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240131T2200) [22:03:27] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/994838 (https://phabricator.wikimedia.org/T355617) (owner: Bking) [22:03:34] (CR) JHathaway: [C: +1] "sounds good, no strong opinion" [puppet] - https://gerrit.wikimedia.org/r/992888
(owner: Majavah) [22:04:06] (CR) JHathaway: "moritz can you review this one as well, when you have a moment" [puppet] - https://gerrit.wikimedia.org/r/993190 (owner: JHathaway) [22:05:10] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:994823| Bumping portals to master (T128546)]] (duration: 07m 26s) [22:05:23] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [22:11:54] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:994823| Bumping portals to master (T128546)]] (duration: 06m 43s) [22:12:07] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [22:13:47] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233', diff saved to https://phabricator.wikimedia.org/P56009 and previous config saved to /var/cache/conftool/dbconfig/20240131-221347-marostegui.json [22:16:00] (CR) Bking: [V: +1] wdqs: allow further federation to freiburg [puppet] - https://gerrit.wikimedia.org/r/994793 (https://phabricator.wikimedia.org/T339347) (owner: Ryan Kemper) [22:16:18] (CR) Ryan Kemper: [C: +2] wdqs: allow further federation to freiburg [puppet] - https://gerrit.wikimedia.org/r/994793 (https://phabricator.wikimedia.org/T339347) (owner: Ryan Kemper) [22:17:17] (PS1) Ryan Kemper: wdqs: allow new federated endpoint [puppet] - https://gerrit.wikimedia.org/r/994842 (https://phabricator.wikimedia.org/T346455) [22:17:47] (CR) Bking: [C: +1] wdqs: allow new federated endpoint [puppet] - https://gerrit.wikimedia.org/r/994842 (https://phabricator.wikimedia.org/T346455) (owner: Ryan Kemper) [22:19:25] (CR) Ryan Kemper: [C: +2] wdqs: allow new federated endpoint [puppet] - https://gerrit.wikimedia.org/r/994842 (https://phabricator.wikimedia.org/T346455) (owner: Ryan Kemper)
[22:28:54] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1233 (T355609)', diff saved to https://phabricator.wikimedia.org/P56010 and previous config saved to /var/cache/conftool/dbconfig/20240131-222853-marostegui.json [22:28:55] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:29:06] T355609: Make cuc_id a bigint - https://phabricator.wikimedia.org/T355609 [22:29:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [22:37:53] (PS4) C. Scott Ananian: Turn on DT visual enhancements on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/991039 (https://phabricator.wikimedia.org/T355374) [22:37:58] (PS6) Ryan Kemper: cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - https://gerrit.wikimedia.org/r/994838 (https://phabricator.wikimedia.org/T355617) (owner: Bking) [22:38:25] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/994838 (https://phabricator.wikimedia.org/T355617) (owner: Bking) [22:39:13] (CR) Ryan Kemper: [C: +1] cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - https://gerrit.wikimedia.org/r/994838 (https://phabricator.wikimedia.org/T355617) (owner: Bking) [22:39:31] (CR) Bking: [C: +2] cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - https://gerrit.wikimedia.org/r/994838 (https://phabricator.wikimedia.org/T355617) (owner: Bking) [22:39:37] (CR) Bking: [V: +2 C: +2] cloudelastic: stop issuing certs for soon-to-be defunct FQDNs [puppet] - https://gerrit.wikimedia.org/r/994838 (https://phabricator.wikimedia.org/T355617) (owner: Bking) [22:47:07] (PS4) Cwhite: logging::collector: add mw accesslog sampling by benthos [puppet] - https://gerrit.wikimedia.org/r/993476
(https://phabricator.wikimedia.org/T355836) [22:47:32] (PS1) Bking: Revert "cloudelastic: stop issuing certs for soon-to-be defunct FQDNs" [puppet] - https://gerrit.wikimedia.org/r/994719 [22:47:52] (CR) Bking: [V: +2 C: +2] Revert "cloudelastic: stop issuing certs for soon-to-be defunct FQDNs" [puppet] - https://gerrit.wikimedia.org/r/994719 (owner: Bking) [22:55:46] (CR) Cwhite: logging::collector: add mw accesslog sampling by benthos (3 comments) [puppet] - https://gerrit.wikimedia.org/r/993476 (https://phabricator.wikimedia.org/T355836) (owner: Cwhite) [22:58:03] (CR) Cwhite: [C: +2] logging::collector: add mw accesslog sampling by benthos [puppet] - https://gerrit.wikimedia.org/r/993476 (https://phabricator.wikimedia.org/T355836) (owner: Cwhite) [23:01:42] (CR) JHathaway: "I gave this a bit of thought, and I think the Systemd timer is probably sufficient and we can drop the check, rather than porting to Prome" [puppet] - https://gerrit.wikimedia.org/r/987431 (https://phabricator.wikimedia.org/T350694) (owner: Slyngshede) [23:05:58] (Abandoned) Joal: Bump mediawiki_history_snapshot to 2023-09 [puppet] - https://gerrit.wikimedia.org/r/965059 (owner: Joal) [23:08:01] (PS4) Cwhite: logstash: stop consuming the full mediawiki accesslog topics [puppet] - https://gerrit.wikimedia.org/r/992657 (https://phabricator.wikimedia.org/T355836) [23:11:09] (PS1) Varnent: Update favicon for Office Wiki and remove old icon. [mediawiki-config] - https://gerrit.wikimedia.org/r/994851 (https://phabricator.wikimedia.org/T144254) [23:28:50] (CR) Catrope: [C: +1] Update favicon for Office Wiki and remove old icon. [mediawiki-config] - https://gerrit.wikimedia.org/r/994851 (https://phabricator.wikimedia.org/T144254) (owner: Varnent)