[00:01:21] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1039603 (owner: TrainBranchBot)
[00:09:10] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on elastic1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:18:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:35:03] PROBLEM - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[00:40:01] RECOVERY - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[00:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on
netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:59:42] SRE, DNS, Traffic: benefactors.wikimedia.org should point somewhere better then the wikimedia.org homepage - https://phabricator.wikimedia.org/T367012 (Pppery) NEW
[01:31:20] SRE, DNS, Traffic: Cleanup unused DNS subdomains - https://phabricator.wikimedia.org/T367012#9874025 (Pppery)
[01:32:04] SRE, DNS, Traffic: Remove iegreview.wikimedia.org from DNS - https://phabricator.wikimedia.org/T367011#9874028 (Pppery) In for a penny, in for a pound - I tested every wikimedia.org subdomain and filed T367012 and T367013
[01:32:20] (PS3) Huji: Add tfj as a shortcut for toolforge-jobs command [puppet] - https://gerrit.wikimedia.org/r/802596 (https://phabricator.wikimedia.org/T309308)
[01:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[01:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:11:55] FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on elastic1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:38:45] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:58:45] RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:10] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on elastic1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:12:45] (PS1) Snwachukwu: Update Eventgate-Wikimedia and Eventstreams repository to Gitlab source and version [deployment-charts] - https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730)
[04:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:18:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:35:01] (CR) KartikMistry: [C:+2] Update Apertium to 2024-06-07-143238-production [deployment-charts] - https://gerrit.wikimedia.org/r/1040195 (https://phabricator.wikimedia.org/T356252) (owner: KartikMistry)
[04:35:36] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[04:35:49] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[04:35:51] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[04:36:07] Updating Apertium
service in some time.
[04:36:07] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[04:36:09] (Merged) jenkins-bot: Update Apertium to 2024-06-07-143238-production [deployment-charts] - https://gerrit.wikimedia.org/r/1040195 (https://phabricator.wikimedia.org/T356252) (owner: KartikMistry)
[04:36:15] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T364069)', diff saved to https://phabricator.wikimedia.org/P64474 and previous config saved to /var/cache/conftool/dbconfig/20240610-043615-marostegui.json
[04:36:19] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[04:36:40] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s7 T366875
[04:36:46] T366875: Switchover s7 master (db2218 -> db2121) - https://phabricator.wikimedia.org/T366875
[04:36:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2121 with weight 0 T366875', diff saved to https://phabricator.wikimedia.org/P64475 and previous config saved to /var/cache/conftool/dbconfig/20240610-043649-root.json
[04:37:04] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T366875
[04:37:35] !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/apertium: apply
[04:37:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2121 from API/vslow/dump T366875', diff saved to https://phabricator.wikimedia.org/P64476 and previous config saved to /var/cache/conftool/dbconfig/20240610-043741-root.json
[04:37:56] !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/apertium: apply
[04:38:13] (PS2) Gerrit maintenance bot: mariadb: Promote db2121 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1039593
(https://phabricator.wikimedia.org/T366875)
[04:38:29] (CR) Marostegui: [C:+2] mariadb: Promote db2121 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1039593 (https://phabricator.wikimedia.org/T366875) (owner: Gerrit maintenance bot)
[04:38:30] (CR) Marostegui: [V:+2 C:+2] mariadb: Promote db2121 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1039593 (https://phabricator.wikimedia.org/T366875) (owner: Gerrit maintenance bot)
[04:40:58] (PS1) Marostegui: db1180: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1040863
[04:41:30] (CR) Marostegui: [C:+2] db1180: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1040863 (owner: Marostegui)
[04:41:59] !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/apertium: apply
[04:42:36] !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/apertium: apply
[04:44:15] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T352010)', diff saved to https://phabricator.wikimedia.org/P64477 and previous config saved to /var/cache/conftool/dbconfig/20240610-044414-ladsgroup.json
[04:44:21] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[04:44:30] !log Rename flaggedpage_pending in s5 T365568
[04:44:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:34] T365568: Drop flaggedpage_pending from production - https://phabricator.wikimedia.org/T365568
[04:49:22] !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/apertium: apply
[04:49:56] !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply
[04:52:41] !log Updated Apertium to 2024-06-07-143238-production (T356252)
[04:52:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:24] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to
https://phabricator.wikimedia.org/P64478 and previous config saved to /var/cache/conftool/dbconfig/20240610-045922-ladsgroup.json
[05:02:15] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:02:19] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:04:09] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:06:01] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:06:11] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52067 bytes in 5.371 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:06:11] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8617 bytes in 1.724 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:06:16] !log Starting s7 codfw failover from db2218 to db2121 - T366875
[05:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:21] T366875: Switchover s7 master (db2218 -> db2121) - https://phabricator.wikimedia.org/T366875
[05:06:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2121 to s7 primary T366875', diff saved to https://phabricator.wikimedia.org/P64479 and previous config saved to /var/cache/conftool/dbconfig/20240610-050637-marostegui.json
[05:07:40] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2218 T366875', diff saved to https://phabricator.wikimedia.org/P64480 and previous config saved to /var/cache/conftool/dbconfig/20240610-050738-root.json
[05:11:45] (PS1) Marostegui: db2218: Disable
notifications [puppet] - https://gerrit.wikimedia.org/r/1040865
[05:12:49] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2218.codfw.wmnet with reason: Long schema change
[05:12:51] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2218.codfw.wmnet with reason: Long schema change
[05:13:06] (CR) Marostegui: [C:+2] db2218: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1040865 (owner: Marostegui)
[05:13:30] !log dbmaint codfw s7 deploy schema change on db2218 T364299
[05:13:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:33] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[05:14:34] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P64481 and previous config saved to /var/cache/conftool/dbconfig/20240610-051432-ladsgroup.json
[05:15:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:18:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:29:42] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T352010)', diff saved to https://phabricator.wikimedia.org/P64482 and previous config saved to /var/cache/conftool/dbconfig/20240610-052941-ladsgroup.json
[05:29:45] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance
[05:29:47] !log
ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance
[05:29:47] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[05:41:51] (PS1) Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1039604 (https://phabricator.wikimedia.org/T367017)
[05:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:48:45] FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:04:21] FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:08:45] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:09:21] RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter -
https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:11:16] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T352010)', diff saved to https://phabricator.wikimedia.org/P64483 and previous config saved to /var/cache/conftool/dbconfig/20240610-061116-ladsgroup.json
[06:11:23] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[06:14:58] (PS1) Gerrit maintenance bot: mariadb: Promote db2207 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1039605 (https://phabricator.wikimedia.org/T367019)
[06:15:26] (PS1) Gerrit maintenance bot: mariadb: Promote db1162 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1040886 (https://phabricator.wikimedia.org/T367020)
[06:15:31] (PS1) Gerrit maintenance bot: wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/1040887 (https://phabricator.wikimedia.org/T367020)
[06:15:46] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:17:00] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T364069)', diff saved to https://phabricator.wikimedia.org/P64484 and previous config saved to /var/cache/conftool/dbconfig/20240610-061658-marostegui.json
[06:17:04] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[06:18:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: Primary switchover s4 T367017
[06:18:45] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 -
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:18:46] T367017: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T367017
[06:18:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2179 with weight 0 T367017', diff saved to https://phabricator.wikimedia.org/P64485 and previous config saved to /var/cache/conftool/dbconfig/20240610-061849-root.json
[06:19:09] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s4 T367017
[06:19:39] !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2179 from API/vslow/dump T367017', diff saved to https://phabricator.wikimedia.org/P64486 and previous config saved to /var/cache/conftool/dbconfig/20240610-061939-root.json
[06:19:57] (PS1) Marostegui: Revert "db2218: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1040571
[06:20:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P64487 and previous config saved to /var/cache/conftool/dbconfig/20240610-062017-root.json
[06:20:21] (CR) Marostegui: [C:+2] Revert "db2218: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1040571 (owner: Marostegui)
[06:26:25] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P64488 and previous config saved to /var/cache/conftool/dbconfig/20240610-062624-ladsgroup.json
[06:32:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P64489 and previous config saved to /var/cache/conftool/dbconfig/20240610-063208-marostegui.json
[06:35:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218
(re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P64490 and previous config saved to /var/cache/conftool/dbconfig/20240610-063524-root.json
[06:36:14] (CR) Marostegui: [C:+2] mariadb: Promote db2179 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1039604 (https://phabricator.wikimedia.org/T367017) (owner: Gerrit maintenance bot)
[06:38:13] !log Starting s4 codfw failover from db2140 to db2179 - T367017
[06:38:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:20] T367017: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T367017
[06:38:22] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s2 T367019
[06:38:30] T367019: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T367019
[06:38:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2179 to s4 primary T367017', diff saved to https://phabricator.wikimedia.org/P64491 and previous config saved to /var/cache/conftool/dbconfig/20240610-063830-root.json
[06:38:44] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s2 T367019
[06:39:04] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2140 T367017', diff saved to https://phabricator.wikimedia.org/P64492 and previous config saved to /var/cache/conftool/dbconfig/20240610-063904-root.json
[06:39:13] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2207 with weight 0 T367019', diff saved to https://phabricator.wikimedia.org/P64493 and previous config saved to /var/cache/conftool/dbconfig/20240610-063912-arnaudb.json
[06:41:06] (CR) Muehlenhoff: [C:+2] Remove profile::base::use_linux510_on_buster [puppet] - https://gerrit.wikimedia.org/r/1040109 (owner: Muehlenhoff)
[06:41:32] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188',
diff saved to https://phabricator.wikimedia.org/P64494 and previous config saved to /var/cache/conftool/dbconfig/20240610-064132-ladsgroup.json
[06:42:17] (CR) DCausse: [C:+1] Deprecate system::role for search roles [puppet] - https://gerrit.wikimedia.org/r/1040125 (owner: Muehlenhoff)
[06:43:47] (PS1) Marostegui: db2140: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1040869
[06:44:23] (CR) Marostegui: [C:+2] db2140: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1040869 (owner: Marostegui)
[06:45:10] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2009.codfw.wmnet
[06:45:46] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:46:42] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Long schema change
[06:46:45] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Long schema change
[06:47:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P64495 and previous config saved to /var/cache/conftool/dbconfig/20240610-064716-marostegui.json
[06:47:37] !log dbmaint codfw s4 deploy schema change on db2140 T364299
[06:47:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:40] T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[06:48:45] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status -
https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:50:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P64496 and previous config saved to /var/cache/conftool/dbconfig/20240610-065031-root.json
[06:53:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2009.codfw.wmnet
[06:54:01] (CR) Muehlenhoff: [C:+2] Deprecate system::role for search roles [puppet] - https://gerrit.wikimedia.org/r/1040125 (owner: Muehlenhoff)
[06:56:41] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T352010)', diff saved to https://phabricator.wikimedia.org/P64497 and previous config saved to /var/cache/conftool/dbconfig/20240610-065640-ladsgroup.json
[06:56:43] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2202.codfw.wmnet with reason: Maintenance
[06:56:44] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[06:56:56] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2202.codfw.wmnet with reason: Maintenance
[06:58:19] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[06:58:32] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[06:59:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2009.codfw.wmnet
[06:59:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search -
https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[07:00:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2009.codfw.wmnet
[07:00:05] Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T0700).
[07:00:05] kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:18] hello
[07:01:44] I'll deploy the patch now
[07:02:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T364069)', diff saved to https://phabricator.wikimedia.org/P64498 and previous config saved to /var/cache/conftool/dbconfig/20240610-070224-marostegui.json
[07:02:28] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[07:02:29] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[07:02:37] (CR) Muehlenhoff: [C:+2] Deprecate system::role for wikikube roles [puppet] - https://gerrit.wikimedia.org/r/1040124 (owner: Muehlenhoff)
[07:02:41] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[07:02:49] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T364069)', diff saved to https://phabricator.wikimedia.org/P64499 and previous config saved to /var/cache/conftool/dbconfig/20240610-070249-marostegui.json
[07:03:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2010.codfw.wmnet
[07:05:38] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 25%: Repooling', diff saved to
https://phabricator.wikimedia.org/P64500 and previous config saved to /var/cache/conftool/dbconfig/20240610-070537-root.json
[07:05:46] FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:07:56] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2010.codfw.wmnet
[07:10:50] <_joe_> jouncebot: nowandnext
[07:10:50] For the next 0 hour(s) and 49 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T0700)
[07:10:50] In 0 hour(s) and 49 minute(s): Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T0800)
[07:11:14] <_joe_> kostajh: lmk when you're done :)
[07:12:09] _joe_: are you able to check something for me with mw kubernetes via https://wikitech.wikimedia.org/wiki/MediaWiki_On_Kubernetes#Get_a_shell_on_a_production_pod ?
[07:12:24] I'd like to see the output of `scandir('/usr/share/GeoIP')`
[07:12:40] (PS1) Brouberol: global_config: expose services for all mariadb hosts and masters [puppet] - https://gerrit.wikimedia.org/r/1040872 (https://phabricator.wikimedia.org/T331894)
[07:13:00] because I need some verification that https://gerrit.wikimedia.org/r/c/operations/puppet/+/1037528 propagated the files to the locations we care about
[07:13:27] it seems like on mwmaint, mwdebug, and mwdeploy, the files are not updated (but then again, that puppet config doesn't target those locations AFAIK)
[07:13:59] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2010.codfw.wmnet
[07:14:23] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2010.codfw.wmnet
[07:14:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[07:15:25] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2011.codfw.wmnet
[07:16:31] <_joe_> yeah let me look at that patch for a sec
[07:17:07] <_joe_> kostajh: in theory the change should affect all mw servers
[07:17:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1009.eqiad.wmnet
[07:17:42] (CR) JMeybohm: [C:+2] push-notifications: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1032519 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[07:18:19] _joe_: I think $fetch_private is not true for the mwdebug/mwmaint servers, perhaps
[07:18:29]
https://gerrit.wikimedia.org/r/c/operations/puppet/+/1037528/21/modules/puppetmaster/manifests/geoip.pp#26 [07:18:32] <_joe_> kostajh: no you're wrong [07:18:39] <_joe_> they are full mediawiki servers [07:18:40] (03CR) 10JMeybohm: [C:03+2] function-orchestrator: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037163 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [07:18:42] (03CR) 10JMeybohm: [C:03+2] function-evaluator: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037162 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [07:18:50] (03Merged) 10jenkins-bot: push-notifications: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032519 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [07:19:51] (03Merged) 10jenkins-bot: function-orchestrator: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037163 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [07:19:55] (03Merged) 10jenkins-bot: function-evaluator: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037162 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [07:20:44] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P64501 and previous config saved to /var/cache/conftool/dbconfig/20240610-072043-root.json [07:20:45] <_joe_> kostajh: on a k8s node, clearly the change had no effect [07:20:57] (03CR) 10JMeybohm: [C:03+1] proton: drop replicas from 12 to 10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040221 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [07:21:39] (03PS1) 10Muehlenhoff: Remove iegreview module [puppet] - 10https://gerrit.wikimedia.org/r/1040873 
(https://phabricator.wikimedia.org/T334415) [07:21:54] (03CR) 10JMeybohm: [C:03+1] admin_ng: bump CPU resourcequota for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040220 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [07:22:16] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/push-notifications: apply [07:22:26] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply [07:23:07] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2011.codfw.wmnet [07:23:22] _joe_: hmm. In the past, I was told (sorry, I have forgotten by whom) that the GeoIP changes would show up on mwmaint server. That's why I added this note to operations/mediawiki-config https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1038723/7/wmf-config/CommonSettings.php#3953 [07:23:30] *would *not* show up [07:23:43] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/push-notifications: apply [07:23:53] (03CR) 10JMeybohm: [V:03+1 C:03+2] etcd::v3: Allow all nodes of an etcd cluster to connect to each other [puppet] - 10https://gerrit.wikimedia.org/r/1038283 (https://phabricator.wikimedia.org/T366465) (owner: 10JMeybohm) [07:24:15] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply [07:24:20] <_joe_> kostajh: whoever told you that is very wrong [07:24:34] just to confirm, could we please try `scandir('/usr/share/GeoIP')` in a production k8s shell? 
[07:25:25] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply [07:25:59] <_joe_> kostajh: already did on the physical hosts where it's mounted from [07:26:05] <_joe_> there is no trace of the new files [07:26:09] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply [07:26:44] (03PS2) 10Brouberol: global_config: expose services for all mariadb hosts and masters [puppet] - 10https://gerrit.wikimedia.org/r/1040872 (https://phabricator.wikimedia.org/T331894) [07:27:17] shouldn't we see the GeoIP enterprise files on `/usr/share/GeoIP`? [07:27:23] <_joe_> kostajh: well actually, they're there but not updated since friday, to be clearer [07:27:45] I see these ones https://gerrit.wikimedia.org/r/c/operations/puppet/+/1037528/21/modules/puppetmaster/manifests/geoip.pp#37 [07:27:47] <_joe_> it should be mounted, yes [07:28:05] <_joe_> wait a sec [07:28:12] are there some logs we can look at of the puppet run? [07:28:25] <_joe_> I am trying to figure out what is going on rn [07:29:19] (03PS1) 10Brouberol: datahub-next: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040874 (https://phabricator.wikimedia.org/T359423) [07:29:20] (03PS1) 10Brouberol: datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423) [07:30:02] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1040872 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [07:30:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1009.eqiad.wmnet [07:31:12] <_joe_> kostajh: ok I got what your mistake is, I misunderstood your original request [07:31:16] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2011.codfw.wmnet [07:31:21] !log jmm@cumin2002 END (PASS) - Cookbook 
sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2011.codfw.wmnet [07:31:24] <_joe_> the enterprise file is under /usr/share/GeoIPInfo [07:31:26] <_joe_> not under [07:31:32] <_joe_> /usr/share/GeoIP [07:32:08] <_joe_> not sure why we're separating files in those two directories [07:32:23] hmm. On mwmaint I get `ls: cannot access '/usr/share/GeoIPInfo/': No such file or directory` [07:32:23] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2012.codfw.wmnet [07:32:43] <_joe_> I'm talking inside the container [07:32:53] _joe_: can you see the GeoLite2 files alongside the Enterprise file? [07:33:04] <_joe_> kostajh: yes [07:33:10] alright, thank you [07:33:16] sorry for the confusion [07:33:27] <_joe_> but they're last week [07:33:33] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [07:33:42] <_joe_> not updated today like the enterprise ones [07:33:48] that should be ok [07:34:02] I think [07:34:15] upstream, they are updated twice per week [07:34:21] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [07:34:31] but I guess the puppet module is supposed to download them more frequently [07:34:59] _joe_: do you think it's ok to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1038723/ or do we need to confirm that those files are updating regularly? 
[07:35:30] <_joe_> kostajh: I think it's ok, but tbh I question the whole approach [07:35:30] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [07:35:40] <_joe_> I'm sorry I wasn't around when this was decided [07:35:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P64502 and previous config saved to /var/cache/conftool/dbconfig/20240610-073549-root.json [07:35:54] <_joe_> but *imho* it would make sense to have ipoid read the maxmind data [07:36:12] <_joe_> instead of mounting these databases inside mediawiki, which we should stop doing instead of expanding [07:36:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1009.eqiad.wmnet [07:36:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1009.eqiad.wmnet [07:36:50] _joe_: I can add that as a proposal in T357753 [07:36:50] T357753: Build next iteration of IPoid using OpenSearch as backend - https://phabricator.wikimedia.org/T357753 [07:37:10] <_joe_> yeah I think it's quite important [07:37:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1010.eqiad.wmnet [07:37:29] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1010.eqiad.wmnet [07:37:34] <_joe_> I don't know why my team didn't tell you, we've been planning to dismiss the maxmind data inside mediawiki for quite some time :( [07:37:45] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1010.eqiad.wmnet [07:37:53] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [07:38:39] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Revert db2207 with weight 500 T367019', diff saved to https://phabricator.wikimedia.org/P64503 and previous config saved to 
/var/cache/conftool/dbconfig/20240610-073838-arnaudb.json [07:38:42] T367019: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T367019 [07:39:38] _joe_: well, for now it is trying to preserve status quo, we are just trying to remove references to Enterprise files which will disappear at the end of July [07:39:44] (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884) (owner: 10Kosta Harlan) [07:40:32] (03Merged) 10jenkins-bot: IPInfo: Switch to using GeoLite2 data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884) (owner: 10Kosta Harlan) [07:41:11] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2207.codfw.wmnet with reason: maintenance [07:41:17] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:1038723|IPInfo: Switch to using GeoLite2 data (T361884)]] [07:41:22] T361884: Remove $wgIPInfoGeoIP2EnterprisePath from production config - https://phabricator.wikimedia.org/T361884 [07:41:24] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2207.codfw.wmnet with reason: maintenance [07:41:58] !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db2207 maintenance', diff saved to https://phabricator.wikimedia.org/P64504 and previous config saved to /var/cache/conftool/dbconfig/20240610-074157-arnaudb.json [07:43:54] !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2207.codfw.wmnet [07:44:05] PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fbc34695280: Failed to establish a 
new connection: [Errno 111] Connection refused)) https://wikitech.w [07:44:05] org/wiki/Search%23Administration [07:44:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2012.codfw.wmnet [07:45:05] RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 756, active_shards: 1774, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_sha [07:45:05] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [07:46:59] PROBLEM - Host kubetcd2005 is DOWN: PING CRITICAL - Packet loss = 100% [07:46:59] PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100% [07:47:00] (03CR) 10DCausse: "Thanks for working on this!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1040211 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer) [07:47:16] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2207.codfw.wmnet [07:48:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1010.eqiad.wmnet [07:50:16] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply [07:50:31] RECOVERY - Host kubetcd2005 is UP: PING OK - Packet loss = 0%, RTA = 30.54 ms [07:50:51] RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 35.27 ms [07:50:59] !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P64505 and previous config saved to /var/cache/conftool/dbconfig/20240610-075056-root.json [07:51:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2012.codfw.wmnet [07:51:32] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2012.codfw.wmnet [07:53:05] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply [07:53:25] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply [07:54:29] still deploying [07:54:35] my tmux session vanished :( [07:54:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1010.eqiad.wmnet [07:55:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1010.eqiad.wmnet [07:55:25] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 10%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64506 and previous config saved to /var/cache/conftool/dbconfig/20240610-075524-arnaudb.json [07:56:03] `tmux ls` shows no session. 
And if I try `scap backport` again, I see `07:55:23 backport is locked by kharlan`. Amir1 urbanecm how should I proceed? [07:56:36] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply [07:57:21] kostajh: there is a process under your account running [07:57:26] (03CR) 10JMeybohm: [C:03+1] "> - the limit is not configurable and is 1000rps per UA" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040211 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer) [07:57:41] `ps aux | grep scap` shows some [07:57:43] !log kharlan@deploy1002 kharlan: Backport for [[gerrit:1038723|IPInfo: Switch to using GeoLite2 data (T361884)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [07:57:47] T361884: Remove $wgIPInfoGeoIP2EnterprisePath from production config - https://phabricator.wikimedia.org/T361884 [07:57:50] !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ping1004.eqiad.wmnet with OS bookworm [07:57:56] 06SRE, 06Infrastructure-Foundations: Move the ping* servers to Bookworm - https://phabricator.wikimedia.org/T366695#9874387 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ping1004.eqiad.wmnet with OS bookworm [07:58:04] (03CR) 10Jelto: [V:03+1] "looks mostly good and thanks for the answers. I think a proper hiera lookup for the configure-projects-bot api token is needed. Let me know" [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes) [07:58:12] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1011.eqiad.wmnet [07:58:24] kostajh: at this point, it should be waiting for your response, so it sounds like a good idea to kill it and start over? 
[07:58:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2013.codfw.wmnet [07:58:56] urbanecm: I tried "kill" but now I get a message about another lock [07:59:02] `07:58:45 concurrent prep is locked by kharlan (pid 29297) on Mon Jun 10 07:41:17 2024` [07:59:09] urbanecm: so remove that process as well? [07:59:20] kostajh: i'd kill the parent (29297) [07:59:31] urbanecm: kostajh: can you ping me when you're done deploying? [07:59:37] yeah [07:59:43] * urbanecm is not deploying anything [07:59:46] !log kharlan@deploy1002 Started scap: Backport for [[gerrit:1038723|IPInfo: Switch to using GeoLite2 data (T361884)]] [07:59:50] taavi: will do. _joe_ is also waiting to hear when I'm done. [08:00:05] hashar: Deploy window Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T0800) [08:00:11] hashar: still finishing up the backport [08:00:22] ^ I will do it once the backports have been completed [08:00:23] no rush [08:02:54] !log kharlan@deploy1002 kharlan: Backport for [[gerrit:1038723|IPInfo: Switch to using GeoLite2 data (T361884)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [08:03:00] T361884: Remove $wgIPInfoGeoIP2EnterprisePath from production config - https://phabricator.wikimedia.org/T361884 [08:03:08] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit2002.wikimedia.org with reason: Gerrit upgrade [08:03:22] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit2002.wikimedia.org with reason: Gerrit upgrade [08:03:33] !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1003.wikimedia.org with reason: Gerrit upgrade [08:03:47] !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit1003.wikimedia.org with reason: Gerrit upgrade [08:04:58] !log kharlan@deploy1002 kharlan: Continuing with sync [08:09:49] (03CR) 10Klausman: 
[C:03+1] admin_ng: update Bookworm-based Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040153 (https://phabricator.wikimedia.org/T356252) (owner: 10Elukey) [08:10:30] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 25%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64507 and previous config saved to /var/cache/conftool/dbconfig/20240610-081030-arnaudb.json [08:13:54] !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:1038723|IPInfo: Switch to using GeoLite2 data (T361884)]] (duration: 14m 07s) [08:13:58] T361884: Remove $wgIPInfoGeoIP2EnterprisePath from production config - https://phabricator.wikimedia.org/T361884 [08:14:15] !log UTC morning deploys done [08:14:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:31] _joe_ taavi hashar I am done with backporting. [08:15:00] please coordinate with each other as to who goes next :) [08:17:45] I think _joe_ was first :-) [08:17:49] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ping1004.eqiad.wmnet with reason: host reimage [08:17:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1011.eqiad.wmnet [08:18:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet [08:19:13] I am upgrading Gerrit [08:19:27] (03CR) 10Hashar: [C:03+2] Merge branch 'deploy/wmf/stable-3.8' into deploy/wmf/stable-3.9 [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1039201 (owner: 10Hashar) [08:19:31] (03CR) 10Hashar: [C:03+2] Gerrit 3.9.5, rebuild plugins and update TypeScript API [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1039610 (https://phabricator.wikimedia.org/T354887) (owner: 10Hashar) [08:20:01] (03Merged) 10jenkins-bot: Merge branch 'deploy/wmf/stable-3.8' into deploy/wmf/stable-3.9 [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1039201 
(owner: 10Hashar) [08:20:02] (03Merged) 10jenkins-bot: Gerrit 3.9.5, rebuild plugins and update TypeScript API [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1039610 (https://phabricator.wikimedia.org/T354887) (owner: 10Hashar) [08:21:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ping1004.eqiad.wmnet with reason: host reimage [08:21:36] (03CR) 10JMeybohm: "This is what I had in mind as well. 30min does seem a good choice I'd say, given this is more like a "hey, something is off" then "I'm on " [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [08:22:15] !log hashar@deploy1002 Started deploy [gerrit/gerrit@092aade]: Gerrit to version 3.9.5 on gerrit2002 - T354887 [08:22:22] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@092aade]: Gerrit to version 3.9.5 on gerrit2002 - T354887 (duration: 00m 07s) [08:23:35] (03PS3) 10Brouberol: deployment_server: alert on admin-ng pending changes [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) [08:24:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1011.eqiad.wmnet [08:24:21] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2013.codfw.wmnet [08:24:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1011.eqiad.wmnet [08:24:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2013.codfw.wmnet [08:25:00] (03Abandoned) 10Volans: Disable Accounting report alerting until bug is fixed [puppet] - 10https://gerrit.wikimedia.org/r/1039978 (https://phabricator.wikimedia.org/T366874) (owner: 10Ayounsi) [08:25:01] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1012.eqiad.wmnet [08:25:37] !log 
arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 50%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64508 and previous config saved to /var/cache/conftool/dbconfig/20240610-082536-arnaudb.json [08:25:58] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2014.codfw.wmnet [08:26:52] I am doing the primary Gerrit now [08:26:59] !log hashar@deploy1002 Started deploy [gerrit/gerrit@092aade]: Gerrit to version 3.9.5 on gerrit1003 - T354887 [08:27:05] !log hashar@deploy1002 Finished deploy [gerrit/gerrit@092aade]: Gerrit to version 3.9.5 on gerrit1003 - T354887 (duration: 00m 05s) [08:30:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T364069)', diff saved to https://phabricator.wikimedia.org/P64509 and previous config saved to /var/cache/conftool/dbconfig/20240610-083042-marostegui.json [08:32:17] !log Gerrit has been upgraded [08:33:20] hashar: I can't seem to add comments to patches. I've done a hard refresh of the page [08:33:34] *inline comments to files on patches, that is [08:33:59] (03PS1) 10Jelto: gitlab:: add blackbox check for ssh service [puppet] - 10https://gerrit.wikimedia.org/r/1041028 (https://phabricator.wikimedia.org/T367021) [08:34:47] kostajh: my guess would be some cache is not in sync and some javascript is lost [08:35:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1012.eqiad.wmnet [08:35:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2014.codfw.wmnet [08:35:44] kostajh: that worked on https://gerrit.wikimedia.org/r/c/test/gerrit-ping/+/998940 [08:35:47] anything in the console? 
[08:36:22] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ping1004.eqiad.wmnet with OS bookworm [08:36:42] hashar: it works on a commit message [08:36:49] but not here https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1025719/3/composer.json#181 [08:36:59] I can add a comment to patches: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1037065/3#message-02c8decf3bbbe9e6a60fdc64bee00418ba48a811 [08:37:06] PROBLEM - Host dse-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100% [08:37:06] PROBLEM - Host kubestagemaster2004 is DOWN: PING CRITICAL - Packet loss = 100% [08:37:28] hashar: the only browser warnings are about font downloads [08:37:34] PROBLEM - Host ml-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100% [08:37:36] PROBLEM - Host kubetcd1004 is DOWN: PING CRITICAL - Packet loss = 100% [08:38:08] hashar / jelto bah, works on Chrome, not on Firefox. Let me try Firefox without plugins [08:38:28] I'm on firefox [08:38:45] FIRING: [3x] ProbeDown: Service ganeti2013:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:38:47] maybe try clear your cache for gerrit? 
[08:38:49] I too ( 115.10.0esr from Debian ) [08:39:33] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4048.ulsfo.wmnet [08:39:44] (03CR) 10Muehlenhoff: "Dummy comment to test Gerrit after update" [puppet] - 10https://gerrit.wikimedia.org/r/1040873 (https://phabricator.wikimedia.org/T334415) (owner: 10Muehlenhoff) [08:39:54] !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp4048.ulsfo.wmnet [08:40:30] RECOVERY - Host dse-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms [08:40:36] RECOVERY - Host ml-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 5.51 ms [08:40:36] JFTR, works for me as well (firefox, no plugins other than WikimediaDebug) [08:40:43] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 75%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64510 and previous config saved to /var/cache/conftool/dbconfig/20240610-084042-arnaudb.json [08:40:49] also 115.10 from Debian [08:40:50] RECOVERY - Host kubestagemaster2004 is UP: PING OK - Packet loss = 0%, RTA = 30.59 ms [08:41:06] RECOVERY - Host kubetcd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms [08:41:12] I'm on Firefox nightly (128.0a1) [08:41:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1012.eqiad.wmnet [08:41:28] Commenting doesn't work in safe mode (extensions/add-ons disabled) [08:41:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1012.eqiad.wmnet [08:41:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2014.codfw.wmnet [08:41:48] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2014.codfw.wmnet [08:42:40] On macOS, the firefox version is 126, and commenting doesn't work with that version either [08:43:45] RESOLVED: [4x] ProbeDown: Service ganeti1012:1811 has failed probes 
(tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:44:14] (03PS2) 10Jelto: gitlab:: add blackbox check for ssh service [puppet] - 10https://gerrit.wikimedia.org/r/1041028 (https://phabricator.wikimedia.org/T367021) [08:45:41] hashar: I have cleared the cache for gerrit on Firefox nightly, and inline commenting still doesn't work [08:45:46] RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [08:45:51] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P64511 and previous config saved to /var/cache/conftool/dbconfig/20240610-084550-marostegui.json [08:45:51] nightly? [08:46:04] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2015.codfw.wmnet [08:46:24] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1013.eqiad.wmnet [08:46:38] am I good to deploy my config patch now? [08:46:39] hashar: tested on nightly (128) and stable (126). [08:46:40] FIRING: KubernetesRsyslogDown: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:46:40] kostajh: is there anything showing up in the browser console? 
[08:47:33] !log cmooney@cumin1002 START - Cookbook sre.dns.netbox [08:48:46] hashar: T367029 [08:48:48] T367029: Inline commenting doesn't work on Gerrit 3.9 with Firefox on macOS - https://phabricator.wikimedia.org/T367029 [08:48:51] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4048.ulsfo.wmnet [08:48:54] (03PS1) 10Muehlenhoff: Change ping host in codfw to ping2004 [homer/public] - 10https://gerrit.wikimedia.org/r/1041030 (https://phabricator.wikimedia.org/T366695) [08:50:12] !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new entries for cr2-codfw peering to ssw1-d8-codfw - cmooney@cumin1002" [08:50:58] !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new entries for cr2-codfw peering to ssw1-d8-codfw - cmooney@cumin1002" [08:50:58] !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [08:53:46] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2015.codfw.wmnet [08:53:55] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1013.eqiad.wmnet [08:54:23] !log upgrade prometheus-statsd-exporter on webperf - T302373 [08:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:27] T302373: Upgrade prometheus-statsd-exporter - https://phabricator.wikimedia.org/T302373 [08:55:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 100%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64512 and previous config saved to /var/cache/conftool/dbconfig/20240610-085548-arnaudb.json [08:56:40] RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - 
https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [08:56:45] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s2 T367019 [08:56:49] T367019: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T367019 [08:57:08] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s2 T367019 [08:57:21] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2207 with weight 0 T367019', diff saved to https://phabricator.wikimedia.org/P64513 and previous config saved to /var/cache/conftool/dbconfig/20240610-085721-arnaudb.json [08:58:32] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [09:00:01] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1013.eqiad.wmnet [09:00:04] kostajh: so previously one could double click to add a comment below? 
[09:00:33] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1013.eqiad.wmnet [09:00:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P64514 and previous config saved to /var/cache/conftool/dbconfig/20240610-090058-marostegui.json [09:01:15] !log upload prometheus-statsd-exporter 0.26.1-1 to apt - T302373 [09:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:01:18] T302373: Upgrade prometheus-statsd-exporter - https://phabricator.wikimedia.org/T302373 [09:01:45] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2015.codfw.wmnet [09:01:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2015.codfw.wmnet [09:03:22] <_joe_> oh sorry folks I went to do other stuff and decided to deploy in the infra window [09:03:28] <_joe_> given mine is an infra change [09:06:31] (03PS2) 10Volans: reports: accounting, support swapped motherboards [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1037538 (https://phabricator.wikimedia.org/T358542) [09:07:52] (03PS3) 10Volans: reports: accounting, support swapped motherboards [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1037538 (https://phabricator.wikimedia.org/T358542) [09:08:23] (03CR) 10Volans: "tested on netbox-next:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1037538 (https://phabricator.wikimedia.org/T358542) (owner: 10Volans) [09:09:33] 10ops-codfw, 06SRE, 10SRE-tools, 06DC-Ops, and 2 others: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9874533 (10Volans) I've also manually fixed a bunch of warnings due to a clearly mistyped phabricator task number in the spreadsheet. The patch has been test... 
[09:13:01] (CR) Arnaudb: [C:+2] mariadb: Promote db2207 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1039605 (https://phabricator.wikimedia.org/T367019) (owner: Gerrit maintenance bot)
[09:13:17] (CR) JMeybohm: deployment_server: alert on admin-ng pending changes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[09:14:23] !log Starting s2 codfw failover from db2204 to db2207 - T367019
[09:14:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:26] T367019: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T367019
[09:14:31] (CR) EoghanGaffney: [C:+1] gitlab:: add blackbox check for ssh service [puppet] - https://gerrit.wikimedia.org/r/1041028 (https://phabricator.wikimedia.org/T367021) (owner: Jelto)
[09:15:07] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2207 to s2 primary T367019', diff saved to https://phabricator.wikimedia.org/P64515 and previous config saved to /var/cache/conftool/dbconfig/20240610-091506-arnaudb.json
[09:16:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T364069)', diff saved to https://phabricator.wikimedia.org/P64516 and previous config saved to /var/cache/conftool/dbconfig/20240610-091606-marostegui.json
[09:16:10] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[09:16:13] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[09:16:23] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[09:16:31] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T364069)', diff saved to https://phabricator.wikimedia.org/P64517 and previous config saved to /var/cache/conftool/dbconfig/20240610-091631-marostegui.json
[09:17:02] (CR) Volans: [C:+2] "Self-merging as the diffs from PS1 to PS3 are trivial typos" [software/netbox-extras] - https://gerrit.wikimedia.org/r/1037538 (https://phabricator.wikimedia.org/T358542) (owner: Volans)
[09:17:49] (Merged) jenkins-bot: reports: accounting, support swapped motherboards [software/netbox-extras] - https://gerrit.wikimedia.org/r/1037538 (https://phabricator.wikimedia.org/T358542) (owner: Volans)
[09:20:48] jouncebot: nownadnext
[09:20:54] jouncebot: nowandnext
[09:20:54] No deployments scheduled for the next 0 hour(s) and 39 minute(s)
[09:20:54] In 0 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1000)
[09:21:47] (PS1) Hnowlan: fonts: add opendyslexic [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1041033 (https://phabricator.wikimedia.org/T285650)
[09:21:50] (CR) TrainBranchBot: [C:+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1040222 (owner: Majavah)
[09:22:07] !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[09:22:10] !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[09:22:26] (Merged) jenkins-bot: Reapply "wikitech: Replace OSM class in Gerrit blocking hook" [mediawiki-config] - https://gerrit.wikimedia.org/r/1040222 (owner: Majavah)
[09:22:45] !log taavi@deploy1002 Started scap: Backport for [[gerrit:1040222|Reapply "wikitech: Replace OSM class in Gerrit blocking hook"]]
[09:24:37] !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[09:24:43] !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[09:25:02] !log taavi@deploy1002 taavi: Backport for [[gerrit:1040222|Reapply "wikitech:
Replace OSM class in Gerrit blocking hook"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:25:13] !log taavi@deploy1002 taavi: Continuing with sync
[09:25:26] (no way to test wikitech changes on mwdebug :/)
[09:26:45] FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[09:30:02] ops-codfw, SRE, SRE-tools, DC-Ops, and 2 others: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9874557 (Volans) Open→Resolved This is now completed. The new runs are not alerting for these hosts with replaced motherboards. @wiki_willy co...
[09:33:14] (CR) JMeybohm: k8s: send logs to per-cluster kafka topics (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1040170 (https://phabricator.wikimedia.org/T366710) (owner: Filippo Giunchedi)
[09:34:02] !log taavi@deploy1002 Finished scap: Backport for [[gerrit:1040222|Reapply "wikitech: Replace OSM class in Gerrit blocking hook"]] (duration: 11m 17s)
[09:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[09:35:50] (PS1) Brouberol: datahub: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978)
[09:36:16] (CR) Hnowlan: [C:+1] thumbor: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1037166
(https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[09:37:07] !log roll upgrade prometheus-statsd-exporter to baremetal - T302373
[09:37:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:11] T302373: Upgrade prometheus-statsd-exporter - https://phabricator.wikimedia.org/T302373
[09:37:33] (CR) JMeybohm: [C:+2] thumbor: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1037166 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[09:38:36] (Merged) jenkins-bot: thumbor: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1037166 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[09:47:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[09:47:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[09:47:51] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4048.ulsfo.wmnet
[09:49:38] (CR) Jelto: [V:+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2829/console" [puppet] - https://gerrit.wikimedia.org/r/1041028 (https://phabricator.wikimedia.org/T367021) (owner: Jelto)
[09:50:17] (PS1) Filippo Giunchedi: statsd-exporter: bump version to upgrade [docker-images/production-images] - https://gerrit.wikimedia.org/r/1041037
[09:51:35] (CR) Giuseppe Lavagetto: [C:+1] "nitpick on the version number, otherwise LGTM" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1041037 (owner: Filippo
Giunchedi)
[09:53:16] (CR) Jelto: [V:+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2830/" [puppet] - https://gerrit.wikimedia.org/r/1041028 (https://phabricator.wikimedia.org/T367021) (owner: Jelto)
[09:53:28] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply
[09:53:52] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[09:54:01] (PS2) Brouberol: datahub: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978)
[09:54:25] (PS2) Majavah: P:openstack: opentofu: fix configuration [puppet] - https://gerrit.wikimedia.org/r/1040145
[09:54:34] (PS2) Filippo Giunchedi: statsd-exporter: bump version to upgrade [docker-images/production-images] - https://gerrit.wikimedia.org/r/1041037
[09:54:40] (CR) CI reject: [V:-1] datahub: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978) (owner: Brouberol)
[09:54:50] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on 870 hosts with reason: Issue from T367019
[09:54:51] !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 5:00:00 on 870 hosts with reason: Issue from T367019
[09:54:54] T367019: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T367019
[09:55:05] (CR) Filippo Giunchedi: [C:+2] "Thank you, fixed the comments" [docker-images/production-images] - https://gerrit.wikimedia.org/r/1041037 (owner: Filippo Giunchedi)
[09:55:08] (CR) Filippo Giunchedi: [V:+2 C:+2] statsd-exporter: bump version to upgrade [docker-images/production-images] - https://gerrit.wikimedia.org/r/1041037 (owner: Filippo Giunchedi)
[09:56:50] PROBLEM - MariaDB Replica Lag: s2 on db2138 is
CRITICAL: CRITICAL slave_sql_lag Replication lag: 2526.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:56:50] PROBLEM - MariaDB Replica Lag: s2 on db2148 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2526.60 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:56:50] PROBLEM - MariaDB Replica Lag: s2 on db2175 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2527.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:56:56] PROBLEM - MariaDB Replica Lag: s2 on db2189 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2534.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:57:03] (CR) Vgutierrez: [C:+1] depool text@drmrs before enabling IPIP encapsulation [dns] - https://gerrit.wikimedia.org/r/1039944 (https://phabricator.wikimedia.org/T366466) (owner: Fabfur)
[09:57:24] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on 26 hosts with reason: Issue from T367019
[09:57:26] PROBLEM - MariaDB Replica Lag: s2 on db2126 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2562.56 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:57:30] PROBLEM - MariaDB Replica Lag: s2 on db2125 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2566.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:57:30] PROBLEM - MariaDB Replica Lag: s2 on db2204 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2568.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:57:30] PROBLEM - MariaDB Replica Lag: s2 on db2207 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2568.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:57:39] (PS1) JMeybohm: developer-portal: add securityContext to all
containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041039 (https://phabricator.wikimedia.org/T362978)
[09:57:46] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 26 hosts with reason: Issue from T367019
[09:57:49] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[09:58:06] (CR) Vgutierrez: [C:-1] hiera: enable IPIP for high-traffic1@drmrs for text services (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1039947 (https://phabricator.wikimedia.org/T366466) (owner: Fabfur)
[09:58:20] (CR) Majavah: [C:+2] P:openstack: opentofu: fix configuration [puppet] - https://gerrit.wikimedia.org/r/1040145 (owner: Majavah)
[09:58:25] (CR) Vgutierrez: [C:+1] cache:hiera: enable IPIP on text@drmrs [puppet] - https://gerrit.wikimedia.org/r/1039948 (https://phabricator.wikimedia.org/T366466) (owner: Fabfur)
[09:59:12] SRE, Infrastructure-Foundations, netops, Patch-For-Review: Netbox network report failing - timeout errors - https://phabricator.wikimedia.org/T321704#9874622 (Volans) I've took a look today and trying to manually run all the tests there isn't anyone that takes so long to trigger the 300s timeout,...
[09:59:16] (CR) Jelto: [V:+1 C:+2] gitlab:: add blackbox check for ssh service [puppet] - https://gerrit.wikimedia.org/r/1041028 (https://phabricator.wikimedia.org/T367021) (owner: Jelto)
[09:59:21] (PS1) Arnaudb: depool: codfw [dns] - https://gerrit.wikimedia.org/r/1041041 (https://phabricator.wikimedia.org/T367019)
[10:00:05] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1000)
[10:01:08] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[10:01:47] !log cgoubert@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=mw-web-ro,name=codfw
[10:01:56] (Abandoned) Arnaudb: depool: codfw [dns] - https://gerrit.wikimedia.org/r/1041041 (https://phabricator.wikimedia.org/T367019) (owner: Arnaudb)
[10:01:58] !log cgoubert@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=mw-api-ext-ro,name=codfw
[10:02:26] (PS3) JMeybohm: toolhub: ensure all containers have securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1037165 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[10:02:26] (CR) JMeybohm: [C:+1] "Let's not wait.
I think we're good to go here" [deployment-charts] - https://gerrit.wikimedia.org/r/1037165 (https://phabricator.wikimedia.org/T362978) (owner: Scott French)
[10:02:40] !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[10:02:49] !log cgoubert@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=mw-api-int-ro,name=codfw
[10:04:25] (PS1) Majavah: P:openstack: opentofu: Allow everyone to enter the directory [puppet] - https://gerrit.wikimedia.org/r/1041042 (https://phabricator.wikimedia.org/T364458)
[10:04:26] (PS1) Majavah: P:openstack: opentofu: Do not log changes to the env file [puppet] - https://gerrit.wikimedia.org/r/1041043 (https://phabricator.wikimedia.org/T364458)
[10:05:01] (PS1) GergesShamon: [huwiki] Add "suppressredirect" user right to editor user group [mediawiki-config] - https://gerrit.wikimedia.org/r/1041044 (https://phabricator.wikimedia.org/T366438)
[10:05:31] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[10:05:48] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1014.eqiad.wmnet
[10:06:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2016.codfw.wmnet
[10:06:24] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1041043 (https://phabricator.wikimedia.org/T364458) (owner: Majavah)
[10:07:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[10:07:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[10:07:57] !log cgoubert@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=appservers-ro,name=codfw
[10:08:07] !log cgoubert@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=api-ro,name=codfw
[10:08:32] !log depooled all active/active mediawiki services from codfw
[10:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:02] (PS3) Brouberol: datahub: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978)
[10:09:19] (PS2) Fabfur: hiera: enable IPIP for high-traffic1@drmrs for text services [puppet] - https://gerrit.wikimedia.org/r/1039947 (https://phabricator.wikimedia.org/T366466)
[10:09:19] (PS2) Fabfur: cache:hiera: enable IPIP on text@drmrs [puppet] - https://gerrit.wikimedia.org/r/1039948 (https://phabricator.wikimedia.org/T366466)
[10:09:24] RECOVERY - MariaDB Replica Lag: s2 on db2126 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:09:28] RECOVERY - MariaDB Replica Lag: s2 on db2125 is OK: OK slave_sql_lag Replication lag: 0.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:09:30] RECOVERY - MariaDB Replica Lag: s2 on db2204 is OK: OK slave_sql_lag Replication lag: 0.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:09:50] RECOVERY - MariaDB Replica Lag: s2 on db2138 is OK: OK slave_sql_lag Replication lag: 0.00 seconds
https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:09:50] RECOVERY - MariaDB Replica Lag: s2 on db2148 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:09:50] RECOVERY - MariaDB Replica Lag: s2 on db2175 is OK: OK slave_sql_lag Replication lag: 0.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:09:56] RECOVERY - MariaDB Replica Lag: s2 on db2189 is OK: OK slave_sql_lag Replication lag: 0.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:10:33] (CR) Fabfur: hiera: enable IPIP for high-traffic1@drmrs for text services (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1039947 (https://phabricator.wikimedia.org/T366466) (owner: Fabfur)
[10:11:43] (PS2) Majavah: P:openstack: opentofu: Allow everyone to enter the directory [puppet] - https://gerrit.wikimedia.org/r/1041042 (https://phabricator.wikimedia.org/T365696)
[10:11:44] (PS2) Majavah: P:openstack: opentofu: Do not log changes to the env file [puppet] - https://gerrit.wikimedia.org/r/1041043 (https://phabricator.wikimedia.org/T365696)
[10:11:44] (PS1) Majavah: P:openstack: opentofu: Add a variable for region [puppet] - https://gerrit.wikimedia.org/r/1041045 (https://phabricator.wikimedia.org/T365696)
[10:11:45] RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[10:11:45] (PS1) MVernon: wmflib: add Wmflib::IP::Address::CIDR type [puppet] - https://gerrit.wikimedia.org/r/1041046
(https://phabricator.wikimedia.org/T279621)
[10:13:30] RECOVERY - MariaDB Replica Lag: s2 on db2207 is OK: OK slave_sql_lag Replication lag: 0.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:13:36] (CR) Majavah: [V:+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1041045 (https://phabricator.wikimedia.org/T365696) (owner: Majavah)
[10:15:55] (CR) Vgutierrez: [C:+1] hiera: enable IPIP for high-traffic1@drmrs for text services [puppet] - https://gerrit.wikimedia.org/r/1039947 (https://phabricator.wikimedia.org/T366466) (owner: Fabfur)
[10:17:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1014.eqiad.wmnet
[10:18:35] (CR) Brouberol: [V:+1] deployment_server: alert on admin-ng pending changes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[10:18:43] FIRING: ProbeDown: Service gitlab2002:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gitlab2002:22 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:18:58] (CR) Btullis: datahub: add securityContext to all containers (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978) (owner: Brouberol)
[10:19:29] (PS1) JMeybohm: linkrecommendation: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041049 (https://phabricator.wikimedia.org/T362978)
[10:19:38] PROBLEM - Host ml-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:21:01] !log cgoubert@cumin1002 conftool action : set/pooled=true; selector:
dnsdisc=mw-web-ro,name=codfw
[10:21:09] !log cgoubert@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=mw-api-ext-ro,name=codfw
[10:21:15] !log cgoubert@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=mw-api-int-ro,name=codfw
[10:21:22] !log cgoubert@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=appservers-ro,name=codfw
[10:21:22] PROBLEM - SSH on dse-k8s-etcd1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:21:29] !log cgoubert@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=api-ro,name=codfw
[10:21:42] !log repooled all active/active mediawiki services from codfw
[10:21:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:13] (PS4) Brouberol: datahub: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978)
[10:22:50] (CR) Ilias Sarantopoulos: [C:+2] ml-services: use envoy proxy in ores-legacy staging [deployment-charts] - https://gerrit.wikimedia.org/r/1040196 (https://phabricator.wikimedia.org/T366801) (owner: Ilias Sarantopoulos)
[10:22:51] (CR) CI reject: [V:-1] datahub: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978) (owner: Brouberol)
[10:23:44] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1014.eqiad.wmnet
[10:23:52] (PS5) Brouberol: datahub: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978)
[10:24:03] (Merged) jenkins-bot: ml-services: use envoy proxy in ores-legacy staging [deployment-charts] - https://gerrit.wikimedia.org/r/1040196 (https://phabricator.wikimedia.org/T366801) (owner: Ilias Sarantopoulos)
[10:24:13] !log jmm@cumin2002 END
(PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1014.eqiad.wmnet
[10:24:29] (CR) CI reject: [V:-1] datahub: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978) (owner: Brouberol)
[10:25:12] !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2204 T367019', diff saved to https://phabricator.wikimedia.org/P64518 and previous config saved to /var/cache/conftool/dbconfig/20240610-102511-arnaudb.json
[10:25:15] T367019: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T367019
[10:25:23] !log isaranto@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[10:25:32] RECOVERY - Host ml-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms
[10:26:07] (CR) Majavah: [C:+1] wmflib: add Wmflib::IP::Address::CIDR type [puppet] - https://gerrit.wikimedia.org/r/1041046 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[10:26:14] RECOVERY - SSH on dse-k8s-etcd1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:26:34] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2016.codfw.wmnet
[10:27:09] !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2204.codfw.wmnet with reason: Maintenance
[10:27:11] !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2204.codfw.wmnet with reason: Maintenance
[10:28:30] PROBLEM - Host ml-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100%
[10:31:11] (CR) Btullis: deployment_server: alert on admin-ng pending changes (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[10:34:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host
ganeti2016.codfw.wmnet
[10:34:43] (PS1) Jelto: gitlab: use IPv4 for SSH check temporarily [puppet] - https://gerrit.wikimedia.org/r/1041051 (https://phabricator.wikimedia.org/T367021)
[10:34:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2016.codfw.wmnet
[10:35:30] RECOVERY - Host ml-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms
[10:38:53] (CR) EoghanGaffney: [C:+1] gitlab: use IPv4 for SSH check temporarily [puppet] - https://gerrit.wikimedia.org/r/1041051 (https://phabricator.wikimedia.org/T367021) (owner: Jelto)
[10:39:31] (CR) Jelto: [C:+2] gitlab: use IPv4 for SSH check temporarily [puppet] - https://gerrit.wikimedia.org/r/1041051 (https://phabricator.wikimedia.org/T367021) (owner: Jelto)
[10:40:31] (CR) Fabfur: [C:+2] depool text@drmrs before enabling IPIP encapsulation [dns] - https://gerrit.wikimedia.org/r/1039944 (https://phabricator.wikimedia.org/T366466) (owner: Fabfur)
[10:41:13] !log depooling text@drmrs to apply IPIP encapsulation patches (T366466)
[10:41:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:16] T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466
[10:41:28] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudgw1002.eqiad.wmnet
[10:43:03] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 1%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64519 and previous config saved to /var/cache/conftool/dbconfig/20240610-104303-arnaudb.json
[10:45:03] (PS4) Giuseppe Lavagetto: Improve ability to override php-fpm configuration in kubernetes [docker-images/production-images] - https://gerrit.wikimedia.org/r/1039642
[10:45:29] (PS1) JMeybohm: machinetranslation: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041055
(https://phabricator.wikimedia.org/T362978)
[10:46:45] (CR) Giuseppe Lavagetto: [V:+2 C:+2] Improve ability to override php-fpm configuration in kubernetes [docker-images/production-images] - https://gerrit.wikimedia.org/r/1039642 (owner: Giuseppe Lavagetto)
[10:47:17] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1002.eqiad.wmnet
[10:48:26] <_joe_> jouncebot: nowandnext
[10:48:26] For the next 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1000)
[10:48:26] In 2 hour(s) and 11 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1300)
[10:49:00] <_joe_> I will probably need to extend a bit beyond the limits I should normally have to use here
[10:49:09] <_joe_> in terms of deployment window
[10:53:54] !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet
[10:54:40] !log disabling puppet on A:cp-text to enable https://gerrit.wikimedia.org/r/c/operations/puppet/+/1039948 selectively (T366466)
[10:55:58] <_joe_> !log published updated php-fpm-multiversion-base,prometheus-statsd-exporter images
[10:57:01] !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet
[10:57:21] (CR) Fabfur: [C:+2] cache:hiera: enable IPIP on text@drmrs [puppet] - https://gerrit.wikimedia.org/r/1039948 (https://phabricator.wikimedia.org/T366466) (owner: Fabfur)
[10:58:04] (CR) Clément Goubert: [C:+1] service: set similar-users to lvs_setup [puppet] - https://gerrit.wikimedia.org/r/1014499 (https://phabricator.wikimedia.org/T345274) (owner: Hnowlan)
[10:58:09] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 2%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64520 and previous config saved to
/var/cache/conftool/dbconfig/20240610-105809-arnaudb.json
[10:58:41] (PS1) Giuseppe Lavagetto: common_images: update statsd-exporter version [puppet] - https://gerrit.wikimedia.org/r/1041058
[10:59:34] !log disabled puppet on A:lvs-drmrs to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1039947 (T366466)
[10:59:58] (CR) Fabfur: [C:+2] hiera: enable IPIP for high-traffic1@drmrs for text services [puppet] - https://gerrit.wikimedia.org/r/1039947 (https://phabricator.wikimedia.org/T366466) (owner: Fabfur)
[11:03:33] (CR) Giuseppe Lavagetto: [C:+2] common_images: update statsd-exporter version [puppet] - https://gerrit.wikimedia.org/r/1041058 (owner: Giuseppe Lavagetto)
[11:04:10] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T364069)', diff saved to https://phabricator.wikimedia.org/P64521 and previous config saved to /var/cache/conftool/dbconfig/20240610-110409-marostegui.json
[11:04:10] <_joe_> fabfur: can I merge your changes?
[11:04:46] <_joe_> fabfur: ping
[11:04:46] yes, I was going to but it's locked (by you)
[11:04:47] thens
[11:04:49] thanks
[11:04:56] <_joe_> done
[11:05:06] ack
[11:06:22] (CR) Majavah: [C:+2] P:openstack: opentofu: Allow everyone to enter the directory [puppet] - https://gerrit.wikimedia.org/r/1041042 (https://phabricator.wikimedia.org/T365696) (owner: Majavah)
[11:06:30] (CR) Majavah: [C:+2] P:openstack: opentofu: Do not log changes to the env file [puppet] - https://gerrit.wikimedia.org/r/1041043 (https://phabricator.wikimedia.org/T365696) (owner: Majavah)
[11:09:06] !log tests looks good, enabling && running puppet on A:cp-text to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1039948 (on drmrs) (T366466)
[11:09:16] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2017.codfw.wmnet
[11:09:36] (PS2) Hnowlan: wmnet: remove similar-users [dns] - https://gerrit.wikimedia.org/r/1014495 (https://phabricator.wikimedia.org/T345274)
[11:09:37] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1015.eqiad.wmnet
[11:11:13] (PS1) Giuseppe Lavagetto: mwdebug: allow dumping more variables for debugging [deployment-charts] - https://gerrit.wikimedia.org/r/1041060
[11:12:36] (CR) Giuseppe Lavagetto: [C:+2] mwdebug: allow dumping more variables for debugging [deployment-charts] - https://gerrit.wikimedia.org/r/1041060 (owner: Giuseppe Lavagetto)
[11:12:54] (PS5) Ebrahim: errorpage: Add dark mode support to error page [puppet] - https://gerrit.wikimedia.org/r/1040809
[11:12:57] (CR) Ladsgroup: [C:+2] errorpage: Add dark mode support to error page [puppet] - https://gerrit.wikimedia.org/r/1040809 (owner: Ebrahim)
[11:13:03] (CR) Ladsgroup: [V:+2 C:+2] errorpage: Add dark mode support to error page [puppet] - https://gerrit.wikimedia.org/r/1040809 (owner: Ebrahim)
[11:13:16] !log arnaudb@cumin1002 dbctl commit (dc=all):
'db2204 (re)pooling @ 5%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64522 and previous config saved to /var/cache/conftool/dbconfig/20240610-111315-arnaudb.json [11:16:30] PROBLEM - OpenSearch health check for shards on 9200 on logstash1032 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f3fccff3280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.w [11:16:30] org/wiki/Search%23Administration [11:17:08] (03PS2) 10Giuseppe Lavagetto: mwdebug: allow dumping more variables for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041060 [11:17:19] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] mwdebug: allow dumping more variables for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041060 (owner: 10Giuseppe Lavagetto) [11:18:19] !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply [11:18:30] RECOVERY - OpenSearch health check for shards on 9200 on logstash1032 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 756, active_shards: 1774, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_sha [11:18:30] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [11:18:43] RESOLVED: ProbeDown: Service gitlab2002:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gitlab2002:22 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [11:19:09] !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply [11:19:17] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [11:19:19] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P64523 and previous config saved to /var/cache/conftool/dbconfig/20240610-111917-marostegui.json [11:19:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1015.eqiad.wmnet [11:19:42] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply [11:22:17] (03PS7) 10Brouberol: datahub: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978) [11:23:40] !log oblivian@deploy1002 Locking from deployment [ALL REPOSITORIES]: setting global lock while working on mw-on-k8s --joe. Ping me if you need urgent deployments [11:24:21] (03CR) 10Ladsgroup: [C:03+1] "I'm merging this but it won't be deployed until we restart sanitarium hosts. That's going to take a while. 
There is a ticket for improving" [puppet] - 10https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825) (owner: 10Zabe) [11:24:28] (03PS3) 10Zabe: hieradata: Add arbcom_itwiki to private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825) [11:24:30] (03CR) 10Ladsgroup: [C:03+2] hieradata: Add arbcom_itwiki to private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825) (owner: 10Zabe) [11:24:33] (03CR) 10Ladsgroup: [V:03+2 C:03+2] hieradata: Add arbcom_itwiki to private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825) (owner: 10Zabe) [11:25:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1015.eqiad.wmnet [11:25:43] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1015.eqiad.wmnet [11:26:02] !log enabling && running puppet on A:lvs-drmrs to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1039947 (T366466) [11:26:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:06] T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466 [11:26:10] (03CR) 10Btullis: [C:03+1] "Looks good to me, thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol) [11:27:55] (03CR) 10Brouberol: [V:03+1] deployment_server: alert on admin-ng pending changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [11:28:10] (03CR) 10Brouberol: [C:03+2] datahub: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol) [11:28:17] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2017.codfw.wmnet [11:28:23] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 10%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64524 and previous config saved to /var/cache/conftool/dbconfig/20240610-112821-arnaudb.json [11:28:32] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1016.eqiad.wmnet [11:29:04] (03Merged) 10jenkins-bot: datahub: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol) [11:29:40] !log restarting pybal on lvs6003,lvs6001 to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1039947 (T366466) [11:29:43] 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Netbox network report failing - timeout errors - https://phabricator.wikimedia.org/T321704#9874878 (10cmooney) >>! In T321704#9874622, @Volans wrote: > I've took a look today and trying to manually run all the tests there isn't anyone that... 
[11:29:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:31:47] (03PS1) 10Clément Goubert: weekly-update.sh: Actually skip spark images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1041063 [11:32:14] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [11:34:02] !log oblivian@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: setting global lock while working on mw-on-k8s --joe. Ping me if you need urgent deployments (duration: 10m 22s) [11:34:27] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P64525 and previous config saved to /var/cache/conftool/dbconfig/20240610-113426-marostegui.json [11:34:42] (03CR) 10Hnowlan: [C:03+1] weekly-update.sh: Actually skip spark images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1041063 (owner: 10Clément Goubert) [11:34:52] !log oblivian@deploy1002 Started scap: Deploying change to base mediawiki image [11:35:02] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [11:36:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2017.codfw.wmnet [11:36:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2017.codfw.wmnet [11:36:32] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production [11:36:59] (03CR) 10Clément Goubert: [V:03+2 C:03+2] weekly-update.sh: Actually skip spark images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1041063 (owner: 10Clément Goubert) [11:39:16] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production [11:39:38] (03CR) 10Clément Goubert: 
[C:03+1] Upgrade base OS to Debian bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1039778 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan) [11:41:47] (03PS1) 10JMeybohm: mcrouter: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041065 (https://phabricator.wikimedia.org/T362978) [11:42:25] (03CR) 10CI reject: [V:04-1] mcrouter: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041065 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [11:42:57] (03CR) 10Clément Goubert: [C:03+2] sre.k8s.reboot-nodes: Add exclude option [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert) [11:43:02] (03PS1) 10Fabfur: Revert "depool text@drmrs before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1041066 [11:43:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1016.eqiad.wmnet [11:43:29] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 25%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64526 and previous config saved to /var/cache/conftool/dbconfig/20240610-114329-arnaudb.json [11:43:35] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2018.codfw.wmnet [11:43:55] (03CR) 10Clément Goubert: [C:03+2] sre.k8s.reboot-nodes: Add exclude option (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert) [11:44:41] (03CR) 10Hnowlan: [C:03+2] Upgrade base OS to Debian bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1039778 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan) [11:44:44] !log oblivian@deploy1002 sync-world aborted: Deploying change to base mediawiki image (duration: 10m 21s) [11:44:59] (03PS2) 10JMeybohm: mcrouter: add securityContext to all containers [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1041065 (https://phabricator.wikimedia.org/T362978) [11:45:23] PROBLEM - Host kubetcd1006 is DOWN: PING CRITICAL - Packet loss = 100% [11:45:46] (03PS3) 10JMeybohm: mcrouter: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041065 (https://phabricator.wikimedia.org/T362978) [11:46:41] (03PS2) 10Majavah: P:openstack: opentofu: Add a variable for region [puppet] - 10https://gerrit.wikimedia.org/r/1041045 (https://phabricator.wikimedia.org/T365696) [11:46:41] (03PS1) 10Majavah: P:openstack: opentofu: Add a diff job to catch unapplied changes [puppet] - 10https://gerrit.wikimedia.org/r/1041069 [11:46:49] (03Merged) 10jenkins-bot: sre.k8s.reboot-nodes: Add exclude option [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert) [11:47:14] (03PS1) 10Brouberol: superset: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041068 (https://phabricator.wikimedia.org/T346638) [11:48:51] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1016.eqiad.wmnet [11:49:19] (03PS2) 10Majavah: P:openstack: opentofu: Add a diff job to catch unapplied changes [puppet] - 10https://gerrit.wikimedia.org/r/1041069 [11:49:24] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1016.eqiad.wmnet [11:49:26] (03Abandoned) 10JMeybohm: mcrouter: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041065 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [11:49:35] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T364069)', diff saved to https://phabricator.wikimedia.org/P64527 and previous config saved to /var/cache/conftool/dbconfig/20240610-114934-marostegui.json [11:49:37] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
db1180.eqiad.wmnet with reason: Maintenance [11:49:38] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [11:49:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance [11:49:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T364069)', diff saved to https://phabricator.wikimedia.org/P64528 and previous config saved to /var/cache/conftool/dbconfig/20240610-114957-marostegui.json [11:50:25] RECOVERY - Host kubetcd1006 is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms [11:50:35] 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9874958 (10Ladsgroup) Again, comparing apples and oranges. They requested a mailing list for a project. Not a Wikimedia Hub. I will create this under type of project. [11:50:50] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1041069 (owner: 10Majavah) [11:52:35] (03PS4) 10JMeybohm: mcrouter: Bump chart modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038859 (https://phabricator.wikimedia.org/T362978) (owner: 10Alexandros Kosiaris) [11:53:36] !log oblivian@deploy1002 Started scap: Deploying change to base mediawiki image (take 2) [11:55:10] (03CR) 10Filippo Giunchedi: [C:03+2] webperf: don't hardcode php version [puppet] - 10https://gerrit.wikimedia.org/r/1039974 (https://phabricator.wikimedia.org/T353912) (owner: 10Filippo Giunchedi) [11:55:16] (03Merged) 10jenkins-bot: Upgrade base OS to Debian bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1039778 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan) [11:56:16] (03CR) 10JMeybohm: "Updated the fixture to match the changed values. 
Also add Bug tag to T362978, as this adds securityContext as well." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038859 (https://phabricator.wikimedia.org/T362978) (owner: 10Alexandros Kosiaris) [11:56:37] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2018.codfw.wmnet [11:57:47] (03CR) 10Hashar: [C:03+1] "What Paladox, that is due to an update in the Soy templating engine." [puppet] - 10https://gerrit.wikimedia.org/r/1037765 (owner: 10Paladox) [11:58:32] (03CR) 10Majavah: [C:03+2] gerrit: fix "its" templates for 3.9 [puppet] - 10https://gerrit.wikimedia.org/r/1037765 (owner: 10Paladox) [11:58:35] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 50%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64530 and previous config saved to /var/cache/conftool/dbconfig/20240610-115834-arnaudb.json [12:00:05] (03PS5) 10Brouberol: spark-history: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041070 (https://phabricator.wikimedia.org/T362978) [12:00:53] (03PS1) 10Brouberol: echoserver: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041071 (https://phabricator.wikimedia.org/T362978) [12:02:29] (03PS1) 10JMeybohm: python-webapp: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041072 (https://phabricator.wikimedia.org/T362978) [12:04:49] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2018.codfw.wmnet [12:04:57] (03CR) 10Filippo Giunchedi: [V:03+1] k8s: send logs to per-cluster kafka topics (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1040170 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi) [12:05:08] (03PS3) 10Filippo Giunchedi: k8s: send logs to per-cluster kafka topics [puppet] - 10https://gerrit.wikimedia.org/r/1040170 (https://phabricator.wikimedia.org/T366710) [12:05:52] 
!log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2018.codfw.wmnet [12:07:05] (03PS2) 10JMeybohm: python-webapp: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041072 (https://phabricator.wikimedia.org/T362978) [12:11:35] (03PS1) 10JMeybohm: calculator-service: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041076 (https://phabricator.wikimedia.org/T362978) [12:13:32] (03PS6) 10Majavah: wikitech: Stop loading OpenStackManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038750 (https://phabricator.wikimedia.org/T161553) [12:13:41] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 75%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64531 and previous config saved to /var/cache/conftool/dbconfig/20240610-121341-arnaudb.json [12:15:00] (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1041077 (owner: 10L10n-bot) [12:15:40] !log oblivian@deploy1002 Finished scap: Deploying change to base mediawiki image (take 2) (duration: 22m 39s) [12:20:36] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet [12:21:03] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1017.eqiad.wmnet [12:21:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet [12:22:33] (03CR) 10JMeybohm: "This is just a demo chart, it is not deployed anywhere" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041076 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [12:24:20] (03PS3) 10Awight: Revert "Temporary monitoring for scraper" [puppet] - 10https://gerrit.wikimedia.org/r/1040075 (https://phabricator.wikimedia.org/T366144) [12:24:21] (03CR) 10Awight: [C:03+1] "Can be merged safely. Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1040075 (https://phabricator.wikimedia.org/T366144) (owner: 10Awight) [12:25:10] 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9875114 (10Ladsgroup) Overall looks good. Just noting that rebuilding index will take a very long time and that can make the downtime quite... 
[12:28:04] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet [12:28:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet [12:28:49] !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 100%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64532 and previous config saved to /var/cache/conftool/dbconfig/20240610-122847-arnaudb.json [12:30:22] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet [12:32:23] (03PS1) 10Brouberol: datahub: set distinct ES index prefix between staging and prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041088 [12:32:35] PROBLEM - OpenSearch health check for shards on 9200 on logstash1032 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fb6d79d9280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.w [12:32:35] org/wiki/Search%23Administration [12:33:29] 06SRE, 06Infrastructure-Foundations, 10netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9875143 (10cmooney) [12:34:33] RECOVERY - OpenSearch health check for shards on 9200 on logstash1032 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 756, active_shards: 1774, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_sha [12:34:33] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, 
task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [12:34:35] (03CR) 10Elukey: [C:03+2] admin_ng: update Bookworm-based Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040153 (https://phabricator.wikimedia.org/T356252) (owner: 10Elukey) [12:35:26] (03Abandoned) 10Nikerabbit: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1039693 (owner: 10L10n-bot) [12:35:59] (03CR) 10Nikerabbit: [V:03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1041077 (owner: 10L10n-bot) [12:36:28] (03CR) 10Vgutierrez: [C:04-1] Automated MarkMonitor domain sync (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1039849 (owner: 10Ncmonitor) [12:36:44] (03CR) 10Filippo Giunchedi: [C:03+2] Revert "Temporary monitoring for scraper" [puppet] - 10https://gerrit.wikimedia.org/r/1040075 (https://phabricator.wikimedia.org/T366144) (owner: 10Awight) [12:37:24] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1017.eqiad.wmnet [12:37:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet [12:37:46] (03CR) 10Vgutierrez: [C:04-1] "jumping from 7 certs to 20 is definitely too much IMHO, we should split this one in several CRs to be merged at different times (so we don" [puppet] - 10https://gerrit.wikimedia.org/r/1039849 (owner: 10Ncmonitor) [12:39:33] PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift [12:39:43] PROBLEM - Host kubetcd2006 is DOWN: PING CRITICAL - Packet loss = 100% [12:39:43] PROBLEM - Host ml-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100% [12:40:07] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'. 
[12:40:23] RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.563 second response time https://wikitech.wikimedia.org/wiki/Swift [12:40:47] (03CR) 10Majavah: [C:04-1] "This patch includes quite a few WMCS domains that are either delegated to openstack (so won't issue certs at all on the wikiprod acme-chie" [puppet] - 10https://gerrit.wikimedia.org/r/1039849 (owner: 10Ncmonitor) [12:41:04] !log elukey@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'. [12:41:14] jouncebot: next [12:41:14] In 0 hour(s) and 18 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1300) [12:43:01] !log elukey@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. [12:43:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1017.eqiad.wmnet [12:43:31] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1017.eqiad.wmnet [12:43:50] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet [12:44:09] !log elukey@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [12:44:12] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet [12:45:29] RECOVERY - Host kubetcd2006 is UP: PING OK - Packet loss = 0%, RTA = 30.53 ms [12:45:41] RECOVERY - Host ml-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.66 ms [12:45:44] (03PS1) 10Ebrahim: errorpages: Add dark mode support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041091 [12:45:52] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2021.codfw.wmnet [12:46:03] !log elukey@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. 
[12:46:20] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1018.eqiad.wmnet [12:46:40] (03CR) 10Stevemunene: [C:03+1] datahub: set distinct ES index prefix between staging and prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041088 (owner: 10Brouberol) [12:48:32] !log elukey@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [12:48:57] FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:49:18] checking [12:49:18] (03CR) 10Brouberol: [C:03+2] datahub: set distinct ES index prefix between staging and prod [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041088 (owner: 10Brouberol) [12:49:36] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [12:49:42] hnowlan: thumbor is kaput [12:50:05] for now looks like a blip, should be recovering [12:50:31] maybe we should bump the replicas? [12:50:33] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging [12:50:46] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [12:51:10] godog: I don't think it's a blip https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&refresh=1m&viewPanel=93 [12:51:26] Amir1, godog - there was a blip for ms-be2014, maybe related? It seems codfw right? 
[12:51:35] o/ [12:51:37] 06SRE, 10LDAP-Access-Requests: Grant Access to for - https://phabricator.wikimedia.org/T367053 (10gonyeahialam) 03NEW [12:51:43] elukey: codfw yeah [12:52:16] Amir1: mmhh I'm wondering how laggy that metric is, I'm looking at the network probes https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fservice&var-module=All&orgId=1&from=now-1h&to=now [12:52:34] yeah, it's actually recovering [12:52:39] elukey: could be, though a single host shouldn't affect things very much [12:52:46] yep yep [12:53:18] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Gonyeahialam - https://phabricator.wikimedia.org/T367053#9875202 (10gonyeahialam) [12:53:24] and I got the name wrong, it was ms-fe2014 [12:53:26] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging [12:53:57] RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:54:06] hnowlan: sorry pinged too soon :D [12:54:12] from https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=ms-fe2014&var-datasource=thanos&var-cluster=swift it seems that something happened at around 11:40 UTC [12:54:42] (03CR) 10Fabfur: [C:03+2] Revert "depool text@drmrs before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1041066 (owner: 10Fabfur) [12:55:00] https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=ms-fe2014&var-datasource=thanos&var-cluster=swift&viewPanel=13 ouch [12:55:10] 06SRE, 10LDAP-Access-Requests: Grant Access to wmf for Gonyeahialam - https://phabricator.wikimedia.org/T367053#9875204 (10gonyeahialam) [12:55:14] !log repooling text@drmrs (IPIP encapsulation enabled) 
(T366466) [12:55:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:17] T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466 [12:57:14] Amir1: yep it is weird that it doesn't happen in the previous 7 days [12:57:20] so seems quite weird [12:58:05] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [12:58:10] (03PS1) 10Gerrit maintenance bot: mariadb: Promote es1037 to es6 master [puppet] - 10https://gerrit.wikimedia.org/r/1041092 (https://phabricator.wikimedia.org/T367055) [12:58:15] (03PS1) 10Gerrit maintenance bot: wmnet: Update es6-master alias [dns] - 10https://gerrit.wikimedia.org/r/1041093 (https://phabricator.wikimedia.org/T367055) [12:58:16] Emperor: ^ [12:58:19] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [12:59:11] what am I being pinged about, sorry? [13:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1300). [13:00:05] Gerges: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[13:00:08] Emperor: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=ms-fe2014&var-datasource=thanos&var-cluster=swift [13:00:16] Hi [13:00:23] this has triggered a page [13:00:29] Emperor: there was a page earlier on :) [13:00:30] (03PS1) 10Cmelo: Enable CampaignEvents on swahili wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502) [13:00:31] (it got resolved) [13:00:50] but it's worth taking a look https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=ms-fe2014&var-datasource=thanos&var-cluster=swift&viewPanel=13 [13:01:50] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'. [13:01:57] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'. [13:02:04] jouncebot: now [13:02:04] For the next 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1300) [13:02:12] jouncebot: next [13:02:12] In 2 hour(s) and 27 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1530) [13:02:45] I am going to stop my deployments to wikikube for the moment [13:03:14] !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'. [13:03:17] Gerges: let me check and deploy [13:03:19] (03PS1) 10Cmelo: Enable CampaignEvents on swahili wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041096 (https://phabricator.wikimedia.org/T366502) [13:03:25] (03CR) 10Vgutierrez: [C:04-1] "`" [dns] - 10https://gerrit.wikimedia.org/r/1040335 (owner: 10Ncmonitor) [13:03:43] !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'. [13:03:45] Ok [13:04:02] !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'. 
[13:04:08] (03CR) 10Ladsgroup: [C:03+2] [huwiki] Add "suppressredirect" user right to editor user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041044 (https://phabricator.wikimedia.org/T366438) (owner: 10GergesShamon) [13:04:23] (03CR) 10Ottomata: "Nice!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [13:04:24] Amir1: similar pattern seen with e.g. ms-fe2013 too, which didn't result in a spike in errors [13:04:27] (03CR) 10Vgutierrez: [C:04-1] "those should be mentioned on my review of the DNS related change: https://gerrit.wikimedia.org/r/c/operations/dns/+/1040335/comments/d97a9" [puppet] - 10https://gerrit.wikimedia.org/r/1039849 (owner: 10Ncmonitor) [13:04:32] !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'. [13:04:43] (03CR) 10TrainBranchBot: [C:03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041044 (https://phabricator.wikimedia.org/T366438) (owner: 10GergesShamon) [13:04:50] (03Merged) 10jenkins-bot: [huwiki] Add "suppressredirect" user right to editor user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041044 (https://phabricator.wikimedia.org/T366438) (owner: 10GergesShamon) [13:05:08] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1041044|[huwiki] Add "suppressredirect" user right to editor user group (T366438)]] [13:05:14] T366438: Grant "suppressredirect" to editor on huwiki - https://phabricator.wikimedia.org/T366438 [13:06:03] Amir1: if you look at tcp retransmits, there's a similar rise in all of codfw swift frontends starting around 11:40 UTC today [13:06:40] Amir1: e.g. 
https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=ms-fe2013&var-datasource=thanos&var-cluster=swift&viewPanel=31&from=1717419989138&to=1718024789138 [13:07:05] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1018.eqiad.wmnet [13:07:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2021.codfw.wmnet [13:07:55] !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'. [13:07:56] (03PS1) 10Brouberol: datahub: don't use an ES index prefix for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041098 [13:08:17] elukey: ping when you are done, I would like to perform some reboots [13:08:17] Gerges: it is available on the mwdebug servers https://wikitech.wikimedia.org/wiki/Mwdebug [13:08:20] !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'. [13:08:25] !log ladsgroup@deploy1002 ladsgroup and gergesshamon: Backport for [[gerrit:1041044|[huwiki] Add "suppressredirect" user right to editor user group (T366438)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [13:08:37] (03CR) 10Btullis: [C:03+1] "Looks good, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041098 (owner: 10Brouberol) [13:08:57] effie: o/ I am waiting for the deploy window to close before proceeding, my deploys should take ~5 mins afterwards [13:09:34] (03CR) 10Btullis: [C:03+1] "Nice, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041068 (https://phabricator.wikimedia.org/T346638) (owner: 10Brouberol) [13:09:39] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. 
[13:09:39] !log rebooting cp4047 (T366555) [13:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:47] !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4047.ulsfo.wmnet [13:09:49] cool cool, I am queueing behind you then :p [13:09:57] (03CR) 10Brouberol: [C:03+2] datahub: don't use an ES index prefix for production [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041098 (owner: 10Brouberol) [13:10:05] !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [13:10:07] (03PS2) 10Cmelo: Enable CampaignEvents on swahili wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502) [13:10:08] !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp4047.ulsfo.wmnet [13:10:35] Amir1: I checked mwdebug, and everything is fine [13:10:42] Amir1: I don't think it's NIC saturation (cf https://w.wiki/5$CU ) [13:10:46] (03PS4) 10Majavah: service: update labweb/cloudweb conftool pool name [puppet] - 10https://gerrit.wikimedia.org/r/941459 (https://phabricator.wikimedia.org/T317463) [13:10:46] (03PS3) 10Majavah: conftool-data: drop labweb pool [puppet] - 10https://gerrit.wikimedia.org/r/941460 (https://phabricator.wikimedia.org/T317463) [13:11:05] Gerges: awesome [13:11:10] !log ladsgroup@deploy1002 ladsgroup and gergesshamon: Continuing with sync [13:11:33] FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:11:39] Emperor: I'd say let's create a ticket and investigate [13:11:40] (03CR) 10Majavah: [C:03+2] service: update labweb/cloudweb conftool pool name [puppet] - 
10https://gerrit.wikimedia.org/r/941459 (https://phabricator.wikimedia.org/T317463) (owner: 10Majavah) [13:11:48] !log restarting eqiad low-traffic LVS for https://gerrit.wikimedia.org/r/c/operations/puppet/+/941459 [13:11:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:12:10] (03CR) 10Btullis: [C:03+1] "Thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041070 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol) [13:12:53] (03CR) 10Daimona Eaytoy: [C:04-1] Enable CampaignEvents on swahili wikipedia (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502) (owner: 10Cmelo) [13:12:59] (03PS9) 10Vgutierrez: lvs::realserver::ipip: Provide ferm MSS clamping support [puppet] - 10https://gerrit.wikimedia.org/r/1035724 (https://phabricator.wikimedia.org/T365689) [13:13:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1018.eqiad.wmnet [13:13:11] (03CR) 10Daimona Eaytoy: [C:03+1] Enable CampaignEvents on swahili wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041096 (https://phabricator.wikimedia.org/T366502) (owner: 10Cmelo) [13:13:15] (03CR) 10Btullis: [C:03+1] "Nice, thanks." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1041071 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol) [13:13:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1018.eqiad.wmnet [13:13:20] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2021.codfw.wmnet [13:13:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2021.codfw.wmnet [13:13:47] (03CR) 10Brouberol: [C:03+2] spark-history: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041070 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol) [13:14:38] Amir1: has it been deployed? [13:15:22] Gerges: not yet [13:15:42] 80% [13:15:50] (03CR) 10Vgutierrez: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1035724 (https://phabricator.wikimedia.org/T365689) (owner: 10Vgutierrez) [13:16:33] RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:16:46] Amir1: Is there some way to see the output that you see during the deploy process? [13:16:53] nope [13:17:06] eventually, one day [13:17:21] OK [13:17:50] (03CR) 10Elukey: "Definitely yes, otherwise it looks good! 
It also avoids me to re-build these images for security upgrades, please make sure the new images" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [13:18:07] !log taavi@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad or A:lvs-low-traffic-eqiad and A:lvs [13:18:34] !log taavi@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.restart-pybal (exit_code=99) rolling-restart of pybal on A:lvs-secondary-eqiad or A:lvs-low-traffic-eqiad and A:lvs [13:18:57] (03PS1) 10Elukey: services: update changeprop's Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041105 (https://phabricator.wikimedia.org/T356252) [13:19:34] !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4047.ulsfo.wmnet [13:20:09] 06SRE, 10SRE-swift-storage, 06Traffic: Rise in ms-fe2* TCP retransmits since 11:40 UTC today - https://phabricator.wikimedia.org/T367056 (10MatthewVernon) 03NEW [13:20:10] (03CR) 10Elukey: [C:03+2] services: update the rec-api's Docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1018717 (https://phabricator.wikimedia.org/T205870) (owner: 10Elukey) [13:20:10] Amir1: opened T367056 [13:20:13] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1041044|[huwiki] Add "suppressredirect" user right to editor user group (T366438)]] (duration: 15m 05s) [13:20:13] T367056: Rise in ms-fe2* TCP retransmits since 11:40 UTC today - https://phabricator.wikimedia.org/T367056 [13:20:15] thanks [13:20:20] T366438: Grant "suppressredirect" to editor on huwiki - https://phabricator.wikimedia.org/T366438 [13:20:36] Gerges: done [13:20:43] https://www.irccloud.com/pastebin/KPkXcFPO/ [13:20:49] Thanks [13:20:56] one of hosts failed to restart [13:21:35] this might be related to taavi's change I think [13:23:59] Amir1: yeah, probably, sorry about that. 
do you want me to manually restart that or did you do that already? [13:25:37] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/recommendation-api: sync [13:25:52] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/recommendation-api: sync [13:25:58] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2203.codfw.wmnet with reason: Maintenance [13:26:01] (we're debugging why the cookbook failed in -traffic) [13:26:11] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2203.codfw.wmnet with reason: Maintenance [13:26:20] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2203 (T352010)', diff saved to https://phabricator.wikimedia.org/P64534 and previous config saved to /var/cache/conftool/dbconfig/20240610-132619-ladsgroup.json [13:26:23] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [13:26:24] (03PS1) 10Brouberol: spark-history: restore the ability to get env variables from configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041106 [13:26:50] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/recommendation-api: sync [13:27:11] (03PS1) 10Arnaudb: dbconfig: remove cluster30/es6 to switchmaster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041107 (https://phabricator.wikimedia.org/T367055) [13:27:16] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: sync [13:27:47] (03CR) 10Btullis: [C:03+1] spark-history: restore the ability to get env variables from configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041106 (owner: 10Brouberol) [13:28:05] taavi: I have to go to meeting, if you restart it, I'd be grateful [13:28:09] will do [13:28:24] (03CR) 10Brouberol: [C:03+2] spark-history: restore the ability to get env variables from configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041106 (owner: 
10Brouberol) [13:28:48] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Long schema change [13:28:50] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Long schema change [13:29:24] !log taavi@mw1447 ~ $ sudo /usr/local/sbin/restart-php-fpm-all php7.4-fpm 9223372036854775807 # leftover from me restarting LVS during deployment [13:29:25] (03Merged) 10jenkins-bot: spark-history: restore the ability to get env variables from configmap [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041106 (owner: 10Brouberol) [13:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:29:39] (03CR) 10Marostegui: "Let's make the commit a bit more clear: this is to temporary disable writes on es6." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041107 (https://phabricator.wikimedia.org/T367055) (owner: 10Arnaudb) [13:29:39] (03PS1) 10Filippo Giunchedi: hieradata: remove thanos-query settings from thanos::frontend [puppet] - 10https://gerrit.wikimedia.org/r/1041110 (https://phabricator.wikimedia.org/T357747) [13:29:41] (03PS1) 10Filippo Giunchedi: titan: trim 5m retention to 70w [puppet] - 10https://gerrit.wikimedia.org/r/1041111 (https://phabricator.wikimedia.org/T357747) [13:30:05] (03PS1) 10Ssingh: restart-pybal: increase timeout and retries for spicerack.requests_session [cookbooks] - 10https://gerrit.wikimedia.org/r/1041112 [13:30:33] !log dbmaint codfw s4 deploy schema change on db2140 T364069 [13:30:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:37] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [13:31:06] (03PS2) 10Arnaudb: dbconfig: temporary disable writes on es6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041107 (https://phabricator.wikimedia.org/T367055) [13:31:11] (03CR) 10Filippo Giunchedi: [V:03+1] 
"PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2838/co" [puppet] - 10https://gerrit.wikimedia.org/r/1041110 (https://phabricator.wikimedia.org/T357747) (owner: 10Filippo Giunchedi) [13:31:30] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2839/co" [puppet] - 10https://gerrit.wikimedia.org/r/1041111 (https://phabricator.wikimedia.org/T357747) (owner: 10Filippo Giunchedi) [13:32:28] (03CR) 10Majavah: [C:03+1] restart-pybal: increase timeout and retries for spicerack.requests_session [cookbooks] - 10https://gerrit.wikimedia.org/r/1041112 (owner: 10Ssingh) [13:32:32] (03CR) 10FNegri: [C:03+1] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/1041069 (owner: 10Majavah) [13:33:49] (03CR) 10FNegri: [C:03+1] P:openstack: opentofu: Add a variable for region [puppet] - 10https://gerrit.wikimedia.org/r/1041045 (https://phabricator.wikimedia.org/T365696) (owner: 10Majavah) [13:34:02] (03CR) 10Majavah: [C:03+2] P:openstack: opentofu: Add a variable for region [puppet] - 10https://gerrit.wikimedia.org/r/1041045 (https://phabricator.wikimedia.org/T365696) (owner: 10Majavah) [13:34:08] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [13:34:10] (03CR) 10Majavah: [V:03+1 C:03+2] P:openstack: opentofu: Add a diff job to catch unapplied changes [puppet] - 10https://gerrit.wikimedia.org/r/1041069 (owner: 10Majavah) [13:34:20] (03CR) 10Marostegui: [C:03+1] dbconfig: temporary disable writes on es6 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041107 (https://phabricator.wikimedia.org/T367055) (owner: 10Arnaudb) [13:34:35] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [13:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - 
https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [13:35:35] (03CR) 10Ssingh: [C:03+2] restart-pybal: increase timeout and retries for spicerack.requests_session [cookbooks] - 10https://gerrit.wikimedia.org/r/1041112 (owner: 10Ssingh) [13:35:50] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync [13:36:15] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync [13:36:28] (03PS2) 10Filippo Giunchedi: hieradata: remove thanos-query settings from thanos::frontend [puppet] - 10https://gerrit.wikimedia.org/r/1041110 (https://phabricator.wikimedia.org/T357747) [13:36:28] (03PS2) 10Filippo Giunchedi: titan: trim 5m retention to 70w [puppet] - 10https://gerrit.wikimedia.org/r/1041111 (https://phabricator.wikimedia.org/T357747) [13:36:40] !log move recommendation-api on wikikube to prometheus metrics (offboarded from statsd) - T205870 [13:36:42] (03CR) 10Majavah: [C:04-1] "most of them, yes :-) but I wanted to mention the second category (I think just wikimediacloud.org and wikimedia.cloud) which are pointed" [puppet] - 10https://gerrit.wikimedia.org/r/1039849 (owner: 10Ncmonitor) [13:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:44] T205870: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870 [13:36:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[13:36:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [13:36:58] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply [13:37:25] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply [13:38:18] (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2840/co" [puppet] - 10https://gerrit.wikimedia.org/r/1041110 (https://phabricator.wikimedia.org/T357747) (owner: 10Filippo Giunchedi) [13:40:52] (03CR) 10Giuseppe Lavagetto: [C:03+2] Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [13:40:54] (03CR) 10Brouberol: [C:03+2] echoserver: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041071 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol) [13:41:33] (03Merged) 10jenkins-bot: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [13:41:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[13:41:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [13:42:10] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. [13:42:39] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. [13:43:07] effie: done! [13:43:45] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/echoserver: apply [13:43:56] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/echoserver: apply [13:46:01] (03CR) 10Brouberol: [C:03+2] superset: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041068 (https://phabricator.wikimedia.org/T346638) (owner: 10Brouberol) [13:46:12] !log taavi@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad or A:lvs-low-traffic-eqiad and A:lvs [13:47:04] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply [13:47:11] !log taavi@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad or A:lvs-low-traffic-eqiad and A:lvs [13:47:33] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply [13:48:57] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply [13:49:25] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply [13:50:59] !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4047.ulsfo.wmnet [13:51:53] (03CR) 10Clément Goubert: "Small nit inline, otherwise lgtm" [deployment-charts] 
- 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [13:51:57] (03PS1) 10Ottomata: Remove EventLoggingLegacyConverter code - it has been moved to EventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041115 (https://phabricator.wikimedia.org/T353817) [13:52:33] (03CR) 10CI reject: [V:04-1] Remove EventLoggingLegacyConverter code - it has been moved to EventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041115 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata) [13:54:03] (03PS2) 10Ottomata: Remove EventLoggingLegacyConverter code - it has been moved to EventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041115 (https://phabricator.wikimedia.org/T353817) [13:55:53] (03PS3) 10Ottomata: Remove EventLoggingLegacyConverter code - it has been moved to EventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041115 (https://phabricator.wikimedia.org/T353817) [13:56:30] (03PS10) 10EoghanGaffney: lists: Add option to switch mailman root [puppet] - 10https://gerrit.wikimedia.org/r/1040174 [13:57:07] (03CR) 10Clément Goubert: [C:03+1] services: update changeprop's Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041105 (https://phabricator.wikimedia.org/T356252) (owner: 10Elukey) [13:57:16] (03PS1) 10Brouberol: datasets-config: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041119 (https://phabricator.wikimedia.org/T362978) [13:57:30] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1107 for T348977 - bking@cumin2002 [13:57:31] !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic1107 for T348977 - bking@cumin2002 [13:57:34] T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977 [13:57:47] !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban 
Banning hosts: elastic1107.eqiad.wmnet for T348977 - bking@cumin2002 [13:57:50] !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1107.eqiad.wmnet for T348977 - bking@cumin2002 [13:58:15] (03PS8) 10Giuseppe Lavagetto: statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) [13:58:26] (03CR) 10Giuseppe Lavagetto: statsd: add deployment to mw-debug (codfw only) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [13:59:08] (03CR) 10Clément Goubert: [C:03+1] statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [13:59:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T364069)', diff saved to https://phabricator.wikimedia.org/P64535 and previous config saved to /var/cache/conftool/dbconfig/20240610-135914-marostegui.json [13:59:19] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [14:01:07] (03PS2) 10Brouberol: mpic: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041120 (https://phabricator.wikimedia.org/T362978) [14:01:37] jouncebot: now [14:01:38] No deployments scheduled for the next 1 hour(s) and 28 minute(s) [14:01:49] elukey: how are things on your end? [14:02:11] all done! 
(pinged you earlier on) [14:03:54] elukey: oh sorry, notification fail :/ [14:05:27] please ping me once you're done, I want to deploy so many more patches [14:08:06] <_joe_> Amir1: hold your horses [14:08:23] (03CR) 10Giuseppe Lavagetto: [C:03+2] statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [14:08:48] <_joe_> Amir1: I might make mediawiki un-deployable for a short while [14:09:11] 10SRE-tools, 06Infrastructure-Foundations: Add option to exclude nodes from reboot by uptime or last reboot date - https://phabricator.wikimedia.org/T366797#9875489 (10MoritzMuehlenhoff) I think we should rather base this on a given kernel version? Seems more robust than a given date. [14:09:13] (03Merged) 10jenkins-bot: statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [14:10:07] (03CR) 10Btullis: [C:03+1] datasets-config: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041119 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol) [14:10:16] (03PS1) 10Clément Goubert: shellbox: Bump shellbox image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041122 (https://phabricator.wikimedia.org/T362518) [14:10:19] (03PS1) 10Hnowlan: thumbor: use bullseye image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041123 (https://phabricator.wikimedia.org/T355020) [14:10:31] !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply [14:10:31] (03CR) 10Btullis: [C:03+1] mpic: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041120 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol) [14:11:01] !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: 
apply [14:12:09] (03CR) 10Ebrahim: "ladsgroup@gmail.com" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041091 (owner: 10Ebrahim) [14:12:13] (03CR) 10Giuseppe Lavagetto: [C:03+2] mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [14:12:14] (03CR) 10Clément Goubert: [C:03+1] thumbor: use bullseye image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041123 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan) [14:12:51] _joe_: 💔 [14:12:51] (03PS2) 10Hnowlan: thumbor: use bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041123 (https://phabricator.wikimedia.org/T355020) [14:13:29] <_joe_> Amir1: gimme another 10 minutes and you'll be free [14:14:24] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P64536 and previous config saved to /var/cache/conftool/dbconfig/20240610-141422-marostegui.json [14:14:55] (03PS8) 10Giuseppe Lavagetto: mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265) [14:15:17] (03CR) 10Brouberol: [C:03+2] mpic: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041120 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol) [14:15:28] (03CR) 10Brouberol: [C:03+2] datasets-config: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041119 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol) [14:15:29] (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto) [14:15:37] (03CR) 10Ladsgroup: [C:03+1] thumbor: use bookworm [deployment-charts] - 
10https://gerrit.wikimedia.org/r/1041123 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan) [14:18:36] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config-next: apply [14:18:45] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config-next: apply [14:18:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [14:18:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [14:18:55] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config: apply [14:19:04] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config: apply [14:19:12] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config: apply [14:19:28] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [14:19:37] !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [14:21:30] (03CR) 10Hnowlan: [C:03+2] thumbor: use bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041123 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan) [14:22:20] (03Merged) 10jenkins-bot: thumbor: use bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041123 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan) [14:23:44] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [14:23:51] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [14:25:37] 
06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9875546 (10cmooney) p:05Triage→03Medium [14:28:01] (03CR) 10Scott French: [C:03+2] proton: drop replicas from 12 to 10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040221 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [14:28:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [14:28:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [14:28:45] (03Merged) 10jenkins-bot: proton: drop replicas from 12 to 10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040221 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [14:28:53] (03PS1) 10Brouberol: rdf-streaming-updater: remove from dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041131 [14:29:21] 10SRE-tools, 06Infrastructure-Foundations: Add option to exclude nodes from reboot by uptime or last reboot date - https://phabricator.wikimedia.org/T366797#9875574 (10Volans) p:05Triage→03Medium [14:29:32] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P64537 and previous config saved to /var/cache/conftool/dbconfig/20240610-142931-marostegui.json [14:30:41] (03PS1) 10EoghanGaffney: quickdatacopy: Add optional parameter for setting destination path [puppet] - 10https://gerrit.wikimedia.org/r/1041137 [14:31:11] (03CR) 10CI reject: [V:04-1] quickdatacopy: Add optional parameter for setting destination path [puppet] - 
10https://gerrit.wikimedia.org/r/1041137 (owner: 10EoghanGaffney) [14:31:42] jouncebot: next [14:31:42] In 0 hour(s) and 58 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1530) [14:31:51] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [14:31:55] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2023.codfw.wmnet [14:32:06] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet [14:32:08] <_joe_> Amir1: please go on if it wasn't clear heh [14:32:20] (03CR) 10Elukey: [C:03+2] services: update changeprop's Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041105 (https://phabricator.wikimedia.org/T356252) (owner: 10Elukey) [14:33:16] Thank you [14:33:18] (03PS1) 10Eevans: aqs: Upgrade aqs1010 to Java 11 (canary) [puppet] - 10https://gerrit.wikimedia.org/r/1041138 (https://phabricator.wikimedia.org/T350567) [14:33:34] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1019.eqiad.wmnet [14:34:32] (03CR) 10Scott French: [C:03+2] admin_ng: bump CPU resourcequota for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040220 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [14:34:35] (03PS2) 10EoghanGaffney: quickdatacopy: Add optional parameter for setting destination path [puppet] - 10https://gerrit.wikimedia.org/r/1041137 [14:35:35] (03PS1) 10Hnowlan: Adapt entrypoint-prod to bullseye + blubber path [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1041140 (https://phabricator.wikimedia.org/T355020) [14:36:52] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [14:36:54] !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync [14:37:05] 
!log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance [14:37:31] (03Merged) 10jenkins-bot: admin_ng: bump CPU resourcequota for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040220 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French) [14:38:42] !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync [14:38:45] FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:39:45] !log swfrench@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'. [14:40:03] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet [14:40:17] (03CR) 10Clément Goubert: [C:03+1] Adapt entrypoint-prod to bullseye + blubber path [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1041140 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan) [14:41:12] (03PS1) 10Brouberol: superset: replace IP-based networkpolicy by its service counterpart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041142 (https://phabricator.wikimedia.org/T331894) [14:41:15] (03CR) 10JMeybohm: deployment_server: alert on admin-ng pending changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:41:15] !log swfrench@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'. 
[14:41:19] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1019.eqiad.wmnet [14:41:32] (03PS2) 10Brouberol: superset: replace IP-based networkpolicy by its service counterpart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041142 (https://phabricator.wikimedia.org/T331894) [14:41:38] (03CR) 10Alexandros Kosiaris: [C:03+1] Adapt entrypoint-prod to bullseye + blubber path [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1041140 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan) [14:41:50] !log swfrench@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'. [14:41:55] PROBLEM - Host kubestagemaster2005 is DOWN: PING CRITICAL - Packet loss = 100% [14:42:25] !log swfrench@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'. [14:43:13] !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic1107.eqiad.wmnet with reason: T365982 [14:43:16] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'. 
[14:43:18] T365982: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982 [14:43:22] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync [14:43:27] !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic1107.eqiad.wmnet with reason: T365982 [14:43:45] FIRING: [2x] ProbeDown: Service kubestagemaster2005:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2005:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:43:45] (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1041138 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans) [14:44:01] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync [14:44:27] (03CR) 10Eevans: [C:03+2] aqs: Upgrade aqs1010 to Java 11 (canary) [puppet] - 10https://gerrit.wikimedia.org/r/1041138 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans) [14:44:41] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T364069)', diff saved to https://phabricator.wikimedia.org/P64538 and previous config saved to /var/cache/conftool/dbconfig/20240610-144439-marostegui.json [14:44:43] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [14:44:44] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [14:44:54] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'. 
[14:44:56] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance [14:45:03] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T364069)', diff saved to https://phabricator.wikimedia.org/P64539 and previous config saved to /var/cache/conftool/dbconfig/20240610-144501-marostegui.json [14:45:04] (03CR) 10Kamila Součková: [C:03+1] "yespls :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041122 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [14:45:10] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet [14:45:25] RECOVERY - Host kubestagemaster2005 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms [14:45:25] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet [14:45:33] (03CR) 10Ladsgroup: [C:03+2] errorpages: Add dark mode support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041091 (owner: 10Ebrahim) [14:45:39] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'. [14:45:46] FIRING: [3x] ProbeDown: Service ganeti2023:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:45:51] (03CR) 10Clément Goubert: [C:03+2] shellbox: Bump shellbox image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041122 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [14:46:13] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'. 
[14:46:23] (03Merged) 10jenkins-bot: errorpages: Add dark mode support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041091 (owner: 10Ebrahim) [14:46:42] !log cdobbins@cumin1002 conftool action : set/pooled=no; selector: name=cp4046.ulsfo.wmnet [14:46:49] (03CR) 10TrainBranchBot: [C:03+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041091 (owner: 10Ebrahim) [14:46:57] FIRING: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:47:03] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1041091|errorpages: Add dark mode support]] [14:47:06] (03Merged) 10jenkins-bot: shellbox: Bump shellbox image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041122 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert) [14:47:56] !log aqs1010: restarting cassandra to apply upgrade to Java 11 — T350567 [14:48:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:01] T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567 [14:48:42] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs1010.eqiad.wmnet: Apply update to Java 11 - eevans@cumin1002 [14:48:45] RESOLVED: [3x] ProbeDown: Service ganeti2023:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [14:48:45] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:48:59] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply [14:49:35] !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply [14:49:42] (03CR) 10Alexandros Kosiaris: [C:04-1] "Looks pretty nice, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [14:50:15] !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply [14:50:36] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply [14:50:53] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply [14:51:28] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply [14:51:40] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply [14:51:40] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply [14:51:45] !log cdobbins@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp4046.ulsfo.wmnet [14:51:47] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply [14:51:57] RESOLVED: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:52:04] !log sudo -i cookbook sre.hosts.reboot-single -r 'Kernel upgrade' 'P{cp4046.*}' [14:52:05] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply [14:52:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:16] !log cgoubert@deploy1002 
helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply [14:52:35] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [14:52:41] !log powercycling ganeti1019, stuck on reboot [14:52:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:52] !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply [14:53:13] !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply [14:53:31] !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [14:53:56] !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet [14:54:11] !log ladsgroup@deploy1002 ladsgroup and ebrahim: Backport for [[gerrit:1041091|errorpages: Add dark mode support]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:54:37] !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [14:55:06] (03CR) 10Bking: [C:03+1] "feel free to merge once the dependent patch is merged." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1041142 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [14:55:15] !log ladsgroup@deploy1002 ladsgroup and ebrahim: Continuing with sync [14:55:16] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply [14:55:46] FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:55:49] (03PS1) 10Ahmon Dancy: fix-staging-perms.sh: Add missing -r to an xargs call [puppet] - 10https://gerrit.wikimedia.org/r/1041145 (https://phabricator.wikimedia.org/T364309) [14:56:04] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply [14:56:09] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs1010.eqiad.wmnet: Apply update to Java 11 - eevans@cumin1002 [14:56:10] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply [14:56:31] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply [14:56:38] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply [14:57:13] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply [14:57:19] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply [14:57:20] (03CR) 10Hnowlan: [C:03+2] Adapt entrypoint-prod to bullseye + blubber path [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1041140 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan) [14:57:30] (03PS1) 10Majavah: hieradata: cloudweb: Fix LVS service name [puppet] - 10https://gerrit.wikimedia.org/r/1041146 [14:58:05] !log cgoubert@deploy1002 helmfile [codfw] DONE 
helmfile.d/services/shellbox-syntaxhighlight: apply [14:58:11] !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply [14:59:29] (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2843/console" [puppet] - 10https://gerrit.wikimedia.org/r/1041146 (owner: 10Majavah) [14:59:37] !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply [14:59:41] (03CR) 10Majavah: [V:03+1 C:03+2] hieradata: cloudweb: Fix LVS service name [puppet] - 10https://gerrit.wikimedia.org/r/1041146 (owner: 10Majavah) [14:59:50] 06SRE, 10SRE-swift-storage, 06Infrastructure-Foundations, 10netops, 06Traffic: Rise in ms-fe2* TCP retransmits since 11:40 UTC today - https://phabricator.wikimedia.org/T367056#9875713 (10MatthewVernon) [14:59:59] !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [15:00:35] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet [15:00:40] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply [15:00:45] (03Merged) 10jenkins-bot: Adapt entrypoint-prod to bullseye + blubber path [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1041140 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan) [15:01:25] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply [15:01:27] PROBLEM - Host ganeti1019 is DOWN: PING CRITICAL - Packet loss = 100% [15:01:31] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply [15:01:35] !log cdobbins@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4046.ulsfo.wmnet [15:01:36] !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [15:01:49] !log cgoubert@deploy1002 helmfile [eqiad] DONE 
helmfile.d/services/shellbox-constraints: apply [15:01:55] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply [15:02:15] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply [15:02:21] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply [15:02:43] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply [15:02:47] PROBLEM - Host kubetcd2004 is DOWN: PING CRITICAL - Packet loss = 100% [15:02:49] !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply [15:03:19] !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply [15:03:45] FIRING: [3x] ProbeDown: Service ganeti1019:1811 has failed probes (tcp_ganeti_noded_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:04:19] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1041091|errorpages: Add dark mode support]] (duration: 17m 15s) [15:04:41] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:05:06] !log cdobbins@cumin1002 conftool action : set/pooled=yes; selector: name=4046.ulsfo.wmnet [15:05:06] 06SRE, 10SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9875754 (10KOfori) a:05KOfori→03WDoranWMF @WDoranWMF please check this out and let me know if this has your approval before I approve. 
[15:05:27] RECOVERY - Host kubetcd2004 is UP: PING OK - Packet loss = 0%, RTA = 30.81 ms [15:07:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet [15:07:18] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet [15:08:45] FIRING: [2x] ProbeDown: Service ganeti1019:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:09:52] (03PS1) 10Ladsgroup: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041151 [15:10:57] (03CR) 10Ladsgroup: [C:03+2] Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041151 (owner: 10Ladsgroup) [15:11:37] (03Merged) 10jenkins-bot: Bump portals to HEAD [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041151 (owner: 10Ladsgroup) [15:11:42] 10ops-codfw, 06SRE, 06DC-Ops, 10Observability-Metrics: Memory upgrade request for prometheus200[56] - https://phabricator.wikimedia.org/T360895#9875773 (10Jhancock.wm) I got the ram out of the servers. I can do the addition tomorrow morning at 8am CDT (1600 UTC) or another time that works for you. [15:12:25] FIRING: SystemdUnitFailed: httpbb_hourly_appserver.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:13:45] 06SRE, 06Infrastructure-Foundations, 10netops: Juniper QFX5120 error logs on lsw1-e1 and lsw1-f1: Failed to get ifl for ifl index - https://phabricator.wikimedia.org/T325801#9875780 (10cmooney) 05Open→03Resolved We seem to have no such errors being logged any more, either from these switches or the d... 
[15:14:41] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:16:04] jouncebot: now [15:16:04] No deployments scheduled for the next 0 hour(s) and 13 minute(s) [15:16:13] jouncebot: next [15:16:13] In 0 hour(s) and 13 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1530) [15:17:25] RESOLVED: SystemdUnitFailed: httpbb_hourly_appserver.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [15:17:51] (03PS8) 10Ladsgroup: Change static footer icons to the new one [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: 10Jforrester) [15:19:13] effie: `jouncebot: nowandnext` is a sneaky shortcut for that set of lookups [15:20:15] hahaha, I know, I think I am just used to making 2 requests, keeping the bot busy:p [15:20:59] (03CR) 10MVernon: [C:03+2] wmflib: add Wmflib::IP::Address::CIDR type [puppet] - 10https://gerrit.wikimedia.org/r/1041046 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:22:39] (03PS1) 10Filippo Giunchedi: logstash: align benthos mw-accesslog-sampler consumer group [puppet] - 10https://gerrit.wikimedia.org/r/1041155 (https://phabricator.wikimedia.org/T366308) [15:22:54] !log ladsgroup@deploy1002 Synchronized portals/wikipedia.org/assets: (no justification provided) (duration: 10m 28s) [15:24:02] effie: fair enough. 
:) [15:24:23] 06SRE, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071 (10MoritzMuehlenhoff) 03NEW [15:24:37] 06SRE, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9875829 (10MoritzMuehlenhoff) p:05Triage→03Medium [15:27:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [15:27:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:28:15] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1033 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:29:00] !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [15:29:04] (03Abandoned) 10Elukey: Remove response_code label from totals in Lift Wing Availability SLOs [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1010196 (owner: 10Elukey) [15:30:04] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1530). [15:30:51] (03PS1) 10Elukey: slo_template: update the SLO window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1041159 [15:30:55] 10ops-codfw, 06SRE, 06DC-Ops, 10Observability-Metrics: Memory upgrade request for prometheus200[56] - https://phabricator.wikimedia.org/T360895#9875848 (10fgiunchedi) >>! In T360895#9875773, @Jhancock.wm wrote: > I got the ram out of the servers. I can do the addition tomorrow morning at 8am CDT (1600 UTC)... 
[15:30:57] (03PS1) 10Hnowlan: thumbor: new version with entrypoint script fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041160 (https://phabricator.wikimedia.org/T355020) [15:31:05] (03CR) 10CI reject: [V:04-1] thumbor: new version with entrypoint script fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041160 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan) [15:31:07] !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply [15:31:12] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:cassandra-dev [15:32:27] 10ops-codfw, 06SRE, 06DC-Ops, 10Observability-Metrics: Memory upgrade request for prometheus200[56] - https://phabricator.wikimedia.org/T360895#9875853 (10herron) >>! In T360895#9875773, @Jhancock.wm wrote: > I got the ram out of the servers. I can do the addition tomorrow morning at 8am CDT (1600 UTC) or... [15:32:35] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9875854 (10kamila) @Papaul could you please let me know when would be a good time for you to do this? We don't have any specific... 
[15:33:09] (03PS1) 10JMeybohm: flink-operator: add securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041161 (https://phabricator.wikimedia.org/T362978) [15:33:29] (03PS2) 10Hnowlan: thumbor: new version with entrypoint script fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041160 (https://phabricator.wikimedia.org/T355020) [15:34:06] 06SRE, 06collaboration-services, 06Release-Engineering-Team, 06Traffic, 13Patch-For-Review: Move GitLab behind the CDN - https://phabricator.wikimedia.org/T366882#9875862 (10LSobanski) p:05Triage→03High [15:34:14] !log ladsgroup@deploy1002 Synchronized portals: (no justification provided) (duration: 11m 20s) [15:34:39] (03CR) 10Kamila Součková: [C:03+1] thumbor: new version with entrypoint script fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041160 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan) [15:35:07] RECOVERY - Host ganeti1019 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [15:36:45] (03CR) 10Hnowlan: [C:03+2] thumbor: new version with entrypoint script fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041160 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan) [15:37:28] (03CR) 10Klausman: [C:03+1] slo_template: update the SLO window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1041159 (owner: 10Elukey) [15:37:41] (03Merged) 10jenkins-bot: thumbor: new version with entrypoint script fix [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041160 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan) [15:38:29] (03PS2) 10Elukey: slo_template: update the SLO window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1041159 [15:38:45] RESOLVED: ProbeDown: Service ganeti1019:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown 
[15:39:00] (03CR) 10Elukey: "Added the wrong month :( (sept instead of August)" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1041159 (owner: 10Elukey) [15:39:19] 06SRE, 10SRE-Access-Requests: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073 (10amastilovic) 03NEW [15:39:25] 10ops-codfw, 06SRE, 06DC-Ops, 10Observability-Metrics: Memory upgrade request for prometheus200[56] - https://phabricator.wikimedia.org/T360895#9875901 (10Jhancock.wm) Yes that would work. [15:40:18] (03PS3) 10Elukey: slo_template: update the SLO window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1041159 [15:40:36] (03CR) 10Elukey: "And August has 31 days, not 30.. Good job Luca :D" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1041159 (owner: 10Elukey) [15:40:51] !log Drop flaggedpage_pending from s6 T365568 [15:40:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:56] T365568: Drop flaggedpage_pending from production - https://phabricator.wikimedia.org/T365568 [15:41:40] !log Drop flaggedpage_pending from s7 T365568 [15:41:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:41:51] !log bounce benthos@mw_accesslog_metrics.service on centrallog hosts [15:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[15:42:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [15:42:49] !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply [15:42:57] !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply [15:43:13] 06SRE, 10SRE-Access-Requests: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073#9875920 (10Ottomata) [15:43:35] !log Drop flaggedpage_pending from s2 T365568 [15:43:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:23] (03PS1) 10MVernon: cephadm: template out cephadm spec files [puppet] - 10https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621) [15:44:29] PROBLEM - MD RAID on ganeti1019 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:44:30] ACKNOWLEDGEMENT - MD RAID on ganeti1019 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T367075 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [15:44:38] 10ops-eqiad, 06SRE, 06DC-Ops: Degraded RAID on ganeti1019 - https://phabricator.wikimedia.org/T367075 (10ops-monitoring-bot) 03NEW [15:44:43] (03CR) 10CI reject: [V:04-1] cephadm: template out cephadm spec files [puppet] - 10https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:46:24] !log Drop flaggedpage_pending from s5 
T365568 [15:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:28] T365568: Drop flaggedpage_pending from production - https://phabricator.wikimedia.org/T365568 [15:46:52] (03PS1) 10Clément Goubert: docker_registry_ha: Bump nginx worker_rlimit_nofile [puppet] - 10https://gerrit.wikimedia.org/r/1041164 (https://phabricator.wikimedia.org/T366481) [15:47:11] !log Drop flaggedpage_pending from s3 T365568 [15:47:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:15] (03CR) 10Scott French: "Thanks, Janis! It looks like you might also need to add base.helper.restrictedSecurityContext onto the containers in `developer-portal/tem" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041039 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [15:52:42] (03CR) 10Scott French: [C:03+1] linkrecommendation: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041049 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [15:53:20] (03CR) 10Scott French: [C:03+1] machinetranslation: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041055 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [15:54:12] (03PS3) 10MVernon: cephadm: template out cephadm spec files [puppet] - 10https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621) [15:54:28] !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply [15:54:33] (03CR) 10CI reject: [V:04-1] cephadm: template out cephadm spec files [puppet] - 10https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [15:55:59] (03CR) 10Elukey: [V:03+2 C:03+2] slo_template: update the SLO window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/1041159 (owner: 10Elukey) [15:57:13] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Kubernetes 
deployment for amastilovic - https://phabricator.wikimedia.org/T367073#9876016 (10Ottomata) [15:57:27] (03CR) 10Scott French: [C:03+1] python-webapp: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041072 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [15:57:41] (03PS1) 10Ottomata: data.yaml - Add amastilovic to deployment user group [puppet] - 10https://gerrit.wikimedia.org/r/1041165 (https://phabricator.wikimedia.org/T367073) [15:58:15] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1033 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:58:25] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073#9876021 (10Ottomata) ^ patch to do this once approved. [15:59:38] (03CR) 10Scott French: "Thanks, Janis! Looks like this might need updates to calculator-service/templates/deployment.yaml as well?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1041076 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [16:00:01] !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:cassandra-dev [16:00:12] !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply [16:00:21] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073#9876024 (10Ottomata) @thcipriani for group approver [16:00:41] 06SRE, 10MW-on-K8s, 10Observability-Logging, 06serviceops: benthos mw-accesslog-metrics kafka lag and interpolation errors - https://phabricator.wikimedia.org/T367076#9876028 (10fgiunchedi) [16:01:10] !log 💙cdanis@puppetserver2001.codfw.wmnet ~ 🕛☕ sudo systemctl restart sync-puppet-volatile [16:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:05:16] (03PS4) 10MVernon: cephadm: template out cephadm spec files [puppet] - 10https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621) [16:05:51] !log 💙cdanis@cumin1002.eqiad.wmnet ~ 🕛☕ sudo cumin -b 8 '*.codfw.wmnet and C:geoip::data::puppet%fetch_ipinfo_dbs=true' 'sha512sum /usr/share/GeoIPInfo/GeoLite2-ASN.mmdb || run-puppet-agent' [16:05:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:28] 10ops-codfw, 06SRE-OnFire, 06DC-Ops, 06serviceops, 10Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9876057 (10Papaul) @kamila ? There are some planning that we need to do around this. We will need to relocate those servers for... [16:09:42] (03CR) 10Dzahn: [C:03+1] Remove iegreview module [puppet] - 10https://gerrit.wikimedia.org/r/1040873 (https://phabricator.wikimedia.org/T334415) (owner: 10Muehlenhoff) [16:13:49] (03CR) 10Scott French: [C:03+1] "Thanks, Janis! 
Just to confirm, the pod-level securityContext reverting to the chart defaults for runAsUser/Group (9999) should be a noop " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041161 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [16:14:45] (03CR) 10MVernon: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [16:20:47] !log Drop flaggedpage_pending from s1 T365568 [16:20:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:51] T365568: Drop flaggedpage_pending from production - https://phabricator.wikimedia.org/T365568 [16:21:18] !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply [16:26:51] !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply [16:38:43] .28 [16:43:05] (03CR) 10Vgutierrez: [V:03+1] "latest PS tested on WMCS and it's working as expected for several interfaces and on IPv4 only realservers as well" [puppet] - 10https://gerrit.wikimedia.org/r/1035724 (https://phabricator.wikimedia.org/T365689) (owner: 10Vgutierrez) [16:46:26] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073#9876336 (10Ahoelzl) Approved. [16:49:38] (03CR) 10JMeybohm: "Yeah, correct. The more precise settings (e.g. the one on container level) win." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1041161 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [16:49:52] (03CR) 10JMeybohm: deployment_server: alert on admin-ng pending changes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol) [16:58:07] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T364069)', diff saved to https://phabricator.wikimedia.org/P64543 and previous config saved to /var/cache/conftool/dbconfig/20240610-165806-marostegui.json [16:58:12] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [17:00:04] Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1700) [17:00:04] ryankemper: Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1700). Please do the needful. [17:00:09] 06SRE, 10Observability-Metrics, 05Goal, 13Patch-For-Review: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870#9876416 (10elukey) @colewhite o/ I finally deployed recommendation-api, and this time it looks good. I updated also its dashboard: https://grafana.wikimedia.org/d/Y5wk... 
[17:01:36] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073#9876424 (10Ottomata) [17:01:43] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance [17:01:57] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance [17:02:08] (03CR) 10JMeybohm: [C:03+1] docker_registry_ha: Bump nginx worker_rlimit_nofile [puppet] - 10https://gerrit.wikimedia.org/r/1041164 (https://phabricator.wikimedia.org/T366481) (owner: 10Clément Goubert) [17:02:32] (03CR) 10Clément Goubert: [C:03+2] docker_registry_ha: Bump nginx worker_rlimit_nofile [puppet] - 10https://gerrit.wikimedia.org/r/1041164 (https://phabricator.wikimedia.org/T366481) (owner: 10Clément Goubert) [17:02:42] 06SRE, 10SRE-Access-Requests, 06Data-Engineering, 13Patch-For-Review: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073#9876421 (10ttaylor) Approving in @thcipriani 's place since he is on vacation. [17:06:25] (03CR) 10Ottomata: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2850/co" [puppet] - 10https://gerrit.wikimedia.org/r/1041165 (https://phabricator.wikimedia.org/T367073) (owner: 10Ottomata) [17:07:35] (03CR) 10Scott French: [C:03+1] mcrouter: Bump chart modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038859 (https://phabricator.wikimedia.org/T362978) (owner: 10Alexandros Kosiaris) [17:08:40] (03PS2) 10JMeybohm: developer-portal: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041039 (https://phabricator.wikimedia.org/T362978) [17:09:14] (03CR) 10JMeybohm: "Absolutely, yes. Thanks for spotting this." 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1041039 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [17:12:09] (03CR) 10Ottomata: [V:03+1 C:03+2] data.yaml - Add amastilovic to deployment user group [puppet] - 10https://gerrit.wikimedia.org/r/1041165 (https://phabricator.wikimedia.org/T367073) (owner: 10Ottomata) [17:13:14] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P64544 and previous config saved to /var/cache/conftool/dbconfig/20240610-171313-marostegui.json [17:16:22] (03CR) 10Scott French: [C:03+1] developer-portal: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041039 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [17:20:22] (03PS3) 10Cmelo: Configures the necessary user rights for CampaignEvents on swahili [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502) [17:23:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... 
[17:23:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [17:24:00] (03PS4) 10Cmelo: Configures the necessary user rights for CampaignEvents on swahili [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502) [17:25:10] (03CR) 10Cmelo: Configures the necessary user rights for CampaignEvents on swahili (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502) (owner: 10Cmelo) [17:25:14] !log dancy@deploy1002 Installing scap version "4.87.0" for 285 hosts [17:26:26] (03CR) 10Daimona Eaytoy: [C:04-1] Configures the necessary user rights for CampaignEvents on swahili (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502) (owner: 10Cmelo) [17:28:21] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P64545 and previous config saved to /var/cache/conftool/dbconfig/20240610-172820-marostegui.json [17:29:19] !log amastilovic@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [17:29:30] !log amastilovic@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [17:30:15] !log dancy@deploy1002 Installation of scap version "4.87.0" completed for 285 hosts [17:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [17:36:59] !log otto@deploy1002 
helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply [17:37:02] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply [17:37:15] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073#9876692 (10amastilovic) Merged and applied - done [17:38:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [17:38:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [17:42:29] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073#9876716 (10Ottomata) 05Open→03Resolved a:03Ottomata [17:43:28] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T364069)', diff saved to https://phabricator.wikimedia.org/P64546 and previous config saved to /var/cache/conftool/dbconfig/20240610-174327-marostegui.json [17:43:30] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [17:43:32] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [17:43:43] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance [17:43:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T364069)', diff saved to https://phabricator.wikimedia.org/P64547 and previous config saved to /var/cache/conftool/dbconfig/20240610-174349-marostegui.json 
[17:46:53] !log amastilovic@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply [17:47:06] !log amastilovic@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply [17:50:22] !log amastilovic@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply [17:50:29] !log amastilovic@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply [17:57:38] (03CR) 10Aleksandar Mastilovic: "Deployed to eqiad and codfw. Deployed to staging too, but k8s showed no pods/resources running." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039727 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm) [18:01:11] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9876800 (10BCornwall) [18:01:33] 06SRE, 10Wikimedia-Mailing-lists, 07Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208#9876816 (10Ladsgroup) @Dzahn The issue was that the change made the config invalid, since it was invalid, it didn't restart the apache. But then later... [18:02:12] 10ops-eqiad, 06SRE, 06DC-Ops, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9876817 (10MoritzMuehlenhoff) a:03Jclark-ctr All VMs moved off the server. DC ops, can you please have a look? Not sure what "unsupported event" means, never seen that be... [18:02:36] 10ops-eqiad, 06SRE, 06DC-Ops, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9876819 (10MoritzMuehlenhoff) [18:06:24] 10ops-codfw, 06SRE, 10SRE-tools, 06DC-Ops, and 2 others: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9876833 (10wiki_willy) Thanks @Volans, will do on the remaining Netbox errors. >>! 
In T358542#9874557, @Volans wrote: > This is now completed. The new... [18:11:44] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [18:11:49] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 [18:17:43] 06SRE, 10Wikimedia-Mailing-lists, 07Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208#9876889 (10Dzahn) Gotcha! Yea, so.. I would normally support the idea of adding an Icinga check. Except my concern is that Icinga doesn't effectively... [18:17:46] (03PS2) 10Snwachukwu: Update Eventgate-Wikimedia and Eventstreams repository to Gitlab source and version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) [18:17:57] (03CR) 10Snwachukwu: "@ltoscano@wikimedia.org WHich images/config are you referring to please?" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu) [18:29:07] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9876950 (10BCornwall) [18:29:56] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9876951 (10BCornwall) [18:30:29] 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9876952 (10BCornwall) [18:49:00] (03PS11) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [18:55:26] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE - https://phabricator.wikimedia.org/T366558#9877037 (10herron) [18:55:42] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE - https://phabricator.wikimedia.org/T366558#9877038 (10herron) [18:55:46] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:56:07] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE - https://phabricator.wikimedia.org/T366558#9877036 (10herron) Hi @Ifrahkhanyaree_WMDE I see the SSH key in the description is in use already. Could you please generate a fresh ssh key for production use and... 
[18:57:43] 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for HNordeen - https://phabricator.wikimedia.org/T364801#9877041 (10herron) [18:58:17] (03PS12) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [19:02:06] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [19:02:12] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 [19:02:49] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [19:04:08] 06SRE, 10Wikimedia-Mailing-lists, 07Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208#9877057 (10Dzahn) P.S. (and when we merge apache changes and we aren't sure if a puppet refresh is enough for it to take effect, then we should do the... 
[19:04:55] (03PS13) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [19:06:16] (03PS1) 10Herron: admin: add hnordeen to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1041198 (https://phabricator.wikimedia.org/T364801) [19:07:11] (03CR) 10CI reject: [V:04-1] admin: add hnordeen to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1041198 (https://phabricator.wikimedia.org/T364801) (owner: 10Herron) [19:12:43] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T364069)', diff saved to https://phabricator.wikimedia.org/P64550 and previous config saved to /var/cache/conftool/dbconfig/20240610-191242-marostegui.json [19:12:47] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [19:14:30] FIRING: [2x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:15:00] (03PS2) 10Herron: admin: add hnordeen to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1041198 (https://phabricator.wikimedia.org/T364801) [19:17:22] (03CR) 10Dzahn: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1041198 (https://phabricator.wikimedia.org/T364801) (owner: 10Herron) [19:18:29] (03PS1) 10Herron: admin: add note/hint for no-ssh no-kereberos accounts [puppet] - 10https://gerrit.wikimedia.org/r/1041200 [19:19:30] RESOLVED: [2x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:20:09] (03CR) 10Herron: [C:03+2] admin: add hnordeen to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1041198 (https://phabricator.wikimedia.org/T364801) (owner: 10Herron) [19:20:22] (03CR) 10Dzahn: [C:03+1] admin: add note/hint for no-ssh no-kereberos accounts [puppet] - 10https://gerrit.wikimedia.org/r/1041200 (owner: 10Herron) [19:21:32] (03CR) 10Herron: [C:03+2] admin: add note/hint for no-ssh no-kereberos accounts [puppet] - 10https://gerrit.wikimedia.org/r/1041200 (owner: 10Herron) [19:22:50] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [19:22:54] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 [19:25:33] 07Puppet, 06Infrastructure-Foundations: Puppetmaster volatile data not synced to all puppet frontends for a month and a half (2024-04-27 to 2024-06-10) - https://phabricator.wikimedia.org/T367113 (10CDanis) 03NEW [19:25:46] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for HNordeen - https://phabricator.wikimedia.org/T364801#9877139 (10herron) 05In progress→03Resolved The patch to provision this access has been merged and will be propagated fully... 
[19:27:50] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P64551 and previous config saved to /var/cache/conftool/dbconfig/20240610-192749-marostegui.json [19:33:13] PROBLEM - rt.wikimedia.org tls expiry on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:33:17] PROBLEM - rt.wikimedia.org requires authentication on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:33:23] FIRING: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:33:29] (03CR) 10Dzahn: [C:03+2] create u4c.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1040144 (https://phabricator.wikimedia.org/T366649) (owner: 10Zabe) [19:33:34] (03PS2) 10Zabe: create u4c.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1040144 (https://phabricator.wikimedia.org/T366649) [19:34:05] i'll take care of the moscovium alerts [19:34:09] RECOVERY - rt.wikimedia.org tls expiry on moscovium is OK: OK - Certificate rt.discovery.wmnet will expire on Tue 25 Jun 2024 02:55:00 PM GMT +0000. 
https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:34:09] RECOVERY - rt.wikimedia.org requires authentication on moscovium is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 537 bytes in 1.578 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration [19:35:28] 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9877189 (10herron) 05In progress→03Resolved Resolving as the access looks to have been provisioned, please reopen if a... [19:36:56] FIRING: MaxConntrack: Max conntrack at 90.48% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [19:37:44] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Rae Adimer - https://phabricator.wikimedia.org/T365832#9877210 (10herron) [19:38:23] RESOLVED: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:41:55] RESOLVED: MaxConntrack: Max conntrack at 90.48% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack [19:42:08] (03PS10) 10Brouberol: deployment_server: alert on admin-ng pending changes [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) [19:42:43] (03PS11) 10Brouberol: deployment_server: alert on admin-ng pending changes [puppet] - 10https://gerrit.wikimedia.org/r/1040992 
(https://phabricator.wikimedia.org/T331894) [19:42:58] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P64552 and previous config saved to /var/cache/conftool/dbconfig/20240610-194256-marostegui.json [19:45:24] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Tchanders - https://phabricator.wikimedia.org/T366351#9877234 (10herron) [19:47:41] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Rae Adimer - https://phabricator.wikimedia.org/T365832#9877229 (10herron) (SSH key verification email sent) [19:47:51] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Tchanders - https://phabricator.wikimedia.org/T366351#9877241 (10herron) a:03JayCano Hi @JayCano, assigning to you for approval. Thanks! [19:48:43] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9877247 (10herron) [19:48:53] (03PS14) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [19:53:52] 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Rae Adimer - https://phabricator.wikimedia.org/T365832#9877256 (10herron) [19:54:47] (03PS12) 10Brouberol: deployment_server: alert on admin-ng pending changes [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) [19:58:05] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T364069)', diff saved to https://phabricator.wikimedia.org/P64553 and previous config saved to /var/cache/conftool/dbconfig/20240610-195804-marostegui.json [19:58:07] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on 
db1224.eqiad.wmnet with reason: Maintenance [19:58:09] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [19:58:20] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance [19:58:26] !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T364069)', diff saved to https://phabricator.wikimedia.org/P64554 and previous config saved to /var/cache/conftool/dbconfig/20240610-195826-marostegui.json [19:58:45] FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [19:58:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [19:59:03] (03PS2) 10Scott French: kubernetes: alert on persistent unavailable replicas [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) [20:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T2000). nyaa~ [20:00:05] No Gerrit patches in the queue for this window AFAICS. 
[20:00:40] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T352010)', diff saved to https://phabricator.wikimedia.org/P64555 and previous config saved to /var/cache/conftool/dbconfig/20240610-200039-ladsgroup.json [20:00:44] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:00:51] (03CR) 10CI reject: [V:04-1] kubernetes: alert on persistent unavailable replicas [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [20:03:45] RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ... [20:03:45] fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate [20:03:46] (03CR) 10Scott French: "Thanks, Janis!" 
[alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French) [20:05:36] (03PS3) 10Scott French: kubernetes: alert on persistent unavailable replicas [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) [20:10:15] (03PS15) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [20:14:35] (03PS5) 10JHathaway: cephadm: template out cephadm spec files [puppet] - 10https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon) [20:15:47] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P64556 and previous config saved to /var/cache/conftool/dbconfig/20240610-201546-ladsgroup.json [20:16:29] (03PS2) 10Herron: admin: add radimer to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1036593 (https://phabricator.wikimedia.org/T365832) (owner: 10Cwhite) [20:17:16] (03CR) 10CI reject: [V:04-1] admin: add radimer to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1036593 (https://phabricator.wikimedia.org/T365832) (owner: 10Cwhite) [20:17:45] 07Puppet, 06Infrastructure-Foundations: Puppetmaster volatile data not synced to all puppet frontends for a month and a half (2024-04-27 to 2024-06-10) - https://phabricator.wikimedia.org/T367113#9877330 (10CDanis) [20:18:36] (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1040144 (https://phabricator.wikimedia.org/T366649) (owner: 10Zabe) [20:20:27] (03PS16) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [20:21:37] (03PS1) 10CDanis: enable monitoring+logging for puppetmaster syncs [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) [20:21:58] (03CR) 10CI reject: [V:04-1] 
enable monitoring+logging for puppetmaster syncs [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) (owner: 10CDanis) [20:21:59] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) (owner: 10CDanis) [20:22:34] (03PS1) 10Herron: admin: add radimer to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1041218 (https://phabricator.wikimedia.org/T365832) [20:22:48] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you1" [puppet] - 10https://gerrit.wikimedia.org/r/1041111 (https://phabricator.wikimedia.org/T357747) (owner: 10Filippo Giunchedi) [20:23:51] (03PS2) 10CDanis: enable monitoring+logging for puppetmaster syncs [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) [20:23:56] (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) (owner: 10CDanis) [20:24:05] (03Abandoned) 10Herron: admin: add radimer to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1036593 (https://phabricator.wikimedia.org/T365832) (owner: 10Cwhite) [20:24:11] (03CR) 10CI reject: [V:04-1] enable monitoring+logging for puppetmaster syncs [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) (owner: 10CDanis) [20:24:15] (03CR) 10Dzahn: [C:04-1] "the expiry_contact and expiry_date should stay in there unless the manager or so states they aren't a contractor anymore" [puppet] - 10https://gerrit.wikimedia.org/r/1041218 (https://phabricator.wikimedia.org/T365832) (owner: 10Herron) [20:24:46] (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" 
[puppet] - https://gerrit.wikimedia.org/r/1040170 (https://phabricator.wikimedia.org/T366710) (owner: Filippo Giunchedi) [20:25:24] (PS2) Herron: admin: add radimer to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1041218 (https://phabricator.wikimedia.org/T365832) [20:25:39] (CR) JHathaway: "Pushed a patch with a few suggestions. One option you might want to consider is converting Puppet data structures to yaml directly, rather" [puppet] - https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621) (owner: MVernon) [20:25:43] (PS3) CDanis: enable monitoring+logging for puppetmaster syncs [puppet] - https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) [20:25:45] FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [20:25:50] (CR) CDanis: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) (owner: CDanis) [20:26:03] (CR) CI reject: [V:-1] enable monitoring+logging for puppetmaster syncs [puppet] - https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) (owner: CDanis) [20:26:12] (CR) Herron: "thanks! 
updated in ps2" [puppet] - https://gerrit.wikimedia.org/r/1041218 (https://phabricator.wikimedia.org/T365832) (owner: Herron) [20:26:42] (PS1) Santiago Faci: Metrics Platform Instrument Configurator: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1041221 (https://phabricator.wikimedia.org/T366918) [20:26:42] (PS17) CDobbins: varnish: WIP [puppet] - https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [20:28:56] (CR) Dzahn: "So the email address changed from -ctr to no -ctr suffix. And I see it's actually like that in LDAP. That brings up the question.. have th" [puppet] - https://gerrit.wikimedia.org/r/1041218 (https://phabricator.wikimedia.org/T365832) (owner: Herron) [20:29:22] (CR) Clare Ming: [C:+2] Metrics Platform Instrument Configurator: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1041221 (https://phabricator.wikimedia.org/T366918) (owner: Santiago Faci) [20:30:02] !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [20:30:07] (Merged) jenkins-bot: Metrics Platform Instrument Configurator: Deploying to staging [deployment-charts] - https://gerrit.wikimedia.org/r/1041221 (https://phabricator.wikimedia.org/T366918) (owner: Santiago Faci) [20:30:07] T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069 [20:30:25] !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet) [20:30:54] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to 
https://phabricator.wikimedia.org/P64557 and previous config saved to /var/cache/conftool/dbconfig/20240610-203053-ladsgroup.json [20:31:32] (PS4) CDanis: enable monitoring+logging for puppetmaster syncs [puppet] - https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) [20:36:10] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply [20:36:27] !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply [20:36:52] (PS5) Cmelo: Configures the necessary user rights for CampaignEvents on swahili [mediawiki-config] - https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502) [20:37:01] (PS5) CDanis: enable monitoring+logging for puppetmaster syncs [puppet] - https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) [20:37:14] (CR) Cmelo: Configures the necessary user rights for CampaignEvents on swahili (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502) (owner: Cmelo) [20:41:11] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9877425 (Dzahn) Stalled→Open [20:43:31] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Rae Adimer - https://phabricator.wikimedia.org/T365832#9877420 (herron) >>! In T365832#9830059, @elappen-WMF wrote: > Approving access from my end. Hi @LMccabe @elappen-WMF we noticed when writing the pa... 
[20:46:01] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T352010)', diff saved to https://phabricator.wikimedia.org/P64558 and previous config saved to /var/cache/conftool/dbconfig/20240610-204600-ladsgroup.json [20:46:03] !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2216.codfw.wmnet with reason: Maintenance [20:46:05] T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010 [20:46:16] !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2216.codfw.wmnet with reason: Maintenance [20:46:23] !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T352010)', diff saved to https://phabricator.wikimedia.org/P64559 and previous config saved to /var/cache/conftool/dbconfig/20240610-204622-ladsgroup.json [20:46:26] (CR) Herron: [C:+1] k8s: send logs to per-cluster kafka topics [puppet] - https://gerrit.wikimedia.org/r/1040170 (https://phabricator.wikimedia.org/T366710) (owner: Filippo Giunchedi) [20:48:08] (PS18) CDobbins: varnish: WIP [puppet] - https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [20:49:23] (CR) Herron: [C:+1] titan: trim 5m retention to 70w [puppet] - https://gerrit.wikimedia.org/r/1041111 (https://phabricator.wikimedia.org/T357747) (owner: Filippo Giunchedi) [20:56:24] Puppet, Infrastructure-Foundations: Install a default timeout for systemd::timer::jobs - https://phabricator.wikimedia.org/T367119 (CDanis) NEW [20:57:41] (PS1) Dzahn: admin: add rickijay to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/1041227 (https://phabricator.wikimedia.org/T365574) [20:59:24] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9877476 (Dzahn) Open→In progress [21:00:05] Reedy, sbassett, Maryum, 
and manfredi: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T2100). [21:04:13] (PS19) CDobbins: varnish: WIP [puppet] - https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [21:05:06] (PS1) Dzahn: remote iegreview.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1041229 (https://phabricator.wikimedia.org/T367011) [21:05:21] (PS2) Dzahn: remove iegreview.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1041229 (https://phabricator.wikimedia.org/T367011) [21:06:12] (PS3) Dzahn: remove iegreview.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1041229 (https://phabricator.wikimedia.org/T367011) [21:09:21] (CR) Dzahn: [C:+2] remove iegreview.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1041229 (https://phabricator.wikimedia.org/T367011) (owner: Dzahn) [21:11:01] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T364069)', diff saved to https://phabricator.wikimedia.org/P64560 and previous config saved to /var/cache/conftool/dbconfig/20240610-211101-marostegui.json [21:11:05] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [21:13:02] SRE, DNS, Traffic, Patch-For-Review: Remove iegreview.wikimedia.org from DNS - https://phabricator.wikimedia.org/T367011#9877497 (Dzahn) Open→Resolved a:Dzahn thanks for reporting. removed. Host iegreview.wikimedia.org not found: 3(NXDOMAIN) [21:13:10] (PS1) Ahmon Dancy: Testing Gerrit. 
Please Disregard [puppet] - https://gerrit.wikimedia.org/r/1041230 [21:17:38] (PS1) Eevans: sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases [cookbooks] - https://gerrit.wikimedia.org/r/1041231 [21:20:01] (CR) Muehlenhoff: sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1041231 (owner: Eevans) [21:20:31] (PS20) CDobbins: varnish: WIP [puppet] - https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718) [21:21:17] (CR) CI reject: [V:-1] sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases [cookbooks] - https://gerrit.wikimedia.org/r/1041231 (owner: Eevans) [21:21:29] (PS8) Jdlrobson: Drop unused config [mediawiki-config] - https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) [21:23:52] (PS1) EoghanGaffney: lists: Remove quickdatacopy and use our own rsyncd and systemd timer [puppet] - https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) [21:23:54] (PS2) Eevans: sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases [cookbooks] - https://gerrit.wikimedia.org/r/1041231 [21:24:12] (CR) CI reject: [V:-1] lists: Remove quickdatacopy and use our own rsyncd and systemd timer [puppet] - https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) (owner: EoghanGaffney) [21:24:16] (CR) Eevans: sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1041231 (owner: Eevans) [21:25:22] SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for Rae Adimer - https://phabricator.wikimedia.org/T365832#9877582 (elappen-WMF) Hello! Yes I can confirm the email change is correct and yes you can remove the expiry. Also if needed confirming the change in... 
[21:25:23] (PS2) EoghanGaffney: lists: Remove quickdatacopy and use our own rsyncd and systemd timer [puppet] - https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) [21:26:08] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P64561 and previous config saved to /var/cache/conftool/dbconfig/20240610-212608-marostegui.json [21:27:40] !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:restbase-eqiad [21:28:42] (CR) Stoyofuku-wmf: [C:+1] Drop unused config [mediawiki-config] - https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) (owner: Jdlrobson) [21:30:43] (CR) VolkerE: [C:+1] Drop unused config [mediawiki-config] - https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) (owner: Jdlrobson) [21:30:46] FIRING: ProbeDown: Service restbase1028-c:9042 has failed probes (tcp_cassandra_c_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#restbase1028-c:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:33:45] FIRING: [6x] ProbeDown: Service restbase1028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:34:42] FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [21:34:57] (PS3) EoghanGaffney: lists: Remove quickdatacopy and use our own rsyncd and systemd timer [puppet] - https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) 
[21:35:18] (CR) CI reject: [V:-1] lists: Remove quickdatacopy and use our own rsyncd and systemd timer [puppet] - https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) (owner: EoghanGaffney) [21:35:46] RESOLVED: [6x] ProbeDown: Service restbase1028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:36:04] SRE, DNS, Traffic: Cleanup unused DNS subdomains - https://phabricator.wikimedia.org/T367012#9877639 (Dzahn) cache.wikimedia.org goes so far back in history that I reached 2012 when using git blame and the change before that was made by root and isn't in gerrit anymore. langcom.wikimedia.org - same... [21:37:03] (PS4) EoghanGaffney: lists: Remove quickdatacopy and use our own rsyncd and systemd timer [puppet] - https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) [21:38:18] (CR) EoghanGaffney: [V:+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2868/co" [puppet] - https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) (owner: EoghanGaffney) [21:39:30] SRE, DNS, Traffic: Cleanup unused DNS subdomains - https://phabricator.wikimedia.org/T367012#9877666 (taavi) >>! In T367012#9877639, @Dzahn wrote: > langcom.wikimedia.org - same. It was already there in an initial import in 2011. Apparently there once was a `langcomwiki` which was [[ https://gerrit.... 
[21:40:49] (PS1) Scott French: aqs-http-gateway: allow cross-DC Cassandra client connection [deployment-charts] - https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851) [21:41:16] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P64562 and previous config saved to /var/cache/conftool/dbconfig/20240610-214115-marostegui.json [21:41:37] (CR) Muehlenhoff: [C:+1] "Looks good!" [cookbooks] - https://gerrit.wikimedia.org/r/1041231 (owner: Eevans) [21:41:51] SRE, DNS, Traffic: Cleanup unused DNS subdomains - https://phabricator.wikimedia.org/T367012#9877668 (Dzahn) pk.wikimedia.org was added in 2013 in https://gerrit.wikimedia.org/r/c/operations/dns/+/86650 to add a redirect but in 2023 the redirect was removed in https://gerrit.wikimedia.org/r/c/operati... [21:43:45] FIRING: [12x] ProbeDown: Service restbase1028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:44:29] (CR) Volans: sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1041231 (owner: Eevans) [21:48:23] jouncebot: nowandnext [21:48:23] For the next 1 hour(s) and 11 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T2100) [21:48:23] In 4 hour(s) and 11 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T0200) [21:48:45] RESOLVED: [12x] ProbeDown: Service restbase1028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:49:22] (PS1) Reedy: langlist-labs: Add bn and fr [mediawiki-config] - https://gerrit.wikimedia.org/r/1041236 (https://phabricator.wikimedia.org/T367126) [21:49:49] (PS2) Reedy: interwiki(-labs).php: De-duplicate and update from meta [mediawiki-config] - https://gerrit.wikimedia.org/r/1040766 (https://phabricator.wikimedia.org/T365679) [21:49:55] (CR) Reedy: [C:+2] interwiki(-labs).php: De-duplicate and update from meta [mediawiki-config] - https://gerrit.wikimedia.org/r/1040766 (https://phabricator.wikimedia.org/T365679) (owner: Reedy) [21:50:41] (PS1) Dzahn: delete langcom.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1041237 (https://phabricator.wikimedia.org/T367012) [21:50:42] (CR) Volans: sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1041231 (owner: Eevans) [21:53:17] (Merged) jenkins-bot: interwiki(-labs).php: De-duplicate and update from meta [mediawiki-config] - https://gerrit.wikimedia.org/r/1040766 (https://phabricator.wikimedia.org/T365679) (owner: Reedy) [21:53:29] (PS2) Reedy: langlist-labs: Add bn and fr [mediawiki-config] - https://gerrit.wikimedia.org/r/1041236 (https://phabricator.wikimedia.org/T367126) [21:53:33] (CR) Reedy: [C:+2] langlist-labs: Add bn and fr [mediawiki-config] - https://gerrit.wikimedia.org/r/1041236 (https://phabricator.wikimedia.org/T367126) (owner: Reedy) [21:55:14] (Merged) jenkins-bot: langlist-labs: Add bn and fr [mediawiki-config] - https://gerrit.wikimedia.org/r/1041236 (https://phabricator.wikimedia.org/T367126) (owner: Reedy) [21:55:46] FIRING: [12x] ProbeDown: Service restbase1029-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:56:20] (PS1) 
Reedy: interwiki-labs.php: Update as per langlist [mediawiki-config] - https://gerrit.wikimedia.org/r/1041238 [21:56:23] !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T364069)', diff saved to https://phabricator.wikimedia.org/P64563 and previous config saved to /var/cache/conftool/dbconfig/20240610-215622-marostegui.json [21:56:24] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [21:56:28] T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069 [21:56:38] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [21:57:41] (CR) Reedy: [C:+2] interwiki-labs.php: Update as per langlist [mediawiki-config] - https://gerrit.wikimedia.org/r/1041238 (owner: Reedy) [21:58:22] (Merged) jenkins-bot: interwiki-labs.php: Update as per langlist [mediawiki-config] - https://gerrit.wikimedia.org/r/1041238 (owner: Reedy) [22:00:46] RESOLVED: [12x] ProbeDown: Service restbase1029-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:04:05] (PS2) Scott French: aqs-http-gateway: allow cross-DC Cassandra client connection [deployment-charts] - https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851) [22:07:47] (PS1) Zabe: Add Apache configuration for u4c.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/1041240 (https://phabricator.wikimedia.org/T366649) [22:08:45] FIRING: [12x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:10:33] 
(PS1) Zabe: Add u4cwiki to private_wikis [puppet] - https://gerrit.wikimedia.org/r/1041242 (https://phabricator.wikimedia.org/T366649) [22:11:29] (CR) Scott French: "After discussion on T366851 and chatting with @brouberol@wikimedia.org earlier today, I think we're on the same page that this seems like " [deployment-charts] - https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851) (owner: Scott French) [22:13:45] RESOLVED: [12x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:14:10] !log reedy@deploy1002 Synchronized langlist-labs: Add fr and bn (duration: 14m 29s) [22:18:36] (PS1) Hashar: wm-zuul-status: fix reload button [software/gerrit] (deploy/wmf/stable-3.9) - https://gerrit.wikimedia.org/r/1041243 (https://phabricator.wikimedia.org/T360550) [22:19:27] (PS1) Dzahn: delete pk.wikimedia.org [dns] - https://gerrit.wikimedia.org/r/1041245 (https://phabricator.wikimedia.org/T367012) [22:19:59] (CR) Dzahn: "Good catch, taavi" [dns] - https://gerrit.wikimedia.org/r/1041237 (https://phabricator.wikimedia.org/T367012) (owner: Dzahn) [22:20:07] (CR) Hashar: "I have tried it by copy pasting in the the browser console:" [software/gerrit] (deploy/wmf/stable-3.9) - https://gerrit.wikimedia.org/r/1041243 (https://phabricator.wikimedia.org/T360550) (owner: Hashar) [22:20:46] FIRING: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:21:16] (Abandoned) Zabe: trafficserver: Move test-commons to mw-on-k8s [puppet] - https://gerrit.wikimedia.org/r/1034106 (owner: Zabe) [22:21:19] (CR) Dzahn: [C:+1] 
"lgtm, we traced this back to when we converted cron jobs to systemd timers. I don't fully remember but I think we just didn't turn on moni" [puppet] - https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) (owner: CDanis) [22:23:46] (PS1) DCausse: cirrus-streaming-updater: use new image [deployment-charts] - https://gerrit.wikimedia.org/r/1041246 [22:24:51] (CR) DCausse: [C:+2] cirrus-streaming-updater: use new image [deployment-charts] - https://gerrit.wikimedia.org/r/1041246 (owner: DCausse) [22:25:14] !log reedy@deploy1002 Synchronized wmf-config/: sync interwiki lists (duration: 10m 07s) [22:25:46] FIRING: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:25:51] (Merged) jenkins-bot: cirrus-streaming-updater: use new image [deployment-charts] - https://gerrit.wikimedia.org/r/1041246 (owner: DCausse) [22:27:25] FIRING: SystemdUnitFailed: php7.4-fpm_check_restart.service on mw1489:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [22:27:45] !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [22:28:03] !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:28:45] RESOLVED: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:30:37] !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply [22:30:51] !log dcausse@deploy1002 
helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:35:46] FIRING: [12x] ProbeDown: Service restbase1032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:36:10] !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply [22:36:19] !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply [22:38:45] RESOLVED: [12x] ProbeDown: Service restbase1032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:40:45] RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable [22:46:53] (CR) Dzahn: lists: Remove quickdatacopy and use our own rsyncd and systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) (owner: EoghanGaffney) [22:48:45] FIRING: [12x] ProbeDown: Service restbase1033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:53:45] RESOLVED: [12x] ProbeDown: Service restbase1033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:55:31] SRE, Language-Technical Support, serviceops, Wikimedia-Site-requests, Patch-For-Review: Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#9877825 (stjn) While discussing performance issues on Discord, I looked at https://he.wikisourc... [22:55:46] FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:00:46] FIRING: [12x] ProbeDown: Service restbase1034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:03:25] (PS1) Pppery: MediaWiki.org: restrict unfuzzy rights to autoconfirmed [mediawiki-config] - https://gerrit.wikimedia.org/r/1041249 (https://phabricator.wikimedia.org/T366994) [23:05:46] FIRING: [12x] ProbeDown: Service restbase1034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:08:45] RESOLVED: [12x] ProbeDown: Service restbase1034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:12:01] (PS1) Jdlrobson: Enable dark mode on data table pages [mediawiki-config] - https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366373) [23:13:45] FIRING: [11x] ProbeDown: Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:15:46] FIRING: [12x] ProbeDown: Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:20:46] RESOLVED: [12x] ProbeDown: Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:23:45] FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [23:28:45] FIRING: [12x] ProbeDown: Service restbase1036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:31:27] (PS3) Eevans: sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases [cookbooks] - https://gerrit.wikimedia.org/r/1041231 [23:32:05] (CR) Eevans: sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1041231 (owner: Eevans) [23:33:45] RESOLVED: [12x] ProbeDown: Service restbase1036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:38:16] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1041254 
[23:38:16] (CR) TrainBranchBot: [C:+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1041254 (owner: TrainBranchBot) [23:40:46] FIRING: [7x] ProbeDown: Service restbase1037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:41:56] (PS1) Reedy: Remove old wgAbuseFilterActorTableSchemaMigrationStage config [mediawiki-config] - https://gerrit.wikimedia.org/r/1041255 (https://phabricator.wikimedia.org/T188180) [23:42:42] (CR) Reedy: [C:-2] "Not yet; T188180#9877744 and I86ec2b816eed17b62bf02bfd085570f132011b3e to ride the train and become stable" [mediawiki-config] - https://gerrit.wikimedia.org/r/1041255 (https://phabricator.wikimedia.org/T188180) (owner: Reedy) [23:43:45] FIRING: [12x] ProbeDown: Service restbase1037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:48:45] RESOLVED: [12x] ProbeDown: Service restbase1037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [23:52:14] !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [23:52:27] !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance [23:55:46] FIRING: [12x] ProbeDown: Service restbase1038-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4) - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - 
https://alerts.wikimedia.org/?q=alertname%3DProbeDown