[00:01:21] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/1039603 (owner: TrainBranchBot)
[00:09:10] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on elastic1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:15:46] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:18:45] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:35:03] <icinga-wm_>	 PROBLEM - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[00:40:01] <icinga-wm_>	 RECOVERY - CirrusSearch more_like eqiad 95th percentile latency on graphite1005 is OK: OK: Less than 20.00% above the threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/d/000000455/elasticsearch-percentiles?orgId=1&var-cirrus_group=eqiad&var-cluster=elasticsearch&var-exported_cluster=production-search&var-smoothing=1&viewPanel=39
[00:45:46] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:48:45] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:59:42] <wikibugs>	 SRE, DNS, Traffic: benefactors.wikimedia.org should point somewhere better then the wikimedia.org homepage - https://phabricator.wikimedia.org/T367012 (Pppery) NEW
[01:31:20] <wikibugs>	 SRE, DNS, Traffic: Cleanup unused DNS subdomains - https://phabricator.wikimedia.org/T367012#9874025 (Pppery)
[01:32:04] <wikibugs>	 SRE, DNS, Traffic: Remove iegreview.wikimedia.org from DNS - https://phabricator.wikimedia.org/T367011#9874028 (Pppery) In for a penny, in for a pound - I tested every wikimedia.org subdomain and filed T367012 and T367013
[01:32:20] <wikibugs>	 (PS3) Huji: Add tfj as a shortcut for toolforge-jobs command [puppet] - https://gerrit.wikimedia.org/r/802596 (https://phabricator.wikimedia.org/T309308)
[01:34:42] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[01:45:46] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:48:45] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:11:55] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on elastic1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:38:45] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:45:46] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:48:45] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:58:45] <jinxer-wm>	 RESOLVED: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:09:10] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-wmf-elasticsearch-exporter-9600.service on elastic1086:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:45:46] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:48:45] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:12:45] <wikibugs>	 (PS1) Snwachukwu: Update Eventgate-Wikimedia and Eventstreams repository to Gitlab source and version [deployment-charts] - https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730)
[04:15:46] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:18:45] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:35:01] <wikibugs>	 (CR) KartikMistry: [C:+2] Update Apertium to 2024-06-07-143238-production [deployment-charts] - https://gerrit.wikimedia.org/r/1040195 (https://phabricator.wikimedia.org/T356252) (owner: KartikMistry)
[04:35:36] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[04:35:49] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1165.eqiad.wmnet with reason: Maintenance
[04:35:51] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[04:36:07] <kart_>	 Updating Apertium service in some time.
[04:36:07] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1015,1019,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[04:36:09] <wikibugs>	 (Merged) jenkins-bot: Update Apertium to 2024-06-07-143238-production [deployment-charts] - https://gerrit.wikimedia.org/r/1040195 (https://phabricator.wikimedia.org/T356252) (owner: KartikMistry)
[04:36:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1165 (T364069)', diff saved to https://phabricator.wikimedia.org/P64474 and previous config saved to /var/cache/conftool/dbconfig/20240610-043615-marostegui.json
[04:36:19] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[04:36:40] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s7 T366875
[04:36:46] <stashbot>	 T366875: Switchover s7 master (db2218 -> db2121) - https://phabricator.wikimedia.org/T366875
[04:36:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2121 with weight 0 T366875', diff saved to https://phabricator.wikimedia.org/P64475 and previous config saved to /var/cache/conftool/dbconfig/20240610-043649-root.json
[04:37:04] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s7 T366875
[04:37:35] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] START helmfile.d/services/apertium: apply
[04:37:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2121 from API/vslow/dump T366875', diff saved to https://phabricator.wikimedia.org/P64476 and previous config saved to /var/cache/conftool/dbconfig/20240610-043741-root.json
[04:37:56] <logmsgbot>	 !log kartik@deploy1002 helmfile [staging] DONE helmfile.d/services/apertium: apply
[04:38:13] <wikibugs>	 (PS2) Gerrit maintenance bot: mariadb: Promote db2121 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1039593 (https://phabricator.wikimedia.org/T366875)
[04:38:29] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db2121 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1039593 (https://phabricator.wikimedia.org/T366875) (owner: Gerrit maintenance bot)
[04:38:30] <wikibugs>	 (CR) Marostegui: [V:+2 C:+2] mariadb: Promote db2121 to s7 master [puppet] - https://gerrit.wikimedia.org/r/1039593 (https://phabricator.wikimedia.org/T366875) (owner: Gerrit maintenance bot)
[04:40:58] <wikibugs>	 (PS1) Marostegui: db1180: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1040863
[04:41:30] <wikibugs>	 (CR) Marostegui: [C:+2] db1180: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/1040863 (owner: Marostegui)
[04:41:59] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] START helmfile.d/services/apertium: apply
[04:42:36] <logmsgbot>	 !log kartik@deploy1002 helmfile [codfw] DONE helmfile.d/services/apertium: apply
[04:44:15] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T352010)', diff saved to https://phabricator.wikimedia.org/P64477 and previous config saved to /var/cache/conftool/dbconfig/20240610-044414-ladsgroup.json
[04:44:21] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[04:44:30] <marostegui>	 !log Rename flaggedpage_pending in s5 T365568
[04:44:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:44:34] <stashbot>	 T365568: Drop flaggedpage_pending from production - https://phabricator.wikimedia.org/T365568
[04:49:22] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] START helmfile.d/services/apertium: apply
[04:49:56] <logmsgbot>	 !log kartik@deploy1002 helmfile [eqiad] DONE helmfile.d/services/apertium: apply
[04:52:41] <kart_>	 !log Updated Apertium to 2024-06-07-143238-production (T356252)
[04:52:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:59:24] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P64478 and previous config saved to /var/cache/conftool/dbconfig/20240610-045922-ladsgroup.json
[05:02:15] <icinga-wm_>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:02:19] <icinga-wm_>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:04:09] <icinga-wm_>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:06:01] <icinga-wm_>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Tue 13 Aug 2024 12:55:14 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:06:11] <icinga-wm_>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 52067 bytes in 5.371 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:06:11] <icinga-wm_>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8617 bytes in 1.724 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:06:16] <marostegui>	 !log Starting s7 codfw failover from db2218 to db2121 - T366875
[05:06:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:06:21] <stashbot>	 T366875: Switchover s7 master (db2218 -> db2121) - https://phabricator.wikimedia.org/T366875
[05:06:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2121 to s7 primary T366875', diff saved to https://phabricator.wikimedia.org/P64479 and previous config saved to /var/cache/conftool/dbconfig/20240610-050637-marostegui.json
[05:07:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2218 T366875', diff saved to https://phabricator.wikimedia.org/P64480 and previous config saved to /var/cache/conftool/dbconfig/20240610-050738-root.json
[05:11:45] <wikibugs>	 (PS1) Marostegui: db2218: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1040865
[05:12:49] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2218.codfw.wmnet with reason: Long schema change
[05:12:51] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2218.codfw.wmnet with reason: Long schema change
[05:13:06] <wikibugs>	 (CR) Marostegui: [C:+2] db2218: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1040865 (owner: Marostegui)
[05:13:30] <marostegui>	 !log dbmaint codfw s7 deploy schema change on db2218 T364299
[05:13:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:33] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[05:14:34] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195', diff saved to https://phabricator.wikimedia.org/P64481 and previous config saved to /var/cache/conftool/dbconfig/20240610-051432-ladsgroup.json
[05:15:46] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:18:45] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:29:42] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2195 (T352010)', diff saved to https://phabricator.wikimedia.org/P64482 and previous config saved to /var/cache/conftool/dbconfig/20240610-052941-ladsgroup.json
[05:29:45] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance
[05:29:47] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2198.codfw.wmnet with reason: Maintenance
[05:29:47] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:34:42] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[05:41:51] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2179 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1039604 (https://phabricator.wikimedia.org/T367017)
[05:45:46] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:48:45] <jinxer-wm>	 FIRING: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:04:21] <jinxer-wm>	 FIRING: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:08:45] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:09:21] <jinxer-wm>	 RESOLVED: PoolcounterFullQueues: Full queues for poolcounter1004:9106 poolcounter - https://www.mediawiki.org/wiki/PoolCounter#Request_tracing_in_production - https://grafana.wikimedia.org/d/aIcYxuxZk/poolcounter?orgId=1&viewPanel=6&from=now-1h&to=now&var-dc=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DPoolcounterFullQueues
[06:11:16] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T352010)', diff saved to https://phabricator.wikimedia.org/P64483 and previous config saved to /var/cache/conftool/dbconfig/20240610-061116-ladsgroup.json
[06:11:23] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[06:14:58] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db2207 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1039605 (https://phabricator.wikimedia.org/T367019)
[06:15:26] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote db1162 to s2 master [puppet] - https://gerrit.wikimedia.org/r/1040886 (https://phabricator.wikimedia.org/T367020)
[06:15:31] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update s2-master alias [dns] - https://gerrit.wikimedia.org/r/1040887 (https://phabricator.wikimedia.org/T367020)
[06:15:46] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:17:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T364069)', diff saved to https://phabricator.wikimedia.org/P64484 and previous config saved to /var/cache/conftool/dbconfig/20240610-061658-marostegui.json
[06:17:04] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[06:18:42] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 32 hosts with reason: Primary switchover s4 T367017
[06:18:45] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:18:46] <stashbot>	 T367017: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T367017
[06:18:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Set db2179 with weight 0 T367017', diff saved to https://phabricator.wikimedia.org/P64485 and previous config saved to /var/cache/conftool/dbconfig/20240610-061849-root.json
[06:19:09] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 32 hosts with reason: Primary switchover s4 T367017
[06:19:39] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Remove db2179 from API/vslow/dump T367017', diff saved to https://phabricator.wikimedia.org/P64486 and previous config saved to /var/cache/conftool/dbconfig/20240610-061939-root.json
[06:19:57] <wikibugs>	 (PS1) Marostegui: Revert "db2218: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1040571
[06:20:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P64487 and previous config saved to /var/cache/conftool/dbconfig/20240610-062017-root.json
[06:20:21] <wikibugs>	 (CR) Marostegui: [C:+2] Revert "db2218: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/1040571 (owner: Marostegui)
[06:26:25] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P64488 and previous config saved to /var/cache/conftool/dbconfig/20240610-062624-ladsgroup.json
[06:32:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P64489 and previous config saved to /var/cache/conftool/dbconfig/20240610-063208-marostegui.json
[06:35:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P64490 and previous config saved to /var/cache/conftool/dbconfig/20240610-063524-root.json
[06:36:14] <wikibugs>	 (CR) Marostegui: [C:+2] mariadb: Promote db2179 to s4 master [puppet] - https://gerrit.wikimedia.org/r/1039604 (https://phabricator.wikimedia.org/T367017) (owner: Gerrit maintenance bot)
[06:38:13] <marostegui>	 !log Starting s4 codfw failover from db2140 to db2179 - T367017
[06:38:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:38:20] <stashbot>	 T367017: Switchover s4 master (db2140 -> db2179) - https://phabricator.wikimedia.org/T367017
[06:38:22] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s2 T367019
[06:38:30] <stashbot>	 T367019: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T367019
[06:38:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Promote db2179 to s4 primary T367017', diff saved to https://phabricator.wikimedia.org/P64491 and previous config saved to /var/cache/conftool/dbconfig/20240610-063830-root.json
[06:38:44] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s2 T367019
[06:39:04] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depool db2140 T367017', diff saved to https://phabricator.wikimedia.org/P64492 and previous config saved to /var/cache/conftool/dbconfig/20240610-063904-root.json
[06:39:13] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2207 with weight 0 T367019', diff saved to https://phabricator.wikimedia.org/P64493 and previous config saved to /var/cache/conftool/dbconfig/20240610-063912-arnaudb.json
[06:41:06] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Remove profile::base::use_linux510_on_buster [puppet] - https://gerrit.wikimedia.org/r/1040109 (owner: Muehlenhoff)
[06:41:32] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188', diff saved to https://phabricator.wikimedia.org/P64494 and previous config saved to /var/cache/conftool/dbconfig/20240610-064132-ladsgroup.json
[06:42:17] <wikibugs>	 (CR) DCausse: [C:+1] Deprecate system::role for search roles [puppet] - https://gerrit.wikimedia.org/r/1040125 (owner: Muehlenhoff)
[06:43:47] <wikibugs>	 (PS1) Marostegui: db2140: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1040869
[06:44:23] <wikibugs>	 (CR) Marostegui: [C:+2] db2140: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/1040869 (owner: Marostegui)
[06:45:10] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2009.codfw.wmnet
[06:45:46] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:46:42] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Long schema change
[06:46:45] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Long schema change
[06:47:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165', diff saved to https://phabricator.wikimedia.org/P64495 and previous config saved to /var/cache/conftool/dbconfig/20240610-064716-marostegui.json
[06:47:37] <marostegui>	 !log dbmaint codfw s4 deploy schema change on db2140 T364299
[06:47:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:47:40] <stashbot>	 T364299: Make rc_id a bigint - https://phabricator.wikimedia.org/T364299
[06:48:45] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[06:50:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P64496 and previous config saved to /var/cache/conftool/dbconfig/20240610-065031-root.json
[06:53:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2009.codfw.wmnet
[06:54:01] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Deprecate system::role for search roles [puppet] - https://gerrit.wikimedia.org/r/1040125 (owner: Muehlenhoff)
[06:56:41] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2188 (T352010)', diff saved to https://phabricator.wikimedia.org/P64497 and previous config saved to /var/cache/conftool/dbconfig/20240610-065640-ladsgroup.json
[06:56:43] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2202.codfw.wmnet with reason: Maintenance
[06:56:44] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[06:56:56] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2202.codfw.wmnet with reason: Maintenance
[06:58:19] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[06:58:32] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1240.eqiad.wmnet with reason: Maintenance
[06:59:34] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2009.codfw.wmnet
[06:59:45] <jinxer-wm>	 FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[07:00:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2009.codfw.wmnet
[07:00:05] <jouncebot>	 Amir1 and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T0700).
[07:00:05] <jouncebot>	 kostajh: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:01:18] <kostajh>	 hello
[07:01:44] <kostajh>	 I'll deploy the patch now
[07:02:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1165 (T364069)', diff saved to https://phabricator.wikimedia.org/P64498 and previous config saved to /var/cache/conftool/dbconfig/20240610-070224-marostegui.json
[07:02:28] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[07:02:29] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[07:02:37] <wikibugs>	 (CR) Muehlenhoff: [C:+2] Deprecate system::role for wikikube roles [puppet] - https://gerrit.wikimedia.org/r/1040124 (owner: Muehlenhoff)
[07:02:41] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1168.eqiad.wmnet with reason: Maintenance
[07:02:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1168 (T364069)', diff saved to https://phabricator.wikimedia.org/P64499 and previous config saved to /var/cache/conftool/dbconfig/20240610-070249-marostegui.json
[07:03:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2010.codfw.wmnet
[07:05:38] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P64500 and previous config saved to /var/cache/conftool/dbconfig/20240610-070537-root.json
[07:05:46] <jinxer-wm>	 FIRING: [2x] SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:07:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2010.codfw.wmnet
[07:10:50] <_joe_>	 jouncebot: nowandnext
[07:10:50] <jouncebot>	 For the next 0 hour(s) and 49 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T0700)
[07:10:50] <jouncebot>	 In 0 hour(s) and 49 minute(s): Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T0800)
[07:11:14] <_joe_>	 kostajh: lmk when you're done :)
[07:12:09] <kostajh>	 _joe_: are you able to check something for me with mw kubernetes via https://wikitech.wikimedia.org/wiki/MediaWiki_On_Kubernetes#Get_a_shell_on_a_production_pod ?
[07:12:24] <kostajh>	 I'd like to see the output of `scandir('/usr/share/GeoIP')` 
[07:12:40] <wikibugs>	 (03PS1) 10Brouberol: global_config: expose services for all mariadb hosts and masters [puppet] - 10https://gerrit.wikimedia.org/r/1040872 (https://phabricator.wikimedia.org/T331894)
[07:13:00] <kostajh>	 because I need some verification that https://gerrit.wikimedia.org/r/c/operations/puppet/+/1037528 propagated the files to the locations we care about
[07:13:27] <kostajh>	 it seems like on mwmaint, mwdebug, and mwdeploy, the files are not updated (but then again, that puppet config doesn't target those locations AFAIK)
[07:13:59] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2010.codfw.wmnet
[07:14:23] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2010.codfw.wmnet
[07:14:45] <jinxer-wm>	 RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[07:15:25] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2011.codfw.wmnet
[07:16:31] <_joe_>	 yeah let me look at that patch for a sec
[07:17:07] <_joe_>	 kostajh: in theory the change should affect all mw servers
[07:17:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1009.eqiad.wmnet
[07:17:42] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] push-notifications: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032519 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[07:18:19] <kostajh>	 _joe_: I think $fetch_private is not true for the mwdebug/mwmaint servers, perhaps
[07:18:29] <kostajh>	 https://gerrit.wikimedia.org/r/c/operations/puppet/+/1037528/21/modules/puppetmaster/manifests/geoip.pp#26
[07:18:32] <_joe_>	 kostajh: no you're wrong
[07:18:39] <_joe_>	 they are full mediawiki servers
[07:18:40] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] function-orchestrator: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037163 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[07:18:42] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] function-evaluator: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037162 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[07:18:50] <wikibugs>	 (03Merged) 10jenkins-bot: push-notifications: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1032519 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[07:19:51] <wikibugs>	 (03Merged) 10jenkins-bot: function-orchestrator: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037163 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[07:19:55] <wikibugs>	 (03Merged) 10jenkins-bot: function-evaluator: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037162 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[07:20:44] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P64501 and previous config saved to /var/cache/conftool/dbconfig/20240610-072043-root.json
[07:20:45] <_joe_>	 kostajh: on a k8s node, clearly the change had no effect
[07:20:57] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] proton: drop replicas from 12 to 10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040221 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[07:21:39] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove iegreview module [puppet] - 10https://gerrit.wikimedia.org/r/1040873 (https://phabricator.wikimedia.org/T334415)
[07:21:54] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] admin_ng: bump CPU resourcequota for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040220 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[07:22:16] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/push-notifications: apply
[07:22:26] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/push-notifications: apply
[07:23:07] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2011.codfw.wmnet
[07:23:22] <kostajh>	 _joe_: hmm. In the past, I was told (sorry, I have forgotten by whom) that the GeoIP changes would show up on the mwmaint server. That's why I added this note to operations/mediawiki-config https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1038723/7/wmf-config/CommonSettings.php#3953
[07:23:30] <kostajh>	 *would *not* show up 
[07:23:43] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/push-notifications: apply
[07:23:53] <wikibugs>	 (03CR) 10JMeybohm: [V:03+1 C:03+2] etcd::v3: Allow all nodes of an etcd cluster to connect to each other [puppet] - 10https://gerrit.wikimedia.org/r/1038283 (https://phabricator.wikimedia.org/T366465) (owner: 10JMeybohm)
[07:24:15] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/push-notifications: apply
[07:24:20] <_joe_>	 kostajh: whoever told you that is very wrong
[07:24:34] <kostajh>	 just to confirm, could we please try `scandir('/usr/share/GeoIP')` in a production k8s shell? 
[07:25:25] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/push-notifications: apply
[07:25:59] <_joe_>	 kostajh: already did on the physical hosts where it's mounted from
[07:26:05] <_joe_>	 there is no trace of the new files
[07:26:09] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/push-notifications: apply
[07:26:44] <wikibugs>	 (03PS2) 10Brouberol: global_config: expose services for all mariadb hosts and masters [puppet] - 10https://gerrit.wikimedia.org/r/1040872 (https://phabricator.wikimedia.org/T331894)
[07:27:17] <kostajh>	 shouldn't we see the GeoIP enterprise files on `/usr/share/GeoIP`?
[07:27:23] <_joe_>	 kostajh: well actually, they're there but not updated since friday, to be clearer
[07:27:45] <kostajh>	 I see these ones https://gerrit.wikimedia.org/r/c/operations/puppet/+/1037528/21/modules/puppetmaster/manifests/geoip.pp#37
[07:27:47] <_joe_>	 it should be mounted, yes
[07:28:05] <_joe_>	 wait a sec
[07:28:12] <kostajh>	 are there some logs we can look at of the puppet run?
[07:28:25] <_joe_>	 I am trying to figure out what is going on rn
[07:29:19] <wikibugs>	 (03PS1) 10Brouberol: datahub-next: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040874 (https://phabricator.wikimedia.org/T359423)
[07:29:20] <wikibugs>	 (03PS1) 10Brouberol: datahub: replace IPs by Services in network policies [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040875 (https://phabricator.wikimedia.org/T359423)
[07:30:02] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1040872 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[07:30:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1009.eqiad.wmnet
[07:31:12] <_joe_>	 kostajh: ok I got what your mistake is, I misunderstood your original request
[07:31:16] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2011.codfw.wmnet
[07:31:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2011.codfw.wmnet
[07:31:24] <_joe_>	 the enterprise file is under /usr/share/GeoIPInfo
[07:31:26] <_joe_>	 not under 
[07:31:32] <_joe_>	 /usr/share/GeoIP
[07:32:08] <_joe_>	 not sure why we're separating files in those two directories
[07:32:23] <kostajh>	 hmm. On mwmaint I get `ls: cannot access '/usr/share/GeoIPInfo/': No such file or directory`
[07:32:23] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2012.codfw.wmnet
[07:32:43] <_joe_>	 I'm talking inside the container
[07:32:53] <kostajh>	 _joe_: can you see the GeoLite2 files alongside the Enterprise file?
[07:33:04] <_joe_>	 kostajh: yes
[07:33:10] <kostajh>	 alright, thank you
[07:33:16] <kostajh>	 sorry for the confusion
[07:33:27] <_joe_>	 but they're from last week
[07:33:33] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply
[07:33:42] <_joe_>	 not updated today like the enterprise ones
[07:33:48] <kostajh>	 that should be ok
[07:34:02] <kostajh>	 I think
[07:34:15] <kostajh>	 upstream, they are updated twice per week
[07:34:21] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply
[07:34:31] <kostajh>	 but I guess the puppet module is supposed to download them more frequently
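The freshness question raised above (GeoLite2 files dated last week, Enterprise files updated today) can be checked with a short script. The following is a minimal Python sketch, not the tooling used in this conversation: the function names `list_with_age` and `stale_files` and the 4-day threshold are hypothetical, and only the two directory paths come from the log.

```python
import os
import time


def list_with_age(directory, now=None):
    """Return (name, age_in_days) for each regular file in directory, sorted by name."""
    now = time.time() if now is None else now
    entries = []
    for entry in os.scandir(directory):
        if entry.is_file():
            # Age is measured from the file's last modification time.
            age_days = (now - entry.stat().st_mtime) / 86400
            entries.append((entry.name, age_days))
    return sorted(entries)


def stale_files(directory, max_age_days, now=None):
    """Return names of files whose mtime is older than max_age_days."""
    return [name for name, age in list_with_age(directory, now) if age > max_age_days]


if __name__ == "__main__":
    # Paths taken from the conversation; the 4-day threshold is an assumption.
    for path in ("/usr/share/GeoIP", "/usr/share/GeoIPInfo"):
        if not os.path.isdir(path):
            print(f"{path}: not present on this host")
            continue
        for name, age in list_with_age(path):
            print(f"{path}/{name}: {age:.1f} days old")
        print(f"{path}: stale files: {stale_files(path, 4)}")
```

The threshold is loosely based on the twice-per-week upstream cadence mentioned above. Where the check runs matters too: as noted in the conversation, a directory can exist inside the container and be missing on mwmaint.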
[07:34:59] <kostajh>	 _joe_: do you think it's ok to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/1038723/ or do we need to confirm that those files are updating regularly?
[07:35:30] <_joe_>	 kostajh: I think it's ok, but tbh I question the whole approach
[07:35:30] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[07:35:40] <_joe_>	 I'm sorry I wasn't around when this was decided
[07:35:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P64502 and previous config saved to /var/cache/conftool/dbconfig/20240610-073549-root.json
[07:35:54] <_joe_>	 but *imho* it would make sense to have ipoid read the maxmind data
[07:36:12] <_joe_>	 instead of mounting these databases inside mediawiki, which we should stop doing instead of expanding
[07:36:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1009.eqiad.wmnet
[07:36:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1009.eqiad.wmnet
[07:36:50] <kostajh>	 _joe_: I can add that as a proposal in T357753
[07:36:50] <stashbot>	 T357753: Build next iteration of IPoid using OpenSearch as backend - https://phabricator.wikimedia.org/T357753
[07:37:10] <_joe_>	 yeah I think it's quite important
[07:37:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1010.eqiad.wmnet
[07:37:29] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti1010.eqiad.wmnet
[07:37:34] <_joe_>	 I don't know why my team didn't tell you, we've been planning to dismiss the maxmind data inside mediawiki for quite some time :(
[07:37:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1010.eqiad.wmnet
[07:37:53] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[07:38:39] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Revert db2207 with weight 500 T367019', diff saved to https://phabricator.wikimedia.org/P64503 and previous config saved to /var/cache/conftool/dbconfig/20240610-073838-arnaudb.json
[07:38:42] <stashbot>	 T367019: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T367019
[07:39:38] <kostajh>	 _joe_: well, for now it is trying to preserve the status quo, we are just trying to remove references to Enterprise files which will disappear at the end of July
[07:39:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by kharlan@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884) (owner: 10Kosta Harlan)
[07:40:32] <wikibugs>	 (03Merged) 10jenkins-bot: IPInfo: Switch to using GeoLite2 data [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038723 (https://phabricator.wikimedia.org/T361884) (owner: 10Kosta Harlan)
[07:41:11] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 2:00:00 on db2207.codfw.wmnet with reason: maintenance
[07:41:17] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:1038723|IPInfo: Switch to using GeoLite2 data (T361884)]]
[07:41:22] <stashbot>	 T361884: Remove $wgIPInfoGeoIP2EnterprisePath from production config - https://phabricator.wikimedia.org/T361884
[07:41:24] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db2207.codfw.wmnet with reason: maintenance
[07:41:58] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'depool db2207 maintenance', diff saved to https://phabricator.wikimedia.org/P64504 and previous config saved to /var/cache/conftool/dbconfig/20240610-074157-arnaudb.json
[07:43:54] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.mysql.upgrade for db2207.codfw.wmnet
[07:44:05] <icinga-wm_>	 PROBLEM - OpenSearch health check for shards on 9200 on logstash1025 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fbc34695280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.w
[07:44:05] <icinga-wm_>	 org/wiki/Search%23Administration
[07:44:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2012.codfw.wmnet
[07:45:05] <icinga-wm_>	 RECOVERY - OpenSearch health check for shards on 9200 on logstash1025 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 756, active_shards: 1774, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_sha
[07:45:05] <icinga-wm_>	 umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[07:46:59] <icinga-wm_>	 PROBLEM - Host kubetcd2005 is DOWN: PING CRITICAL - Packet loss = 100%
[07:46:59] <icinga-wm_>	 PROBLEM - Host ml-etcd2002 is DOWN: PING CRITICAL - Packet loss = 100%
[07:47:00] <wikibugs>	 (03CR) 10DCausse: "Thanks for working on this!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040211 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer)
[07:47:16] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.mysql.upgrade (exit_code=0) for db2207.codfw.wmnet
[07:48:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1010.eqiad.wmnet
[07:50:16] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/wikifunctions: apply
[07:50:31] <icinga-wm_>	 RECOVERY - Host kubetcd2005 is UP: PING OK - Packet loss = 0%, RTA = 30.54 ms
[07:50:51] <icinga-wm_>	 RECOVERY - Host ml-etcd2002 is UP: PING OK - Packet loss = 0%, RTA = 35.27 ms
[07:50:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db2218 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P64505 and previous config saved to /var/cache/conftool/dbconfig/20240610-075056-root.json
[07:51:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2012.codfw.wmnet
[07:51:32] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2012.codfw.wmnet
[07:53:05] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/wikifunctions: apply
[07:53:25] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/wikifunctions: apply
[07:54:29] <kostajh>	 still deploying
[07:54:35] <kostajh>	 my tmux session vanished :(
[07:54:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1010.eqiad.wmnet
[07:55:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1010.eqiad.wmnet
[07:55:25] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 10%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64506 and previous config saved to /var/cache/conftool/dbconfig/20240610-075524-arnaudb.json
[07:56:03] <kostajh>	 `tmux ls` shows no session. And if I try `scap backport` again, I see `07:55:23 backport is locked by kharlan`. Amir1 urbanecm how should I proceed?
[07:56:36] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/wikifunctions: apply
[07:57:21] <urbanecm>	 kostajh: there is a process under your account running
[07:57:26] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "> - the limit is not configurable and is 1000rps per UA" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040211 (https://phabricator.wikimedia.org/T362310) (owner: 10Peter Fischer)
[07:57:41] <urbanecm>	 `ps aux | grep scap` shows some
[07:57:43] <logmsgbot>	 !log kharlan@deploy1002 kharlan: Backport for [[gerrit:1038723|IPInfo: Switch to using GeoLite2 data (T361884)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[07:57:47] <stashbot>	 T361884: Remove $wgIPInfoGeoIP2EnterprisePath from production config - https://phabricator.wikimedia.org/T361884
[07:57:50] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host ping1004.eqiad.wmnet with OS bookworm
[07:57:56] <wikibugs>	 06SRE, 06Infrastructure-Foundations: Move the ping* servers to Bookworm - https://phabricator.wikimedia.org/T366695#9874387 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jmm@cumin2002 for host ping1004.eqiad.wmnet with OS bookworm
[07:58:04] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "looks mostly good and thanks for the answers. I think a proper hiera lookup for the configure-projects-bot api token is needed. Let me know" [puppet] - 10https://gerrit.wikimedia.org/r/1035820 (https://phabricator.wikimedia.org/T355097) (owner: 10Brennen Bearnes)
[07:58:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1011.eqiad.wmnet
[07:58:24] <urbanecm>	 kostajh: at this point, it should be waiting for your response, so it sounds like a good idea to kill it and start over?
[07:58:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2013.codfw.wmnet
[07:58:56] <kostajh>	 urbanecm: I tried "kill" but now I get a message about another lock
[07:59:02] <kostajh>	 `07:58:45 concurrent prep is locked by kharlan (pid 29297) on Mon Jun 10 07:41:17 2024`
[07:59:09] <kostajh>	 urbanecm: so remove that process as well?
[07:59:20] <urbanecm>	 kostajh: i'd kill the parent (29297)
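The advice above (look at `ps aux | grep scap`, then kill the parent rather than a child) can be sketched as a small helper that finds the top-most matching processes for a user. This is an illustrative Python sketch only, not part of scap: the function names `parse_ps`, `matching_roots`, and `current_processes` are hypothetical, and it assumes a `ps` that accepts `-eo pid,ppid,user,args`.

```python
import subprocess


def parse_ps(output):
    """Parse `ps -eo pid,ppid,user,args` output into a list of dicts."""
    procs = []
    for line in output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        pid, ppid, user, args = parts
        procs.append({"pid": int(pid), "ppid": int(ppid), "user": user, "args": args})
    return procs


def matching_roots(procs, keyword, user):
    """Return PIDs of matching processes whose parent is not itself a match."""
    hits = {p["pid"]: p for p in procs if keyword in p["args"] and p["user"] == user}
    return sorted(pid for pid, p in hits.items() if p["ppid"] not in hits)


def current_processes():
    """Snapshot of running processes (assumes `ps` is on PATH)."""
    out = subprocess.run(
        ["ps", "-eo", "pid,ppid,user,args"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_ps(out)
```

Calling `matching_roots(current_processes(), "scap", "kharlan")` would list candidate parent PIDs to inspect before killing anything. Whether killing that parent also releases the scap locks is not something this sketch checks.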
[07:59:31] <taavi>	 urbanecm: kostajh: can you ping me when you're done deploying?
[07:59:37] <kostajh>	 yeah
[07:59:43] * urbanecm is not deploying anything
[07:59:46] <logmsgbot>	 !log kharlan@deploy1002 Started scap: Backport for [[gerrit:1038723|IPInfo: Switch to using GeoLite2 data (T361884)]]
[07:59:50] <kostajh>	 taavi: will do. _joe_ is also waiting to hear when I'm done.
[08:00:05] <jouncebot>	 hashar: Deploy window Gerrit upgrade (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T0800)
[08:00:11] <kostajh>	 hashar: still finishing up the backport
[08:00:22] <hashar>	 ^ I will do it once the backports have been completed
[08:00:23] <hashar>	 no rush
[08:02:54] <logmsgbot>	 !log kharlan@deploy1002 kharlan: Backport for [[gerrit:1038723|IPInfo: Switch to using GeoLite2 data (T361884)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:03:00] <stashbot>	 T361884: Remove $wgIPInfoGeoIP2EnterprisePath from production config - https://phabricator.wikimedia.org/T361884
[08:03:08] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit2002.wikimedia.org with reason: Gerrit upgrade
[08:03:22] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit2002.wikimedia.org with reason: Gerrit upgrade
[08:03:33] <logmsgbot>	 !log jelto@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on gerrit1003.wikimedia.org with reason: Gerrit upgrade
[08:03:47] <logmsgbot>	 !log jelto@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gerrit1003.wikimedia.org with reason: Gerrit upgrade
[08:04:58] <logmsgbot>	 !log kharlan@deploy1002 kharlan: Continuing with sync
[08:09:49] <wikibugs>	 (03CR) 10Klausman: [C:03+1] admin_ng: update Bookworm-based Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040153 (https://phabricator.wikimedia.org/T356252) (owner: 10Elukey)
[08:10:30] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 25%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64507 and previous config saved to /var/cache/conftool/dbconfig/20240610-081030-arnaudb.json
[08:13:54] <logmsgbot>	 !log kharlan@deploy1002 Finished scap: Backport for [[gerrit:1038723|IPInfo: Switch to using GeoLite2 data (T361884)]] (duration: 14m 07s)
[08:13:58] <stashbot>	 T361884: Remove $wgIPInfoGeoIP2EnterprisePath from production config - https://phabricator.wikimedia.org/T361884
[08:14:15] <kostajh>	 !log UTC morning deploys done
[08:14:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:14:31] <kostajh>	 _joe_ taavi hashar I am done with backporting. 
[08:15:00] <kostajh>	 please coordinate with each other as to who goes next :) 
[08:17:45] <taavi>	 I think _joe_ was first :-)
[08:17:49] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on ping1004.eqiad.wmnet with reason: host reimage
[08:17:57] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1011.eqiad.wmnet
[08:18:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2013.codfw.wmnet
[08:19:13] <hashar>	 I am upgrading Gerrit
[08:19:27] <wikibugs>	 (03CR) 10Hashar: [C:03+2] Merge branch 'deploy/wmf/stable-3.8' into deploy/wmf/stable-3.9 [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1039201 (owner: 10Hashar)
[08:19:31] <wikibugs>	 (03CR) 10Hashar: [C:03+2] Gerrit 3.9.5, rebuild plugins and update TypeScript API [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1039610 (https://phabricator.wikimedia.org/T354887) (owner: 10Hashar)
[08:20:01] <wikibugs>	 (03Merged) 10jenkins-bot: Merge branch 'deploy/wmf/stable-3.8' into deploy/wmf/stable-3.9 [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1039201 (owner: 10Hashar)
[08:20:02] <wikibugs>	 (03Merged) 10jenkins-bot: Gerrit 3.9.5, rebuild plugins and update TypeScript API [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1039610 (https://phabricator.wikimedia.org/T354887) (owner: 10Hashar)
[08:21:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ping1004.eqiad.wmnet with reason: host reimage
[08:21:36] <wikibugs>	 (03CR) 10JMeybohm: "This is what I had in mind as well. 30min does seem a good choice I'd say, given this is more like a "hey, something is off" then "I'm on " [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French)
[08:22:15] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@092aade]: Gerrit to version 3.9.5 on gerrit2002 - T354887
[08:22:22] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@092aade]: Gerrit to version 3.9.5 on gerrit2002 - T354887 (duration: 00m 07s)
[08:23:35] <wikibugs>	 (03PS3) 10Brouberol: deployment_server: alert on admin-ng pending changes [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894)
[08:24:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1011.eqiad.wmnet
[08:24:21] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2013.codfw.wmnet
[08:24:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1011.eqiad.wmnet
[08:24:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2013.codfw.wmnet
[08:25:00] <wikibugs>	 (03Abandoned) 10Volans: Disable Accounting report alerting until bug is fixed [puppet] - 10https://gerrit.wikimedia.org/r/1039978 (https://phabricator.wikimedia.org/T366874) (owner: 10Ayounsi)
[08:25:01] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1012.eqiad.wmnet
[08:25:37] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 50%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64508 and previous config saved to /var/cache/conftool/dbconfig/20240610-082536-arnaudb.json
[08:25:58] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2014.codfw.wmnet
[08:26:52] <hashar>	 I am doing the primary Gerrit now
[08:26:59] <logmsgbot>	 !log hashar@deploy1002 Started deploy [gerrit/gerrit@092aade]: Gerrit to version 3.9.5 on gerrit1003 - T354887
[08:27:05] <logmsgbot>	 !log hashar@deploy1002 Finished deploy [gerrit/gerrit@092aade]: Gerrit to version 3.9.5 on gerrit1003 - T354887 (duration: 00m 05s)
[08:30:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T364069)', diff saved to https://phabricator.wikimedia.org/P64509 and previous config saved to /var/cache/conftool/dbconfig/20240610-083042-marostegui.json
[08:32:17] <hashar>	 !log Gerrit has been upgraded
[08:33:20] <kostajh>	 hashar: I can't seem to add comments to patches. I've done a hard refresh of the page
[08:33:34] <kostajh>	 *inline comments to files on patches, that is
[08:33:59] <wikibugs>	 (03PS1) 10Jelto: gitlab:: add blackbox check for ssh service [puppet] - 10https://gerrit.wikimedia.org/r/1041028 (https://phabricator.wikimedia.org/T367021)
[08:34:47] <hashar>	 kostajh: my guess would be some cache is not in sync and some javascript is lost
[08:35:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1012.eqiad.wmnet
[08:35:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2014.codfw.wmnet
[08:35:44] <hashar>	 kostajh: that worked on https://gerrit.wikimedia.org/r/c/test/gerrit-ping/+/998940
[08:35:47] <hashar>	 anything in the console?
[08:36:22] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ping1004.eqiad.wmnet with OS bookworm
[08:36:42] <kostajh>	 hashar: it works on a commit message
[08:36:49] <kostajh>	 but not here https://gerrit.wikimedia.org/r/c/mediawiki/core/+/1025719/3/composer.json#181
[08:36:59] <jelto>	 I can add a comment to patches: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1037065/3#message-02c8decf3bbbe9e6a60fdc64bee00418ba48a811
[08:37:06] <icinga-wm_>	 PROBLEM - Host dse-k8s-etcd1003 is DOWN: PING CRITICAL - Packet loss = 100%
[08:37:06] <icinga-wm_>	 PROBLEM - Host kubestagemaster2004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:37:28] <kostajh>	 hashar: the only browser warnings are about font downloads
[08:37:34] <icinga-wm_>	 PROBLEM - Host ml-etcd1002 is DOWN: PING CRITICAL - Packet loss = 100%
[08:37:36] <icinga-wm_>	 PROBLEM - Host kubetcd1004 is DOWN: PING CRITICAL - Packet loss = 100%
[08:38:08] <kostajh>	 hashar / jelto bah, works on Chrome, not on Firefox. Let me try Firefox without plugins
[08:38:28] <jelto>	 I'm on firefox
[08:38:45] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service ganeti2013:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:38:47] <jelto>	 maybe try clearing your cache for gerrit?
[08:38:49] <hashar>	 I too ( 115.10.0esr from Debian )
[08:39:33] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4048.ulsfo.wmnet
[08:39:44] <wikibugs>	 (03CR) 10Muehlenhoff: "Dummy comment to test Gerrit after update" [puppet] - 10https://gerrit.wikimedia.org/r/1040873 (https://phabricator.wikimedia.org/T334415) (owner: 10Muehlenhoff)
[08:39:54] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp4048.ulsfo.wmnet
[08:40:30] <icinga-wm_>	 RECOVERY - Host dse-k8s-etcd1003 is UP: PING OK - Packet loss = 0%, RTA = 0.41 ms
[08:40:36] <icinga-wm_>	 RECOVERY - Host ml-etcd1002 is UP: PING OK - Packet loss = 0%, RTA = 5.51 ms
[08:40:36] <moritzm>	 JFTR, works for me as well (firefox, no plugins other than WikimediaDebug)
[08:40:43] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 75%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64510 and previous config saved to /var/cache/conftool/dbconfig/20240610-084042-arnaudb.json
[08:40:49] <moritzm>	 also 115.10 from Debian
[08:40:50] <icinga-wm_>	 RECOVERY - Host kubestagemaster2004 is UP: PING OK - Packet loss = 0%, RTA = 30.59 ms
[08:41:06] <icinga-wm_>	 RECOVERY - Host kubetcd1004 is UP: PING OK - Packet loss = 0%, RTA = 0.67 ms
[08:41:12] <kostajh>	 I'm on Firefox nightly (128.0a1)
[08:41:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1012.eqiad.wmnet
[08:41:28] <kostajh>	 Commenting doesn't work in safe mode (extensions/add-ons disabled)
[08:41:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1012.eqiad.wmnet
[08:41:42] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2014.codfw.wmnet
[08:41:48] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2014.codfw.wmnet
[08:42:40] <kostajh>	 On macOS, the firefox version is 126, and commenting doesn't work with that version either
[08:43:45] <jinxer-wm>	 RESOLVED: [4x] ProbeDown: Service ganeti1012:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:44:14] <wikibugs>	 (03PS2) 10Jelto: gitlab:: add blackbox check for ssh service [puppet] - 10https://gerrit.wikimedia.org/r/1041028 (https://phabricator.wikimedia.org/T367021)
[08:45:41] <kostajh>	 hashar: I have cleared the cache for gerrit on Firefox nightly, and inline commenting still doesn't work
[08:45:46] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: netbox_report_accounting_run.service on netbox1002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:45:51] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P64511 and previous config saved to /var/cache/conftool/dbconfig/20240610-084550-marostegui.json
[08:45:51] <hashar>	 nightly?
[08:46:04] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2015.codfw.wmnet
[08:46:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1013.eqiad.wmnet
[08:46:38] <taavi>	 am I good to deploy my config patch now?
[08:46:39] <kostajh>	 hashar: tested on nightly (128) and stable (126). 
[08:46:40] <jinxer-wm>	 FIRING: KubernetesRsyslogDown: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:46:40] <hashar>	 kostajh: is there anything showing up in the browser console?
[08:47:33] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.dns.netbox
[08:48:46] <kostajh>	 hashar: T367029
[08:48:48] <stashbot>	 T367029: Inline commenting doesn't work on Gerrit 3.9 with Firefox on macOS - https://phabricator.wikimedia.org/T367029
[08:48:51] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4048.ulsfo.wmnet
[08:48:54] <wikibugs>	 (03PS1) 10Muehlenhoff: Change ping host in codfw to ping2004 [homer/public] - 10https://gerrit.wikimedia.org/r/1041030 (https://phabricator.wikimedia.org/T366695)
[08:50:12] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new entries for cr2-codfw peering to ssw1-d8-codfw - cmooney@cumin1002"
[08:50:58] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add new entries for cr2-codfw peering to ssw1-d8-codfw - cmooney@cumin1002"
[08:50:58] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[08:53:46] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2015.codfw.wmnet
[08:53:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1013.eqiad.wmnet
[08:54:23] <godog>	 !log upgrade prometheus-statsd-exporter on webperf - T302373
[08:54:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:54:27] <stashbot>	 T302373: Upgrade prometheus-statsd-exporter - https://phabricator.wikimedia.org/T302373
[08:55:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2207 (re)pooling @ 100%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64512 and previous config saved to /var/cache/conftool/dbconfig/20240610-085548-arnaudb.json
[08:56:40] <jinxer-wm>	 RESOLVED: KubernetesRsyslogDown: rsyslog on kubestage2001:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=kubestage2001 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown
[08:56:45] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 1:00:00 on 26 hosts with reason: Primary switchover s2 T367019
[08:56:49] <stashbot>	 T367019: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T367019
[08:57:08] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 26 hosts with reason: Primary switchover s2 T367019
[08:57:21] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Set db2207 with weight 0 T367019', diff saved to https://phabricator.wikimedia.org/P64513 and previous config saved to /var/cache/conftool/dbconfig/20240610-085721-arnaudb.json
[08:58:32] <icinga-wm_>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[09:00:01] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1013.eqiad.wmnet
[09:00:04] <hashar>	 kostajh: so previously one could double click to add a comment below? 
[09:00:33] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1013.eqiad.wmnet
[09:00:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168', diff saved to https://phabricator.wikimedia.org/P64514 and previous config saved to /var/cache/conftool/dbconfig/20240610-090058-marostegui.json
[09:01:15] <godog>	 !log upload prometheus-statsd-exporter 0.26.1-1 to apt - T302373
[09:01:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:18] <stashbot>	 T302373: Upgrade prometheus-statsd-exporter - https://phabricator.wikimedia.org/T302373
[09:01:45] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2015.codfw.wmnet
[09:01:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2015.codfw.wmnet
[09:03:22] <_joe_>	 oh sorry folks I went to do other stuff and decided to deploy in the infra window
[09:03:28] <_joe_>	 given mine is an infra change 
[09:06:31] <wikibugs>	 (03PS2) 10Volans: reports: accounting, support swapped motherboards [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1037538 (https://phabricator.wikimedia.org/T358542)
[09:07:52] <wikibugs>	 (03PS3) 10Volans: reports: accounting, support swapped motherboards [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1037538 (https://phabricator.wikimedia.org/T358542)
[09:08:23] <wikibugs>	 (03CR) 10Volans: "tested on netbox-next:" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1037538 (https://phabricator.wikimedia.org/T358542) (owner: 10Volans)
[09:09:33] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-tools, 06DC-Ops, and 2 others: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9874533 (10Volans) I've also manually fixed a bunch of warnings due to a clearly mistyped phabricator task number in the spreadsheet. The patch has been test...
[09:13:01] <wikibugs>	 (03CR) 10Arnaudb: [C:03+2] mariadb: Promote db2207 to s2 master [puppet] - 10https://gerrit.wikimedia.org/r/1039605 (https://phabricator.wikimedia.org/T367019) (owner: 10Gerrit maintenance bot)
[09:13:17] <wikibugs>	 (03CR) 10JMeybohm: deployment_server: alert on admin-ng pending changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[09:14:23] <arnaudb>	 !log Starting s2 codfw failover from db2204 to db2207 - T367019
[09:14:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:14:26] <stashbot>	 T367019: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T367019
[09:14:31] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+1] gitlab:: add blackbox check for ssh service [puppet] - 10https://gerrit.wikimedia.org/r/1041028 (https://phabricator.wikimedia.org/T367021) (owner: 10Jelto)
[09:15:07] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Promote db2207 to s2 primary T367019', diff saved to https://phabricator.wikimedia.org/P64515 and previous config saved to /var/cache/conftool/dbconfig/20240610-091506-arnaudb.json
[09:16:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1168 (T364069)', diff saved to https://phabricator.wikimedia.org/P64516 and previous config saved to /var/cache/conftool/dbconfig/20240610-091606-marostegui.json
[09:16:10] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[09:16:13] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[09:16:23] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance
[09:16:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1173 (T364069)', diff saved to https://phabricator.wikimedia.org/P64517 and previous config saved to /var/cache/conftool/dbconfig/20240610-091631-marostegui.json
[09:17:02] <wikibugs>	 (03CR) 10Volans: [C:03+2] "Self-merging as the diffs from PS1 to PS3 are trivial typos" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1037538 (https://phabricator.wikimedia.org/T358542) (owner: 10Volans)
[09:17:49] <wikibugs>	 (03Merged) 10jenkins-bot: reports: accounting, support swapped motherboards [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/1037538 (https://phabricator.wikimedia.org/T358542) (owner: 10Volans)
[09:20:48] <taavi>	 jouncebot: nownadnext
[09:20:54] <taavi>	 jouncebot: nowandnext
[09:20:54] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 39 minute(s)
[09:20:54] <jouncebot>	 In 0 hour(s) and 39 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1000)
[09:21:47] <wikibugs>	 (03PS1) 10Hnowlan: fonts: add opendyslexic [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1041033 (https://phabricator.wikimedia.org/T285650)
[09:21:50] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Approved by taavi@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1040222 (owner: 10Majavah)
[09:22:07] <logmsgbot>	 !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox-canary
[09:22:10] <logmsgbot>	 !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox-canary
[09:22:26] <wikibugs>	 (03Merged) 10jenkins-bot: Reapply "wikitech: Replace OSM class in Gerrit blocking hook" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1040222 (owner: 10Majavah)
[09:22:45] <logmsgbot>	 !log taavi@deploy1002 Started scap: Backport for [[gerrit:1040222|Reapply "wikitech: Replace OSM class in Gerrit blocking hook"]]
[09:24:37] <logmsgbot>	 !log volans@cumin1002 START - Cookbook sre.netbox.update-extras rolling restart_daemons on A:netbox
[09:24:43] <logmsgbot>	 !log volans@cumin1002 END (PASS) - Cookbook sre.netbox.update-extras (exit_code=0) rolling restart_daemons on A:netbox
[09:25:02] <logmsgbot>	 !log taavi@deploy1002 taavi: Backport for [[gerrit:1040222|Reapply "wikitech: Replace OSM class in Gerrit blocking hook"]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[09:25:13] <logmsgbot>	 !log taavi@deploy1002 taavi: Continuing with sync
[09:25:26] <taavi>	 (no way to test wikitech changes on mwdebug :/)
[09:26:45] <jinxer-wm>	 FIRING: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[09:30:02] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-tools, 06DC-Ops, and 2 others: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9874557 (10Volans) 05Open→03Resolved This is now completed. The new runs are not alerting for these hosts with replaced motherboards.  @wiki_willy co...
[09:33:14] <wikibugs>	 (03CR) 10JMeybohm: k8s: send logs to per-cluster kafka topics (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1040170 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi)
[09:34:02] <logmsgbot>	 !log taavi@deploy1002 Finished scap: Backport for [[gerrit:1040222|Reapply "wikitech: Replace OSM class in Gerrit blocking hook"]] (duration: 11m 17s)
[09:34:42] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[09:35:50] <wikibugs>	 (03PS1) 10Brouberol: datahub: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978)
[09:36:16] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] thumbor: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037166 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[09:37:07] <godog>	 !log roll upgrade prometheus-statsd-exporter to baremetal - T302373
[09:37:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:37:11] <stashbot>	 T302373: Upgrade prometheus-statsd-exporter - https://phabricator.wikimedia.org/T302373
[09:37:33] <wikibugs>	 (03CR) 10JMeybohm: [C:03+2] thumbor: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037166 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[09:38:36] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037166 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[09:47:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[09:47:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[09:47:51] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4048.ulsfo.wmnet
[09:49:38] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2829/console" [puppet] - 10https://gerrit.wikimedia.org/r/1041028 (https://phabricator.wikimedia.org/T367021) (owner: 10Jelto)
[09:50:17] <wikibugs>	 (03PS1) 10Filippo Giunchedi: statsd-exporter: bump version to upgrade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1041037
[09:51:35] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+1] "nitpick on the version number, otherwise LGTM" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1041037 (owner: 10Filippo Giunchedi)
[09:53:16] <wikibugs>	 (03CR) 10Jelto: [V:03+1] "PCC SUCCESS (DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2830/" [puppet] - 10https://gerrit.wikimedia.org/r/1041028 (https://phabricator.wikimedia.org/T367021) (owner: 10Jelto)
[09:53:28] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply
[09:53:52] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[09:54:01] <wikibugs>	 (03PS2) 10Brouberol: datahub: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978)
[09:54:25] <wikibugs>	 (03PS2) 10Majavah: P:openstack: opentofu: fix configuration [puppet] - 10https://gerrit.wikimedia.org/r/1040145
[09:54:34] <wikibugs>	 (03PS2) 10Filippo Giunchedi: statsd-exporter: bump version to upgrade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1041037
[09:54:40] <wikibugs>	 (03CR) 10CI reject: [V:04-1] datahub: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol)
[09:54:50] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on 870 hosts with reason: Issue from T367019
[09:54:51] <logmsgbot>	 !log arnaudb@cumin1002 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) for 5:00:00 on 870 hosts with reason: Issue from T367019
[09:54:54] <stashbot>	 T367019: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T367019
[09:55:05] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] "Thank you, fixed the comments" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1041037 (owner: 10Filippo Giunchedi)
[09:55:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+2 C:03+2] statsd-exporter: bump version to upgrade [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1041037 (owner: 10Filippo Giunchedi)
[09:56:50] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s2 on db2138 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2526.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:56:50] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s2 on db2148 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2526.60 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:56:50] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s2 on db2175 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2527.19 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:56:56] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s2 on db2189 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2534.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:57:03] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] depool text@drmrs before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1039944 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur)
[09:57:24] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 5:00:00 on 26 hosts with reason: Issue from T367019
[09:57:26] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s2 on db2126 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2562.56 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:57:30] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s2 on db2125 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2566.52 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:57:30] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s2 on db2204 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2568.10 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:57:30] <icinga-wm_>	 PROBLEM - MariaDB Replica Lag: s2 on db2207 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 2568.24 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[09:57:39] <wikibugs>	 (03PS1) 10JMeybohm: developer-portal: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041039 (https://phabricator.wikimedia.org/T362978)
[09:57:46] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 5:00:00 on 26 hosts with reason: Issue from T367019
[09:57:49] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[09:58:06] <wikibugs>	 (03CR) 10Vgutierrez: [C:04-1] hiera: enable IPIP for high-traffic1@drmrs for text services (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1039947 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur)
[09:58:20] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:openstack: opentofu: fix configuration [puppet] - 10https://gerrit.wikimedia.org/r/1040145 (owner: 10Majavah)
[09:58:25] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] cache:hiera: enable IPIP on text@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1039948 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur)
[09:59:12] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Netbox network report failing - timeout errors - https://phabricator.wikimedia.org/T321704#9874622 (10Volans) I took a look today, and when manually running all the tests, none of them takes long enough to trigger the 300s timeout,...
[09:59:16] <wikibugs>	 (03CR) 10Jelto: [V:03+1 C:03+2] gitlab:: add blackbox check for ssh service [puppet] - 10https://gerrit.wikimedia.org/r/1041028 (https://phabricator.wikimedia.org/T367021) (owner: 10Jelto)
[09:59:21] <wikibugs>	 (03PS1) 10Arnaudb: depool: codfw [dns] - 10https://gerrit.wikimedia.org/r/1041041 (https://phabricator.wikimedia.org/T367019)
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1000)
[10:01:08] <logmsgbot>	 !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[10:01:47] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=mw-web-ro,name=codfw
[10:01:56] <wikibugs>	 (03Abandoned) 10Arnaudb: depool: codfw [dns] - 10https://gerrit.wikimedia.org/r/1041041 (https://phabricator.wikimedia.org/T367019) (owner: 10Arnaudb)
[10:01:58] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=mw-api-ext-ro,name=codfw
[10:02:26] <wikibugs>	 (03PS3) 10JMeybohm: toolhub: ensure all containers have securityContext [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037165 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[10:02:26] <wikibugs>	 (03CR) 10JMeybohm: [C:03+1] "Let's not wait. I think we're good to go here" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1037165 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[10:02:40] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[10:02:49] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=mw-api-int-ro,name=codfw
[10:04:25] <wikibugs>	 (03PS1) 10Majavah: P:openstack: opentofu: Allow everyone to enter the directory [puppet] - 10https://gerrit.wikimedia.org/r/1041042 (https://phabricator.wikimedia.org/T364458)
[10:04:26] <wikibugs>	 (03PS1) 10Majavah: P:openstack: opentofu: Do not log changes to the env file [puppet] - 10https://gerrit.wikimedia.org/r/1041043 (https://phabricator.wikimedia.org/T364458)
[10:05:01] <wikibugs>	 (03PS1) 10GergesShamon: [huwiki] Add "suppressredirect" user right to editor user group [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041044 (https://phabricator.wikimedia.org/T366438)
[10:05:31] <logmsgbot>	 !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[10:05:48] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1014.eqiad.wmnet
[10:06:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2016.codfw.wmnet
[10:06:24] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1041043 (https://phabricator.wikimedia.org/T364458) (owner: 10Majavah)
[10:07:45] <jinxer-wm>	 RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[10:07:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[10:07:57] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=appservers-ro,name=codfw
[10:08:07] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/pooled=false; selector: dnsdisc=api-ro,name=codfw
[10:08:32] <claime>	 !log depooled all active/active mediawiki services from codfw
[10:08:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:02] <wikibugs>	 (03PS3) 10Brouberol: datahub: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978)
[10:09:19] <wikibugs>	 (03PS2) 10Fabfur: hiera: enable IPIP for high-traffic1@drmrs for text services [puppet] - 10https://gerrit.wikimedia.org/r/1039947 (https://phabricator.wikimedia.org/T366466)
[10:09:19] <wikibugs>	 (03PS2) 10Fabfur: cache:hiera: enable IPIP on text@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1039948 (https://phabricator.wikimedia.org/T366466)
[10:09:24] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s2 on db2126 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:09:28] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s2 on db2125 is OK: OK slave_sql_lag Replication lag: 0.30 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:09:30] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s2 on db2204 is OK: OK slave_sql_lag Replication lag: 0.39 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:09:50] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s2 on db2138 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:09:50] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s2 on db2148 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:09:50] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s2 on db2175 is OK: OK slave_sql_lag Replication lag: 0.42 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:09:56] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s2 on db2189 is OK: OK slave_sql_lag Replication lag: 0.49 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:10:33] <wikibugs>	 (03CR) 10Fabfur: hiera: enable IPIP for high-traffic1@drmrs for text services (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1039947 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur)
[10:11:43] <wikibugs>	 (03PS2) 10Majavah: P:openstack: opentofu: Allow everyone to enter the directory [puppet] - 10https://gerrit.wikimedia.org/r/1041042 (https://phabricator.wikimedia.org/T365696)
[10:11:44] <wikibugs>	 (03PS2) 10Majavah: P:openstack: opentofu: Do not log changes to the env file [puppet] - 10https://gerrit.wikimedia.org/r/1041043 (https://phabricator.wikimedia.org/T365696)
[10:11:44] <wikibugs>	 (03PS1) 10Majavah: P:openstack: opentofu: Add a variable for region [puppet] - 10https://gerrit.wikimedia.org/r/1041045 (https://phabricator.wikimedia.org/T365696)
[10:11:45] <jinxer-wm>	 RESOLVED: CirrusConsumerFetchErrorRate: cirrus_streaming_updater_consumer_search_codfw in codfw (k8s): fetch error rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=codfw+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-search - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerFetchErrorRate
[10:11:45] <wikibugs>	 (03PS1) 10MVernon: wmflib: add Wmflib::IP::Address::CIDR type [puppet] - 10https://gerrit.wikimedia.org/r/1041046 (https://phabricator.wikimedia.org/T279621)
[10:13:30] <icinga-wm_>	 RECOVERY - MariaDB Replica Lag: s2 on db2207 is OK: OK slave_sql_lag Replication lag: 0.40 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[10:13:36] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 1 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1041045 (https://phabricator.wikimedia.org/T365696) (owner: 10Majavah)
[10:15:55] <wikibugs>	 (03CR) 10Vgutierrez: [C:03+1] hiera: enable IPIP for high-traffic1@drmrs for text services [puppet] - 10https://gerrit.wikimedia.org/r/1039947 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur)
[10:17:42] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1014.eqiad.wmnet
[10:18:35] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] deployment_server: alert on admin-ng pending changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[10:18:43] <jinxer-wm>	 FIRING: ProbeDown: Service gitlab2002:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gitlab2002:22 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:18:58] <wikibugs>	 (03CR) 10Btullis: datahub: add securityContext to all containers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol)
[10:19:29] <wikibugs>	 (03PS1) 10JMeybohm: linkrecommendation: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041049 (https://phabricator.wikimedia.org/T362978)
[10:19:38] <icinga-wm_>	 PROBLEM - Host ml-etcd1001 is DOWN: PING CRITICAL - Packet loss = 100%
[10:21:01] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=mw-web-ro,name=codfw
[10:21:09] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=mw-api-ext-ro,name=codfw
[10:21:15] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=mw-api-int-ro,name=codfw
[10:21:22] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=appservers-ro,name=codfw
[10:21:22] <icinga-wm_>	 PROBLEM - SSH on dse-k8s-etcd1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:21:29] <logmsgbot>	 !log cgoubert@cumin1002 conftool action : set/pooled=true; selector: dnsdisc=api-ro,name=codfw
[10:21:42] <claime>	 !log repooled all active/active mediawiki services from codfw
[10:21:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:22:13] <wikibugs>	 (03PS4) 10Brouberol: datahub: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978)
[10:22:50] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C:03+2] ml-services: use envoy proxy in ores-legacy staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040196 (https://phabricator.wikimedia.org/T366801) (owner: 10Ilias Sarantopoulos)
[10:22:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] datahub: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol)
[10:23:44] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1014.eqiad.wmnet
[10:23:52] <wikibugs>	 (03PS5) 10Brouberol: datahub: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978)
[10:24:03] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: use envoy proxy in ores-legacy staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040196 (https://phabricator.wikimedia.org/T366801) (owner: 10Ilias Sarantopoulos)
[10:24:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1014.eqiad.wmnet
[10:24:29] <wikibugs>	 (03CR) 10CI reject: [V:04-1] datahub: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol)
[10:25:12] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'Depool db2204 T367019', diff saved to https://phabricator.wikimedia.org/P64518 and previous config saved to /var/cache/conftool/dbconfig/20240610-102511-arnaudb.json
[10:25:15] <stashbot>	 T367019: Switchover s2 master (db2204 -> db2207) - https://phabricator.wikimedia.org/T367019
[10:25:23] <logmsgbot>	 !log isaranto@deploy1002 helmfile [ml-staging-codfw] 'sync' command on namespace 'ores-legacy' for release 'main' .
[10:25:32] <icinga-wm_>	 RECOVERY - Host ml-etcd1001 is UP: PING OK - Packet loss = 0%, RTA = 0.54 ms
[10:26:07] <wikibugs>	 (03CR) 10Majavah: [C:03+1] wmflib: add Wmflib::IP::Address::CIDR type [puppet] - 10https://gerrit.wikimedia.org/r/1041046 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[10:26:14] <icinga-wm_>	 RECOVERY - SSH on dse-k8s-etcd1002 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[10:26:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2016.codfw.wmnet
[10:27:09] <logmsgbot>	 !log arnaudb@cumin1002 START - Cookbook sre.hosts.downtime for 12:00:00 on db2204.codfw.wmnet with reason: Maintenance
[10:27:11] <logmsgbot>	 !log arnaudb@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db2204.codfw.wmnet with reason: Maintenance
[10:28:30] <icinga-wm_>	 PROBLEM - Host ml-etcd2003 is DOWN: PING CRITICAL - Packet loss = 100%
[10:31:11] <wikibugs>	 (03CR) 10Btullis: deployment_server: alert on admin-ng pending changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[10:34:35] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2016.codfw.wmnet
[10:34:43] <wikibugs>	 (03PS1) 10Jelto: gitlab: use IPv4 for SSH check temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1041051 (https://phabricator.wikimedia.org/T367021)
[10:34:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2016.codfw.wmnet
[10:35:30] <icinga-wm_>	 RECOVERY - Host ml-etcd2003 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms
[10:38:53] <wikibugs>	 (03CR) 10EoghanGaffney: [C:03+1] gitlab: use IPv4 for SSH check temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1041051 (https://phabricator.wikimedia.org/T367021) (owner: 10Jelto)
[10:39:31] <wikibugs>	 (03CR) 10Jelto: [C:03+2] gitlab: use IPv4 for SSH check temporarily [puppet] - 10https://gerrit.wikimedia.org/r/1041051 (https://phabricator.wikimedia.org/T367021) (owner: 10Jelto)
[10:40:31] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] depool text@drmrs before enabling IPIP encapsulation [dns] - 10https://gerrit.wikimedia.org/r/1039944 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur)
[10:41:13] <fabfur>	 !log depooling text@drmrs to apply IPIP encapsulation patches (T366466)
[10:41:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:16] <stashbot>	 T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466
[10:41:28] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudgw1002.eqiad.wmnet
[10:43:03] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 1%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64519 and previous config saved to /var/cache/conftool/dbconfig/20240610-104303-arnaudb.json
[10:45:03] <wikibugs>	 (03PS4) 10Giuseppe Lavagetto: Improve ability to override php-fpm configuration in kubernetes [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1039642
[10:45:29] <wikibugs>	 (03PS1) 10JMeybohm: machinetranslation: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041055 (https://phabricator.wikimedia.org/T362978)
[10:46:45] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] Improve ability to override php-fpm configuration in kubernetes [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1039642 (owner: 10Giuseppe Lavagetto)
[10:47:17] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1002.eqiad.wmnet
[10:48:26] <_joe_>	 jouncebot: nowandnext
[10:48:26] <jouncebot>	 For the next 0 hour(s) and 11 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1000)
[10:48:26] <jouncebot>	 In 2 hour(s) and 11 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1300)
[10:49:00] <_joe_>	 I will probably need to run a bit past the end of my deployment window
[10:53:54] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.hosts.reboot-single for host cloudgw1001.eqiad.wmnet
[10:54:40] <fabfur>	 !log disabling puppet on A:cp-text to enable https://gerrit.wikimedia.org/r/c/operations/puppet/+/1039948 selectively (T366466)
[10:55:58] <_joe_>	 !log published updated php-fpm-multiversion-base,prometheus-statsd-exporter images
[10:57:01] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudgw1001.eqiad.wmnet
[10:57:21] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] cache:hiera: enable IPIP on text@drmrs [puppet] - 10https://gerrit.wikimedia.org/r/1039948 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur)
[10:58:04] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] service: set similar-users to lvs_setup [puppet] - 10https://gerrit.wikimedia.org/r/1014499 (https://phabricator.wikimedia.org/T345274) (owner: 10Hnowlan)
[10:58:09] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 2%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64520 and previous config saved to /var/cache/conftool/dbconfig/20240610-105809-arnaudb.json
[10:58:41] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: common_images: update statsd-exporter version [puppet] - 10https://gerrit.wikimedia.org/r/1041058
[10:59:34] <fabfur>	 !log disabled puppet on A:lvs-drmrs to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1039947 (T366466)
[10:59:58] <wikibugs>	 (03CR) 10Fabfur: [C:03+2] hiera: enable IPIP for high-traffic1@drmrs for text services [puppet] - 10https://gerrit.wikimedia.org/r/1039947 (https://phabricator.wikimedia.org/T366466) (owner: 10Fabfur)
[11:03:33] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] common_images: update statsd-exporter version [puppet] - 10https://gerrit.wikimedia.org/r/1041058 (owner: 10Giuseppe Lavagetto)
[11:04:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T364069)', diff saved to https://phabricator.wikimedia.org/P64521 and previous config saved to /var/cache/conftool/dbconfig/20240610-110409-marostegui.json
[11:04:10] <_joe_>	 fabfur: can I merge your changes?
[11:04:46] <_joe_>	 fabfur: ping
[11:04:46] <fabfur>	 yes, I was going to but it's locked (by you)
[11:04:49] <fabfur>	 thanks
[11:04:56] <_joe_>	 done
[11:05:06] <fabfur>	 ack
[11:06:22] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:openstack: opentofu: Allow everyone to enter the directory [puppet] - 10https://gerrit.wikimedia.org/r/1041042 (https://phabricator.wikimedia.org/T365696) (owner: 10Majavah)
[11:06:30] <wikibugs>	 (03CR) 10Majavah: [C:03+2] P:openstack: opentofu: Do not log changes to the env file [puppet] - 10https://gerrit.wikimedia.org/r/1041043 (https://phabricator.wikimedia.org/T365696) (owner: 10Majavah)
[11:09:06] <fabfur>	 !log tests look good, enabling && running puppet on A:cp-text to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1039948 (on drmrs) (T366466)
[11:09:16] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2017.codfw.wmnet
[11:09:36] <wikibugs>	 (03PS2) 10Hnowlan: wmnet: remove similar-users [dns] - 10https://gerrit.wikimedia.org/r/1014495 (https://phabricator.wikimedia.org/T345274)
[11:09:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1015.eqiad.wmnet
[11:11:13] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: mwdebug: allow dumping more variables for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041060
[11:12:36] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] mwdebug: allow dumping more variables for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041060 (owner: 10Giuseppe Lavagetto)
[11:12:54] <wikibugs>	 (03PS5) 10Ebrahim: errorpage: Add dark mode support to error page [puppet] - 10https://gerrit.wikimedia.org/r/1040809
[11:12:57] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] errorpage: Add dark mode support to error page [puppet] - 10https://gerrit.wikimedia.org/r/1040809 (owner: 10Ebrahim)
[11:13:03] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] errorpage: Add dark mode support to error page [puppet] - 10https://gerrit.wikimedia.org/r/1040809 (owner: 10Ebrahim)
[11:13:16] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 5%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64522 and previous config saved to /var/cache/conftool/dbconfig/20240610-111315-arnaudb.json
[11:16:30] <icinga-wm_>	 PROBLEM - OpenSearch health check for shards on 9200 on logstash1032 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f3fccff3280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wikimedia.org/wiki/Search%23Administration
[11:17:08] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: mwdebug: allow dumping more variables for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041060
[11:17:19] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] mwdebug: allow dumping more variables for debugging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041060 (owner: 10Giuseppe Lavagetto)
[11:18:19] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[11:18:30] <icinga-wm_>	 RECOVERY - OpenSearch health check for shards on 9200 on logstash1032 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 756, active_shards: 1774, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_sha
[11:18:30] <icinga-wm_>	 umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[11:18:43] <jinxer-wm>	 RESOLVED: ProbeDown: Service gitlab2002:22 has failed probes (tcp_gitlab_wikimedia_org_ssh_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#gitlab2002:22 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[11:19:09] <logmsgbot>	 !log oblivian@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[11:19:17] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[11:19:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P64523 and previous config saved to /var/cache/conftool/dbconfig/20240610-111917-marostegui.json
[11:19:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1015.eqiad.wmnet
[11:19:42] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[11:22:17] <wikibugs>	 (03PS7) 10Brouberol: datahub: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978)
[11:23:40] <logmsgbot>	 !log oblivian@deploy1002 Locking from deployment [ALL REPOSITORIES]: setting global lock while working on mw-on-k8s --joe. Ping me if you need urgent deployments
[11:24:21] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] "I'm merging this but it won't be deployed until we restart sanitarium hosts. That's going to take a while. There is a ticket for improving" [puppet] - 10https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825) (owner: 10Zabe)
[11:24:28] <wikibugs>	 (03PS3) 10Zabe: hieradata: Add arbcom_itwiki to private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825)
[11:24:30] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] hieradata: Add arbcom_itwiki to private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825) (owner: 10Zabe)
[11:24:33] <wikibugs>	 (03CR) 10Ladsgroup: [V:03+2 C:03+2] hieradata: Add arbcom_itwiki to private wikis [puppet] - 10https://gerrit.wikimedia.org/r/1028855 (https://phabricator.wikimedia.org/T363825) (owner: 10Zabe)
[11:25:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1015.eqiad.wmnet
[11:25:43] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1015.eqiad.wmnet
[11:26:02] <fabfur>	 !log enabling && running puppet on A:lvs-drmrs to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1039947 (T366466)
[11:26:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:26:06] <stashbot>	 T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466
[11:26:10] <wikibugs>	 (03CR) 10Btullis: [C:03+1] "Looks good to me, thanks." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol)
[11:27:55] <wikibugs>	 (03CR) 10Brouberol: [V:03+1] deployment_server: alert on admin-ng pending changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[11:28:10] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] datahub: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol)
[11:28:17] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2017.codfw.wmnet
[11:28:23] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 10%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64524 and previous config saved to /var/cache/conftool/dbconfig/20240610-112821-arnaudb.json
[11:28:32] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1016.eqiad.wmnet
[11:29:04] <wikibugs>	 (03Merged) 10jenkins-bot: datahub: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041036 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol)
[11:29:40] <fabfur>	 !log restarting pybal on lvs6003,lvs6001 to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/1039947 (T366466)
[11:29:43] <wikibugs>	 06SRE, 06Infrastructure-Foundations, 10netops, 13Patch-For-Review: Netbox network report failing - timeout errors - https://phabricator.wikimedia.org/T321704#9874878 (10cmooney) >>! In T321704#9874622, @Volans wrote: > I've took a look today and trying to manually run all the tests there isn't anyone that...
[11:29:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:31:47] <wikibugs>	 (03PS1) 10Clément Goubert: weekly-update.sh: Actually skip spark images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1041063
[11:32:14] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[11:34:02] <logmsgbot>	 !log oblivian@deploy1002 Unlocked for deployment [ALL REPOSITORIES]: setting global lock while working on mw-on-k8s --joe. Ping me if you need urgent deployments (duration: 10m 22s)
[11:34:27] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173', diff saved to https://phabricator.wikimedia.org/P64525 and previous config saved to /var/cache/conftool/dbconfig/20240610-113426-marostegui.json
[11:34:42] <wikibugs>	 (03CR) 10Hnowlan: [C:03+1] weekly-update.sh: Actually skip spark images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1041063 (owner: 10Clément Goubert)
[11:34:52] <logmsgbot>	 !log oblivian@deploy1002 Started scap: Deploying change to base mediawiki image
[11:35:02] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[11:36:05] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2017.codfw.wmnet
[11:36:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2017.codfw.wmnet
[11:36:32] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub: apply on production
[11:36:59] <wikibugs>	 (03CR) 10Clément Goubert: [V:03+2 C:03+2] weekly-update.sh: Actually skip spark images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/1041063 (owner: 10Clément Goubert)
[11:39:16] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub: sync on production
[11:39:38] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] Upgrade base OS to Debian bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1039778 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan)
[11:41:47] <wikibugs>	 (03PS1) 10JMeybohm: mcrouter: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041065 (https://phabricator.wikimedia.org/T362978)
[11:42:25] <wikibugs>	 (03CR) 10CI reject: [V:04-1] mcrouter: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041065 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm)
[11:42:57] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] sre.k8s.reboot-nodes: Add exclude option [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert)
[11:43:02] <wikibugs>	 (03PS1) 10Fabfur: Revert "depool text@drmrs before enabling IPIP encapsulation" [dns] - 10https://gerrit.wikimedia.org/r/1041066
[11:43:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1016.eqiad.wmnet
[11:43:29] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 25%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64526 and previous config saved to /var/cache/conftool/dbconfig/20240610-114329-arnaudb.json
[11:43:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2018.codfw.wmnet
[11:43:55] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] sre.k8s.reboot-nodes: Add exclude option (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert)
[11:44:41] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] Upgrade base OS to Debian bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1039778 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan)
[11:44:44] <logmsgbot>	 !log oblivian@deploy1002 sync-world aborted: Deploying change to base mediawiki image (duration: 10m 21s)
[11:44:59] <wikibugs>	 (03PS2) 10JMeybohm: mcrouter: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041065 (https://phabricator.wikimedia.org/T362978)
[11:45:23] <icinga-wm_>	 PROBLEM - Host kubetcd1006 is DOWN: PING CRITICAL - Packet loss = 100%
[11:45:46] <wikibugs>	 (03PS3) 10JMeybohm: mcrouter: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041065 (https://phabricator.wikimedia.org/T362978)
[11:46:41] <wikibugs>	 (03PS2) 10Majavah: P:openstack: opentofu: Add a variable for region [puppet] - 10https://gerrit.wikimedia.org/r/1041045 (https://phabricator.wikimedia.org/T365696)
[11:46:41] <wikibugs>	 (03PS1) 10Majavah: P:openstack: opentofu: Add a diff job to catch unapplied changes [puppet] - 10https://gerrit.wikimedia.org/r/1041069
[11:46:49] <wikibugs>	 (03Merged) 10jenkins-bot: sre.k8s.reboot-nodes: Add exclude option [cookbooks] - 10https://gerrit.wikimedia.org/r/1038865 (owner: 10Clément Goubert)
[11:47:14] <wikibugs>	 (03PS1) 10Brouberol: superset: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041068 (https://phabricator.wikimedia.org/T346638)
[11:48:51] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1016.eqiad.wmnet
[11:49:19] <wikibugs>	 (03PS2) 10Majavah: P:openstack: opentofu: Add a diff job to catch unapplied changes [puppet] - 10https://gerrit.wikimedia.org/r/1041069
[11:49:24] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1016.eqiad.wmnet
[11:49:26] <wikibugs>	 (03Abandoned) 10JMeybohm: mcrouter: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041065 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm)
[11:49:35] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1173 (T364069)', diff saved to https://phabricator.wikimedia.org/P64527 and previous config saved to /var/cache/conftool/dbconfig/20240610-114934-marostegui.json
[11:49:37] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[11:49:38] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[11:49:50] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1180.eqiad.wmnet with reason: Maintenance
[11:49:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1180 (T364069)', diff saved to https://phabricator.wikimedia.org/P64528 and previous config saved to /var/cache/conftool/dbconfig/20240610-114957-marostegui.json
[11:50:25] <icinga-wm_>	 RECOVERY - Host kubetcd1006 is UP: PING OK - Packet loss = 0%, RTA = 0.99 ms
[11:50:35] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists: Create a mailing list for Bangla Wikimoitree - https://phabricator.wikimedia.org/T365915#9874958 (10Ladsgroup) Again, comparing apples and oranges. They requested a mailing list for a project. Not a Wikimedia Hub.   I will create this under type of project.
[11:50:50] <wikibugs>	 (03CR) 10Majavah: [V:03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/1041069 (owner: 10Majavah)
[11:52:35] <wikibugs>	 (03PS4) 10JMeybohm: mcrouter: Bump chart modules [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038859 (https://phabricator.wikimedia.org/T362978) (owner: 10Alexandros Kosiaris)
[11:53:36] <logmsgbot>	 !log oblivian@deploy1002 Started scap: Deploying change to base mediawiki image (take 2)
[11:55:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C:03+2] webperf: don't hardcode php version [puppet] - 10https://gerrit.wikimedia.org/r/1039974 (https://phabricator.wikimedia.org/T353912) (owner: 10Filippo Giunchedi)
[11:55:16] <wikibugs>	 (03Merged) 10jenkins-bot: Upgrade base OS to Debian bookworm [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1039778 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan)
[11:56:16] <wikibugs>	 (03CR) 10JMeybohm: "Updated the fixture to match the changed values. Also add Bug tag to T362978, as this adds securityContext as well." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1038859 (https://phabricator.wikimedia.org/T362978) (owner: 10Alexandros Kosiaris)
[11:56:37] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2018.codfw.wmnet
[11:57:47] <wikibugs>	 (03CR) 10Hashar: [C:03+1] "What Paladox, that is due to an update in the Soy templating engine." [puppet] - 10https://gerrit.wikimedia.org/r/1037765 (owner: 10Paladox)
[11:58:32] <wikibugs>	 (03CR) 10Majavah: [C:03+2] gerrit: fix "its" templates for 3.9 [puppet] - 10https://gerrit.wikimedia.org/r/1037765 (owner: 10Paladox)
[11:58:35] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 50%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64530 and previous config saved to /var/cache/conftool/dbconfig/20240610-115834-arnaudb.json
[12:00:05] <wikibugs>	 (03PS5) 10Brouberol: spark-history: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041070 (https://phabricator.wikimedia.org/T362978)
[12:00:53] <wikibugs>	 (03PS1) 10Brouberol: echoserver: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041071 (https://phabricator.wikimedia.org/T362978)
[12:02:29] <wikibugs>	 (03PS1) 10JMeybohm: python-webapp: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041072 (https://phabricator.wikimedia.org/T362978)
[12:04:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2018.codfw.wmnet
[12:04:57] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+1] k8s: send logs to per-cluster kafka topics (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/1040170 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi)
[12:05:08] <wikibugs>	 (03PS3) 10Filippo Giunchedi: k8s: send logs to per-cluster kafka topics [puppet] - 10https://gerrit.wikimedia.org/r/1040170 (https://phabricator.wikimedia.org/T366710)
[12:05:52] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2018.codfw.wmnet
[12:07:05] <wikibugs>	 (03PS2) 10JMeybohm: python-webapp: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041072 (https://phabricator.wikimedia.org/T362978)
[12:11:35] <wikibugs>	 (03PS1) 10JMeybohm: calculator-service: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041076 (https://phabricator.wikimedia.org/T362978)
[12:13:32] <wikibugs>	 (03PS6) 10Majavah: wikitech: Stop loading OpenStackManager [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1038750 (https://phabricator.wikimedia.org/T161553)
[12:13:41] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 75%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64531 and previous config saved to /var/cache/conftool/dbconfig/20240610-121341-arnaudb.json
[12:15:00] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/1041077 (owner: 10L10n-bot)
[12:15:40] <logmsgbot>	 !log oblivian@deploy1002 Finished scap: Deploying change to base mediawiki image (take 2) (duration: 22m 39s)
[12:20:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2019.codfw.wmnet
[12:21:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1017.eqiad.wmnet
[12:21:27] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2019.codfw.wmnet
[12:22:33] <wikibugs>	 (03CR) 10JMeybohm: "This is just a demo chart, it is not deployed anywhere" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041076 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm)
[12:24:20] <wikibugs>	 (03PS3) 10Awight: Revert "Temporary monitoring for scraper" [puppet] - 10https://gerrit.wikimedia.org/r/1040075 (https://phabricator.wikimedia.org/T366144)
[12:24:21] <wikibugs>	 (03CR) 10Awight: [C:03+1] "Can be merged safely.  Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/1040075 (https://phabricator.wikimedia.org/T366144) (owner: 10Awight)
[12:25:10] <wikibugs>	 06SRE, 06collaboration-services, 10Wikimedia-Mailing-lists, 13Patch-For-Review: Migrate Mailman/lists to Bullseye/Bookworm - https://phabricator.wikimedia.org/T331706#9875114 (10Ladsgroup) Overall looks good. Just noting that rebuilding index will take a very long time and that can make the downtime quite...
[12:28:04] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2019.codfw.wmnet
[12:28:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2019.codfw.wmnet
[12:28:49] <logmsgbot>	 !log arnaudb@cumin1002 dbctl commit (dc=all): 'db2204 (re)pooling @ 100%: post maintenance repool', diff saved to https://phabricator.wikimedia.org/P64532 and previous config saved to /var/cache/conftool/dbconfig/20240610-122847-arnaudb.json
[12:30:22] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2020.codfw.wmnet
[12:32:23] <wikibugs>	 (PS1) Brouberol: datahub: set distinct ES index prefix between staging and prod [deployment-charts] - https://gerrit.wikimedia.org/r/1041088
[12:32:35] <icinga-wm_>	 PROBLEM - OpenSearch health check for shards on 9200 on logstash1032 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fb6d79d9280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.w
[12:32:35] <icinga-wm_>	 org/wiki/Search%23Administration
[12:33:29] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977#9875143 (cmooney)
[12:34:33] <icinga-wm_>	 RECOVERY - OpenSearch health check for shards on 9200 on logstash1032 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 756, active_shards: 1774, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_sha
[12:34:33] <icinga-wm_>	 umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[12:34:35] <wikibugs>	 (CR) Elukey: [C:+2] admin_ng: update Bookworm-based Docker images [deployment-charts] - https://gerrit.wikimedia.org/r/1040153 (https://phabricator.wikimedia.org/T356252) (owner: Elukey)
[12:35:26] <wikibugs>	 (Abandoned) Nikerabbit: Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1039693 (owner: L10n-bot)
[12:35:59] <wikibugs>	 (CR) Nikerabbit: [V:+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/1041077 (owner: L10n-bot)
[12:36:28] <wikibugs>	 (CR) Vgutierrez: [C:-1] Automated MarkMonitor domain sync (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1039849 (owner: Ncmonitor)
[12:36:44] <wikibugs>	 (CR) Filippo Giunchedi: [C:+2] Revert "Temporary monitoring for scraper" [puppet] - https://gerrit.wikimedia.org/r/1040075 (https://phabricator.wikimedia.org/T366144) (owner: Awight)
[12:37:24] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1017.eqiad.wmnet
[12:37:36] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2020.codfw.wmnet
[12:37:46] <wikibugs>	 (CR) Vgutierrez: [C:-1] "jumping from 7 certs to 20 is definitely too much IMHO, we should split this one in several CRs to be merged at different times (so we don" [puppet] - https://gerrit.wikimedia.org/r/1039849 (owner: Ncmonitor)
[12:39:33] <icinga-wm_>	 PROBLEM - Swift https backend on ms-fe2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Swift
[12:39:43] <icinga-wm_>	 PROBLEM - Host kubetcd2006 is DOWN: PING CRITICAL - Packet loss = 100%
[12:39:43] <icinga-wm_>	 PROBLEM - Host ml-etcd2001 is DOWN: PING CRITICAL - Packet loss = 100%
[12:40:07] <logmsgbot>	 !log elukey@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'sync'.
[12:40:23] <icinga-wm_>	 RECOVERY - Swift https backend on ms-fe2014 is OK: HTTP OK: HTTP/1.1 200 OK - 507 bytes in 0.563 second response time https://wikitech.wikimedia.org/wiki/Swift
[12:40:47] <wikibugs>	 (CR) Majavah: [C:-1] "This patch includes quite a few WMCS domains that are either delegated to openstack (so won't issue certs at all on the wikiprod acme-chie" [puppet] - https://gerrit.wikimedia.org/r/1039849 (owner: Ncmonitor)
[12:41:04] <logmsgbot>	 !log elukey@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'sync'.
[12:41:14] <elukey>	 jouncebot: next
[12:41:14] <jouncebot>	 In 0 hour(s) and 18 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1300)
[12:43:01] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[12:43:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1017.eqiad.wmnet
[12:43:31] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1017.eqiad.wmnet
[12:43:50] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2020.codfw.wmnet
[12:44:09] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[12:44:12] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2020.codfw.wmnet
[12:45:29] <icinga-wm_>	 RECOVERY - Host kubetcd2006 is UP: PING OK - Packet loss = 0%, RTA = 30.53 ms
[12:45:41] <icinga-wm_>	 RECOVERY - Host ml-etcd2001 is UP: PING OK - Packet loss = 0%, RTA = 30.66 ms
[12:45:44] <wikibugs>	 (PS1) Ebrahim: errorpages: Add dark mode support [mediawiki-config] - https://gerrit.wikimedia.org/r/1041091
[12:45:52] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2021.codfw.wmnet
[12:46:03] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[12:46:20] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1018.eqiad.wmnet
[12:46:40] <wikibugs>	 (CR) Stevemunene: [C:+1] datahub: set distinct ES index prefix between staging and prod [deployment-charts] - https://gerrit.wikimedia.org/r/1041088 (owner: Brouberol)
[12:48:32] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[12:48:57] <jinxer-wm>	 FIRING: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:49:18] <godog>	 checking
[12:49:18] <wikibugs>	 (CR) Brouberol: [C:+2] datahub: set distinct ES index prefix between staging and prod [deployment-charts] - https://gerrit.wikimedia.org/r/1041088 (owner: Brouberol)
[12:49:36] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[12:49:42] <Amir1>	 hnowlan: thumbor is kaput
[12:50:05] <godog>	 for now looks like a blip, should be recovering
[12:50:31] <Amir1>	 maybe we should bump the replicas?
[12:50:33] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s_services/services/datahub-next: apply on staging
[12:50:46] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[12:51:10] <Amir1>	 godog: I don't think it's a blip https://grafana.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&refresh=1m&viewPanel=93
[12:51:26] <elukey>	 Amir1, godog - there was a blip for ms-be2014, maybe related? It seems codfw right?
[12:51:35] <jhathaway>	 o/
[12:51:37] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to <wmf> for <Gonyeahialam> - https://phabricator.wikimedia.org/T367053 (gonyeahialam) NEW
[12:51:43] <godog>	 elukey: codfw yeah
[12:52:16] <godog>	 Amir1: mmhh I'm wondering how laggy that metric is, I'm looking at the network probes https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes%2Fservice&var-module=All&orgId=1&from=now-1h&to=now
[12:52:34] <Amir1>	 yeah, it's actually recovering 
[12:52:39] <godog>	 elukey: could be, though a single host shouldn't affect things very much
[12:52:46] <elukey>	 yep yep
[12:53:18] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Gonyeahialam - https://phabricator.wikimedia.org/T367053#9875202 (gonyeahialam)
[12:53:24] <elukey>	 and I got the name wrong, it was ms-fe2014
[12:53:26] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s_services/services/datahub-next: sync on staging
[12:53:57] <jinxer-wm>	 RESOLVED: ProbeDown: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[12:54:06] <Amir1>	 hnowlan: sorry pinged too soon :D
[12:54:12] <elukey>	 from https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=ms-fe2014&var-datasource=thanos&var-cluster=swift it seems that something happened at around 11:40 UTC
[12:54:42] <wikibugs>	 (CR) Fabfur: [C:+2] Revert "depool text@drmrs before enabling IPIP encapsulation" [dns] - https://gerrit.wikimedia.org/r/1041066 (owner: Fabfur)
[12:55:00] <Amir1>	 https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=ms-fe2014&var-datasource=thanos&var-cluster=swift&viewPanel=13 ouch
[12:55:10] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Access to wmf for Gonyeahialam - https://phabricator.wikimedia.org/T367053#9875204 (gonyeahialam)
[12:55:14] <fabfur>	 !log repooling text@drmrs (IPIP encapsulation enabled) (T366466)
[12:55:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:55:17] <stashbot>	 T366466: Use IPIP encapsulation on lvs<-->text cluster - https://phabricator.wikimedia.org/T366466
[12:57:14] <elukey>	 Amir1: yep it is weird that it didn't happen in the previous 7 days
[12:57:20] <elukey>	 so seems quite weird
[12:58:05] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[12:58:10] <wikibugs>	 (PS1) Gerrit maintenance bot: mariadb: Promote es1037 to es6 master [puppet] - https://gerrit.wikimedia.org/r/1041092 (https://phabricator.wikimedia.org/T367055)
[12:58:15] <wikibugs>	 (PS1) Gerrit maintenance bot: wmnet: Update es6-master alias [dns] - https://gerrit.wikimedia.org/r/1041093 (https://phabricator.wikimedia.org/T367055)
[12:58:16] <Amir1>	 Emperor: ^
[12:58:19] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[12:59:11] <Emperor>	 what am I being pinged about, sorry?
[13:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: That opportune time for a UTC afternoon backport window deploy is upon us again. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1300).
[13:00:05] <jouncebot>	 Gerges: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[13:00:08] <Amir1>	 Emperor: https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=ms-fe2014&var-datasource=thanos&var-cluster=swift 
[13:00:16] <Gerges>	 Hi
[13:00:23] <Amir1>	 this has triggered a page
[13:00:29] <elukey>	 Emperor: there was a page earlier on :)
[13:00:30] <wikibugs>	 (PS1) Cmelo: Enable CampaignEvents on swahili wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502)
[13:00:31] <Amir1>	 (it got resolved)
[13:00:50] <Amir1>	 but it's worth taking a look https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=ms-fe2014&var-datasource=thanos&var-cluster=swift&viewPanel=13
[13:01:50] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/admin 'sync'.
[13:01:57] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'sync'.
[13:02:04] <effie>	 jouncebot: now
[13:02:04] <jouncebot>	 For the next 0 hour(s) and 57 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1300)
[13:02:12] <effie>	 jouncebot: next
[13:02:12] <jouncebot>	 In 2 hour(s) and 27 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1530)
[13:02:45] <elukey>	 I am going to stop my deployments to wikikube for the moment
[13:03:14] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[13:03:17] <Amir1>	 Gerges: let me check and deploy
[13:03:19] <wikibugs>	 (PS1) Cmelo: Enable CampaignEvents on swahili wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1041096 (https://phabricator.wikimedia.org/T366502)
[13:03:25] <wikibugs>	 (CR) Vgutierrez: [C:-1] "`" [dns] - https://gerrit.wikimedia.org/r/1040335 (owner: Ncmonitor)
[13:03:43] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[13:03:45] <Gerges>	 Ok
[13:04:02] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[13:04:08] <wikibugs>	 (CR) Ladsgroup: [C:+2] [huwiki] Add "suppressredirect" user right to editor user group [mediawiki-config] - https://gerrit.wikimedia.org/r/1041044 (https://phabricator.wikimedia.org/T366438) (owner: GergesShamon)
[13:04:23] <wikibugs>	 (CR) Ottomata: "Nice!" [deployment-charts] - https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: Snwachukwu)
[13:04:24] <Emperor>	 Amir1: similar pattern seen with e.g. ms-fe2013 too, which didn't result in a spike in errors
[13:04:27] <wikibugs>	 (CR) Vgutierrez: [C:-1] "those should be mentioned on my review of the DNS related change: https://gerrit.wikimedia.org/r/c/operations/dns/+/1040335/comments/d97a9" [puppet] - https://gerrit.wikimedia.org/r/1039849 (owner: Ncmonitor)
[13:04:32] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[13:04:43] <wikibugs>	 (CR) TrainBranchBot: [C:+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/1041044 (https://phabricator.wikimedia.org/T366438) (owner: GergesShamon)
[13:04:50] <wikibugs>	 (Merged) jenkins-bot: [huwiki] Add "suppressredirect" user right to editor user group [mediawiki-config] - https://gerrit.wikimedia.org/r/1041044 (https://phabricator.wikimedia.org/T366438) (owner: GergesShamon)
[13:05:08] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1041044|[huwiki] Add "suppressredirect" user right to editor user group (T366438)]]
[13:05:14] <stashbot>	 T366438: Grant "suppressredirect" to editor on huwiki - https://phabricator.wikimedia.org/T366438
[13:06:03] <Emperor>	 Amir1: if you look at tcp retransmits, there's a similar rise in all of codfw swift frontends starting around 11:40 UTC today
[13:06:40] <Emperor>	 Amir1: e.g. https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m&var-server=ms-fe2013&var-datasource=thanos&var-cluster=swift&viewPanel=31&from=1717419989138&to=1718024789138
[13:07:05] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1018.eqiad.wmnet
[13:07:12] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2021.codfw.wmnet
[13:07:55] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[13:07:56] <wikibugs>	 (PS1) Brouberol: datahub: don't use an ES index prefix for production [deployment-charts] - https://gerrit.wikimedia.org/r/1041098
[13:08:17] <effie>	 elukey: ping when you are done, I would like to perform some reboots
[13:08:17] <Amir1>	 Gerges: it's available on the mwdebug servers https://wikitech.wikimedia.org/wiki/Mwdebug
[13:08:20] <logmsgbot>	 !log elukey@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[13:08:25] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and gergesshamon: Backport for [[gerrit:1041044|[huwiki] Add "suppressredirect" user right to editor user group (T366438)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[13:08:37] <wikibugs>	 (CR) Btullis: [C:+1] "Looks good, thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1041098 (owner: Brouberol)
[13:08:57] <elukey>	 effie: o/ I am waiting for the deploy window to close before proceeding, my deploys should take ~5 mins afterwards
[13:09:34] <wikibugs>	 (CR) Btullis: [C:+1] "Nice, thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1041068 (https://phabricator.wikimedia.org/T346638) (owner: Brouberol)
[13:09:39] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[13:09:39] <fabfur>	 !log rebooting cp4047 (T366555)
[13:09:44] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:09:47] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=no; selector: name=cp4047.ulsfo.wmnet
[13:09:49] <effie>	 cool cool, I am queueing behind you then :p
[13:09:57] <wikibugs>	 (CR) Brouberol: [C:+2] datahub: don't use an ES index prefix for production [deployment-charts] - https://gerrit.wikimedia.org/r/1041098 (owner: Brouberol)
[13:10:05] <logmsgbot>	 !log elukey@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[13:10:07] <wikibugs>	 (PS2) Cmelo: Enable CampaignEvents on swahili wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502)
[13:10:08] <logmsgbot>	 !log fabfur@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp4047.ulsfo.wmnet
[13:10:35] <Gerges>	 Amir1: I checked mwdebug, and everything is fine
[13:10:42] <Emperor>	 Amir1: I don't think it's NIC saturation (cf https://w.wiki/5$CU )
[13:10:46] <wikibugs>	 (PS4) Majavah: service: update labweb/cloudweb conftool pool name [puppet] - https://gerrit.wikimedia.org/r/941459 (https://phabricator.wikimedia.org/T317463)
[13:10:46] <wikibugs>	 (PS3) Majavah: conftool-data: drop labweb pool [puppet] - https://gerrit.wikimedia.org/r/941460 (https://phabricator.wikimedia.org/T317463)
[13:11:05] <Amir1>	 Gerges: awesome
[13:11:10] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and gergesshamon: Continuing with sync
[13:11:33] <jinxer-wm>	 FIRING: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:11:39] <Amir1>	 Emperor: I'd say let's create a ticket and investigate 
[13:11:40] <wikibugs>	 (CR) Majavah: [C:+2] service: update labweb/cloudweb conftool pool name [puppet] - https://gerrit.wikimedia.org/r/941459 (https://phabricator.wikimedia.org/T317463) (owner: Majavah)
[13:11:48] <taavi>	 !log restarting eqiad low-traffic LVS for https://gerrit.wikimedia.org/r/c/operations/puppet/+/941459
[13:11:50] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:12:10] <wikibugs>	 (CR) Btullis: [C:+1] "Thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1041070 (https://phabricator.wikimedia.org/T362978) (owner: Brouberol)
[13:12:53] <wikibugs>	 (CR) Daimona Eaytoy: [C:-1] Enable CampaignEvents on swahili wikipedia (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502) (owner: Cmelo)
[13:12:59] <wikibugs>	 (PS9) Vgutierrez: lvs::realserver::ipip: Provide ferm MSS clamping support [puppet] - https://gerrit.wikimedia.org/r/1035724 (https://phabricator.wikimedia.org/T365689)
[13:13:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti1018.eqiad.wmnet
[13:13:11] <wikibugs>	 (CR) Daimona Eaytoy: [C:+1] Enable CampaignEvents on swahili wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/1041096 (https://phabricator.wikimedia.org/T366502) (owner: Cmelo)
[13:13:15] <wikibugs>	 (CR) Btullis: [C:+1] "Nice, thanks." [deployment-charts] - https://gerrit.wikimedia.org/r/1041071 (https://phabricator.wikimedia.org/T362978) (owner: Brouberol)
[13:13:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti1018.eqiad.wmnet
[13:13:20] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2021.codfw.wmnet
[13:13:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2021.codfw.wmnet
[13:13:47] <wikibugs>	 (CR) Brouberol: [C:+2] spark-history: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041070 (https://phabricator.wikimedia.org/T362978) (owner: Brouberol)
[13:14:38] <Gerges>	 Amir1: has it been deployed?
[13:15:22] <Amir1>	 Gerges: not yet
[13:15:42] <Amir1>	 80%
[13:15:50] <wikibugs>	 (CR) Vgutierrez: [V:+1] "PCC SUCCESS (CORE_DIFF 2 DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/1035724 (https://phabricator.wikimedia.org/T365689) (owner: Vgutierrez)
[13:16:33] <jinxer-wm>	 RESOLVED: KubernetesAPILatency: High Kubernetes API latency (LIST certificaterequests) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/ddNd-sLnk/kubernetes-api-details?var-site=codfw&var-cluster=k8s-mlstaging&var-latency_percentile=0.95&var-verb=LIST - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:16:46] <Gerges>	 Amir1: Is there some way to see the output that you see during the deploy process?
[13:16:53] <Amir1>	 nope
[13:17:06] <Amir1>	 eventually, one day
[13:17:21] <Gerges>	 OK 
[13:17:50] <wikibugs>	 (CR) Elukey: "Definitely yes, otherwise it looks good! It also avoids me to re-build these images for security upgrades, please make sure the new images" [deployment-charts] - https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: Snwachukwu)
[13:18:07] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad or A:lvs-low-traffic-eqiad and A:lvs
[13:18:34] <logmsgbot>	 !log taavi@cumin1002 END (FAIL) - Cookbook sre.loadbalancer.restart-pybal (exit_code=99) rolling-restart of pybal on A:lvs-secondary-eqiad or A:lvs-low-traffic-eqiad and A:lvs
[13:18:57] <wikibugs>	 (PS1) Elukey: services: update changeprop's Docker images [deployment-charts] - https://gerrit.wikimedia.org/r/1041105 (https://phabricator.wikimedia.org/T356252)
[13:19:34] <logmsgbot>	 !log fabfur@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4047.ulsfo.wmnet
[13:20:09] <wikibugs>	 SRE, SRE-swift-storage, Traffic: Rise in ms-fe2* TCP retransmits since 11:40 UTC today - https://phabricator.wikimedia.org/T367056 (MatthewVernon) NEW
[13:20:10] <wikibugs>	 (CR) Elukey: [C:+2] services: update the rec-api's Docker image [deployment-charts] - https://gerrit.wikimedia.org/r/1018717 (https://phabricator.wikimedia.org/T205870) (owner: Elukey)
[13:20:10] <Emperor>	 Amir1: opened T367056
[13:20:13] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1041044|[huwiki] Add "suppressredirect" user right to editor user group (T366438)]] (duration: 15m 05s)
[13:20:13] <stashbot>	 T367056: Rise in ms-fe2* TCP retransmits since 11:40 UTC today  - https://phabricator.wikimedia.org/T367056
[13:20:15] <Amir1>	 thanks
[13:20:20] <stashbot>	 T366438: Grant "suppressredirect" to editor on huwiki - https://phabricator.wikimedia.org/T366438
[13:20:36] <Amir1>	 Gerges: done
[13:20:43] <Amir1>	 https://www.irccloud.com/pastebin/KPkXcFPO/
[13:20:49] <Gerges>	 Thanks 
[13:20:56] <Amir1>	 one of the hosts failed to restart
[13:21:35] <Amir1>	 this might be related to taavi's change I think
[13:23:59] <taavi>	 Amir1: yeah, probably, sorry about that. do you want me to manually restart that or did you do that already?
[13:25:37] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/recommendation-api: sync
[13:25:52] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/recommendation-api: sync
[13:25:58] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2203.codfw.wmnet with reason: Maintenance
[13:26:01] <taavi>	 (we're debugging why the cookbook failed in -traffic)
[13:26:11] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2203.codfw.wmnet with reason: Maintenance
[13:26:20] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2203 (T352010)', diff saved to https://phabricator.wikimedia.org/P64534 and previous config saved to /var/cache/conftool/dbconfig/20240610-132619-ladsgroup.json
[13:26:23] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[13:26:24] <wikibugs>	 (PS1) Brouberol: spark-history: restore the ability to get env variables from configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1041106
[13:26:50] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/recommendation-api: sync
[13:27:11] <wikibugs>	 (PS1) Arnaudb: dbconfig: remove cluster30/es6 to switchmaster [mediawiki-config] - https://gerrit.wikimedia.org/r/1041107 (https://phabricator.wikimedia.org/T367055)
[13:27:16] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/recommendation-api: sync
[13:27:47] <wikibugs>	 (CR) Btullis: [C:+1] spark-history: restore the ability to get env variables from configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1041106 (owner: Brouberol)
[13:28:05] <Amir1>	 taavi: I have to go to a meeting, if you restart it, I'd be grateful
[13:28:09] <taavi>	 will do
[13:28:24] <wikibugs>	 (CR) Brouberol: [C:+2] spark-history: restore the ability to get env variables from configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1041106 (owner: Brouberol)
[13:28:48] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Long schema change
[13:28:50] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2140.codfw.wmnet with reason: Long schema change
[13:29:24] <taavi>	 !log taavi@mw1447 ~ $ sudo /usr/local/sbin/restart-php-fpm-all php7.4-fpm 9223372036854775807 # leftover from me restarting LVS during deployment
[13:29:25] <wikibugs>	 (Merged) jenkins-bot: spark-history: restore the ability to get env variables from configmap [deployment-charts] - https://gerrit.wikimedia.org/r/1041106 (owner: Brouberol)
[13:29:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:29:39] <wikibugs>	 (CR) Marostegui: "Let's make the commit a bit more clear: this is to temporary disable writes on es6." [mediawiki-config] - https://gerrit.wikimedia.org/r/1041107 (https://phabricator.wikimedia.org/T367055) (owner: Arnaudb)
[13:29:39] <wikibugs>	 (PS1) Filippo Giunchedi: hieradata: remove thanos-query settings from thanos::frontend [puppet] - https://gerrit.wikimedia.org/r/1041110 (https://phabricator.wikimedia.org/T357747)
[13:29:41] <wikibugs>	 (PS1) Filippo Giunchedi: titan: trim 5m retention to 70w [puppet] - https://gerrit.wikimedia.org/r/1041111 (https://phabricator.wikimedia.org/T357747)
[13:30:05] <wikibugs>	 (PS1) Ssingh: restart-pybal: increase timeout and retries for spicerack.requests_session [cookbooks] - https://gerrit.wikimedia.org/r/1041112
[13:30:33] <marostegui>	 !log dbmaint codfw s4 deploy schema change on db2140 T364069
[13:30:37] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:30:37] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[13:31:06] <wikibugs>	 (PS2) Arnaudb: dbconfig: temporary disable writes on es6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1041107 (https://phabricator.wikimedia.org/T367055)
[13:31:11] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2838/co" [puppet] - https://gerrit.wikimedia.org/r/1041110 (https://phabricator.wikimedia.org/T357747) (owner: Filippo Giunchedi)
[13:31:30] <wikibugs>	 (CR) Filippo Giunchedi: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2839/co" [puppet] - https://gerrit.wikimedia.org/r/1041111 (https://phabricator.wikimedia.org/T357747) (owner: Filippo Giunchedi)
[13:32:28] <wikibugs>	 (CR) Majavah: [C:+1] restart-pybal: increase timeout and retries for spicerack.requests_session [cookbooks] - https://gerrit.wikimedia.org/r/1041112 (owner: Ssingh)
[13:32:32] <wikibugs>	 (CR) FNegri: [C:+1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/1041069 (owner: Majavah)
[13:33:49] <wikibugs>	 (CR) FNegri: [C:+1] P:openstack: opentofu: Add a variable for region [puppet] - https://gerrit.wikimedia.org/r/1041045 (https://phabricator.wikimedia.org/T365696) (owner: Majavah)
[13:34:02] <wikibugs>	 (CR) Majavah: [C:+2] P:openstack: opentofu: Add a variable for region [puppet] - https://gerrit.wikimedia.org/r/1041045 (https://phabricator.wikimedia.org/T365696) (owner: Majavah)
[13:34:08] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply
[13:34:10] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] P:openstack: opentofu: Add a diff job to catch unapplied changes [puppet] - https://gerrit.wikimedia.org/r/1041069 (owner: Majavah)
[13:34:20] <wikibugs>	 (CR) Marostegui: [C:+1] dbconfig: temporary disable writes on es6 [mediawiki-config] - https://gerrit.wikimedia.org/r/1041107 (https://phabricator.wikimedia.org/T367055) (owner: Arnaudb)
[13:34:35] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply
[13:34:42] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[13:35:35] <wikibugs>	 (CR) Ssingh: [C:+2] restart-pybal: increase timeout and retries for spicerack.requests_session [cookbooks] - https://gerrit.wikimedia.org/r/1041112 (owner: Ssingh)
[13:35:50] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/recommendation-api: sync
[13:36:15] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/recommendation-api: sync
[13:36:28] <wikibugs>	 (PS2) Filippo Giunchedi: hieradata: remove thanos-query settings from thanos::frontend [puppet] - https://gerrit.wikimedia.org/r/1041110 (https://phabricator.wikimedia.org/T357747)
[13:36:28] <wikibugs>	 (PS2) Filippo Giunchedi: titan: trim 5m retention to 70w [puppet] - https://gerrit.wikimedia.org/r/1041111 (https://phabricator.wikimedia.org/T357747)
[13:36:40] <elukey>	 !log move recommendation-api on wikikube to prometheus metrics (offboarded from statsd) - T205870
[13:36:42] <wikibugs>	 (03CR) 10Majavah: [C:04-1] "most of them, yes :-) but I wanted to mention the second category (I think just wikimediacloud.org and wikimedia.cloud) which are pointed" [puppet] - 10https://gerrit.wikimedia.org/r/1039849 (owner: 10Ncmonitor)
[13:36:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:36:44] <stashbot>	 T205870: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870
[13:36:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[13:36:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[13:36:58] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/spark-history: apply
[13:37:25] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/spark-history: apply
[13:38:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V:03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2840/co" [puppet] - 10https://gerrit.wikimedia.org/r/1041110 (https://phabricator.wikimedia.org/T357747) (owner: 10Filippo Giunchedi)
[13:40:52] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[13:40:54] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] echoserver: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041071 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol)
[13:41:33] <wikibugs>	 (03Merged) 10jenkins-bot: Add new chart statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039171 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[13:41:45] <jinxer-wm>	 RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[13:41:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[13:42:10] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[13:42:39] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[13:43:07] <elukey>	 effie: done!
[13:43:45] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/echoserver: apply
[13:43:56] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/echoserver: apply
[13:46:01] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] superset: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041068 (https://phabricator.wikimedia.org/T346638) (owner: 10Brouberol)
[13:46:12] <logmsgbot>	 !log taavi@cumin1002 START - Cookbook sre.loadbalancer.restart-pybal rolling-restart of pybal on A:lvs-secondary-eqiad or A:lvs-low-traffic-eqiad and A:lvs
[13:47:04] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset-next: apply
[13:47:11] <logmsgbot>	 !log taavi@cumin1002 END (PASS) - Cookbook sre.loadbalancer.restart-pybal (exit_code=0) rolling-restart of pybal on A:lvs-secondary-eqiad or A:lvs-low-traffic-eqiad and A:lvs
[13:47:33] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset-next: apply
[13:48:57] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/superset: apply
[13:49:25] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/superset: apply
[13:50:59] <logmsgbot>	 !log fabfur@cumin1002 conftool action : set/pooled=yes; selector: name=cp4047.ulsfo.wmnet
[13:51:53] <wikibugs>	 (03CR) 10Clément Goubert: "Small nit inline, otherwise lgtm" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[13:51:57] <wikibugs>	 (03PS1) 10Ottomata: Remove EventLoggingLegacyConverter code - it has been moved to EventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041115 (https://phabricator.wikimedia.org/T353817)
[13:52:33] <wikibugs>	 (03CR) 10CI reject: [V:04-1] Remove EventLoggingLegacyConverter code - it has been moved to EventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041115 (https://phabricator.wikimedia.org/T353817) (owner: 10Ottomata)
[13:54:03] <wikibugs>	 (03PS2) 10Ottomata: Remove EventLoggingLegacyConverter code - it has been moved to EventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041115 (https://phabricator.wikimedia.org/T353817)
[13:55:53] <wikibugs>	 (03PS3) 10Ottomata: Remove EventLoggingLegacyConverter code - it has been moved to EventLogging [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041115 (https://phabricator.wikimedia.org/T353817)
[13:56:30] <wikibugs>	 (03PS10) 10EoghanGaffney: lists: Add option to switch mailman root [puppet] - 10https://gerrit.wikimedia.org/r/1040174
[13:57:07] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] services: update changeprop's Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041105 (https://phabricator.wikimedia.org/T356252) (owner: 10Elukey)
[13:57:16] <wikibugs>	 (03PS1) 10Brouberol: datasets-config: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041119 (https://phabricator.wikimedia.org/T362978)
[13:57:30] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1107 for T348977 - bking@cumin2002
[13:57:31] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.elasticsearch.ban (exit_code=99) Banning hosts: elastic1107 for T348977 - bking@cumin2002
[13:57:34] <stashbot>	 T348977: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - https://phabricator.wikimedia.org/T348977
[13:57:47] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.elasticsearch.ban Banning hosts: elastic1107.eqiad.wmnet for T348977 - bking@cumin2002
[13:57:50] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.elasticsearch.ban (exit_code=0) Banning hosts: elastic1107.eqiad.wmnet for T348977 - bking@cumin2002
[13:58:15] <wikibugs>	 (03PS8) 10Giuseppe Lavagetto: statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265)
[13:58:26] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: statsd: add deployment to mw-debug (codfw only) (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[13:59:08] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[13:59:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T364069)', diff saved to https://phabricator.wikimedia.org/P64535 and previous config saved to /var/cache/conftool/dbconfig/20240610-135914-marostegui.json
[13:59:19] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[14:01:07] <wikibugs>	 (03PS2) 10Brouberol: mpic: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041120 (https://phabricator.wikimedia.org/T362978)
[14:01:37] <effie>	 jouncebot: now
[14:01:38] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 28 minute(s)
[14:01:49] <effie>	 elukey: how are things on your end?
[14:02:11] <elukey>	 all done! (pinged you earlier on)
[14:03:54] <effie>	 elukey: oh sorry, notification fail :/
[14:05:27] <Amir1>	 please ping me once you're done, I want to deploy so many more patches
[14:08:06] <_joe_>	 Amir1: hold your horses
[14:08:23] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[14:08:48] <_joe_>	 Amir1: I might make mediawiki un-deployable for a short while
[14:09:11] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations: Add option to exclude nodes from reboot by uptime or last reboot date - https://phabricator.wikimedia.org/T366797#9875489 (10MoritzMuehlenhoff) I think we should rather base this on a given kernel version? Seems more robust than a given date.
[14:09:13] <wikibugs>	 (03Merged) 10jenkins-bot: statsd: add deployment to mw-debug (codfw only) [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039233 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[14:10:07] <wikibugs>	 (03CR) 10Btullis: [C:03+1] datasets-config: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041119 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol)
[14:10:16] <wikibugs>	 (03PS1) 10Clément Goubert: shellbox: Bump shellbox image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041122 (https://phabricator.wikimedia.org/T362518)
[14:10:19] <wikibugs>	 (03PS1) 10Hnowlan: thumbor: use bullseye image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041123 (https://phabricator.wikimedia.org/T355020)
[14:10:31] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[14:10:31] <wikibugs>	 (03CR) 10Btullis: [C:03+1] mpic: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041120 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol)
[14:11:01] <logmsgbot>	 !log oblivian@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[14:12:09] <wikibugs>	 (03CR) 10Ebrahim: "ladsgroup@gmail.com" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041091 (owner: 10Ebrahim)
[14:12:13] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C:03+2] mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[14:12:14] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] thumbor: use bullseye image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041123 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan)
[14:12:51] <Amir1>	 _joe_: 💔
[14:12:51] <wikibugs>	 (03PS2) 10Hnowlan: thumbor: use bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041123 (https://phabricator.wikimedia.org/T355020)
[14:13:29] <_joe_>	 Amir1: gimme another 10 minutes and you'll be free
[14:14:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P64536 and previous config saved to /var/cache/conftool/dbconfig/20240610-141422-marostegui.json
[14:14:55] <wikibugs>	 (03PS8) 10Giuseppe Lavagetto: mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265)
[14:15:17] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] mpic: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041120 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol)
[14:15:28] <wikibugs>	 (03CR) 10Brouberol: [C:03+2] datasets-config: add securityContext to all containers [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041119 (https://phabricator.wikimedia.org/T362978) (owner: 10Brouberol)
[14:15:29] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [V:03+2 C:03+2] mw-debug: add statsd service everywhere [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039234 (https://phabricator.wikimedia.org/T365265) (owner: 10Giuseppe Lavagetto)
[14:15:37] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+1] thumbor: use bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041123 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan)
[14:18:36] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config-next: apply
[14:18:45] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config-next: apply
[14:18:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[14:18:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[14:18:55] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config: apply
[14:19:04] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/datasets-config: apply
[14:19:12] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/datasets-config: apply
[14:19:28] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[14:19:37] <logmsgbot>	 !log brouberol@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[14:21:30] <wikibugs>	 (03CR) 10Hnowlan: [C:03+2] thumbor: use bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041123 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan)
[14:22:20] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor: use bookworm [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041123 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan)
[14:23:44] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply
[14:23:51] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[14:25:37] <wikibugs>	 06SRE, 10SRE-swift-storage, 06Data-Persistence, 06DBA, and 2 others: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f7-eqiad - https://phabricator.wikimedia.org/T365984#9875546 (10cmooney) p:05Triage→03Medium
[14:28:01] <wikibugs>	 (03CR) 10Scott French: [C:03+2] proton: drop replicas from 12 to 10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040221 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[14:28:45] <jinxer-wm>	 RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[14:28:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[14:28:45] <wikibugs>	 (03Merged) 10jenkins-bot: proton: drop replicas from 12 to 10 [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040221 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[14:28:53] <wikibugs>	 (03PS1) 10Brouberol: rdf-streaming-updater: remove from dse-k8s-eqiad [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041131
[14:29:21] <wikibugs>	 10SRE-tools, 06Infrastructure-Foundations: Add option to exclude nodes from reboot by uptime or last reboot date - https://phabricator.wikimedia.org/T366797#9875574 (10Volans) p:05Triage→03Medium
[14:29:32] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180', diff saved to https://phabricator.wikimedia.org/P64537 and previous config saved to /var/cache/conftool/dbconfig/20240610-142931-marostegui.json
[14:30:41] <wikibugs>	 (03PS1) 10EoghanGaffney: quickdatacopy: Add optional parameter for setting destination path [puppet] - 10https://gerrit.wikimedia.org/r/1041137
[14:31:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] quickdatacopy: Add optional parameter for setting destination path [puppet] - 10https://gerrit.wikimedia.org/r/1041137 (owner: 10EoghanGaffney)
[14:31:42] <elukey>	 jouncebot: next
[14:31:42] <jouncebot>	 In 0 hour(s) and 58 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1530)
[14:31:51] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
[14:31:55] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.drain-node (exit_code=99) for draining ganeti node ganeti2023.codfw.wmnet
[14:32:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2023.codfw.wmnet
[14:32:08] <_joe_>	 Amir1: please go on if it wasn't clear heh
[14:32:20] <wikibugs>	 (03CR) 10Elukey: [C:03+2] services: update changeprop's Docker images [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041105 (https://phabricator.wikimedia.org/T356252) (owner: 10Elukey)
[14:33:16] <Amir1>	 Thank you
[14:33:18] <wikibugs>	 (03PS1) 10Eevans: aqs: Upgrade aqs1010 to Java 11 (canary) [puppet] - 10https://gerrit.wikimedia.org/r/1041138 (https://phabricator.wikimedia.org/T350567)
[14:33:34] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti1019.eqiad.wmnet
[14:34:32] <wikibugs>	 (03CR) 10Scott French: [C:03+2] admin_ng: bump CPU resourcequota for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040220 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[14:34:35] <wikibugs>	 (03PS2) 10EoghanGaffney: quickdatacopy: Add optional parameter for setting destination path [puppet] - 10https://gerrit.wikimedia.org/r/1041137
[14:35:35] <wikibugs>	 (03PS1) 10Hnowlan: Adapt entrypoint-prod to bullseye + blubber path [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1041140 (https://phabricator.wikimedia.org/T355020)
[14:36:52] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[14:36:54] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] START helmfile.d/services/changeprop: sync
[14:37:05] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1008.eqiad.wmnet with reason: Maintenance
[14:37:31] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: bump CPU resourcequota for proton [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040220 (https://phabricator.wikimedia.org/T362978) (owner: 10Scott French)
[14:38:42] <logmsgbot>	 !log elukey@deploy1002 helmfile [staging] DONE helmfile.d/services/changeprop: sync
[14:38:45] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:45] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[14:40:03] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2023.codfw.wmnet
[14:40:17] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+1] Adapt entrypoint-prod to bullseye + blubber path [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1041140 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan)
[14:41:12] <wikibugs>	 (03PS1) 10Brouberol: superset: replace IP-based networkpolicy by its service counterpart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041142 (https://phabricator.wikimedia.org/T331894)
[14:41:15] <wikibugs>	 (03CR) 10JMeybohm: deployment_server: alert on admin-ng pending changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[14:41:15] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[14:41:19] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti1019.eqiad.wmnet
[14:41:32] <wikibugs>	 (03PS2) 10Brouberol: superset: replace IP-based networkpolicy by its service counterpart [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041142 (https://phabricator.wikimedia.org/T331894)
[14:41:38] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:03+1] Adapt entrypoint-prod to bullseye + blubber path [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/1041140 (https://phabricator.wikimedia.org/T355020) (owner: 10Hnowlan)
[14:41:50] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[14:41:55] <icinga-wm_>	 PROBLEM - Host kubestagemaster2005 is DOWN: PING CRITICAL - Packet loss = 100%
[14:42:25] <logmsgbot>	 !log swfrench@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[14:43:13] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on elastic1107.eqiad.wmnet with reason: T365982
[14:43:16] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[14:43:18] <stashbot>	 T365982: Upgrade EVPN switches Eqiad row E-F to JunOS 22.2 - lsw1-f5-eqiad - https://phabricator.wikimedia.org/T365982
[14:43:22] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop: sync
[14:43:27] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on elastic1107.eqiad.wmnet with reason: T365982
[14:43:45] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service kubestagemaster2005:6443 has failed probes (http_staging_codfw_kube_apiserver_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#kubestagemaster2005:6443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:43:45] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/1041138 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans)
[14:44:01] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop: sync
[14:44:27] <wikibugs>	 (03CR) 10Eevans: [C:03+2] aqs: Upgrade aqs1010 to Java 11 (canary) [puppet] - 10https://gerrit.wikimedia.org/r/1041138 (https://phabricator.wikimedia.org/T350567) (owner: 10Eevans)
[14:44:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1180 (T364069)', diff saved to https://phabricator.wikimedia.org/P64538 and previous config saved to /var/cache/conftool/dbconfig/20240610-144439-marostegui.json
[14:44:43] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[14:44:44] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[14:44:54] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[14:44:56] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1187.eqiad.wmnet with reason: Maintenance
[14:45:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1187 (T364069)', diff saved to https://phabricator.wikimedia.org/P64539 and previous config saved to /var/cache/conftool/dbconfig/20240610-144501-marostegui.json
[14:45:04] <wikibugs>	 (03CR) 10Kamila Součková: [C:03+1] "yespls :D" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041122 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert)
[14:45:10] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2023.codfw.wmnet
[14:45:25] <icinga-wm_>	 RECOVERY - Host kubestagemaster2005 is UP: PING OK - Packet loss = 0%, RTA = 30.42 ms
[14:45:25] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2023.codfw.wmnet
[14:45:33] <wikibugs>	 (03CR) 10Ladsgroup: [C:03+2] errorpages: Add dark mode support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041091 (owner: 10Ebrahim)
[14:45:39] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[14:45:46] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service ganeti2023:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:45:51] <wikibugs>	 (03CR) 10Clément Goubert: [C:03+2] shellbox: Bump shellbox image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041122 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert)
[14:46:13] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[14:46:23] <wikibugs>	 (03Merged) 10jenkins-bot: errorpages: Add dark mode support [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041091 (owner: 10Ebrahim)
[14:46:42] <logmsgbot>	 !log cdobbins@cumin1002 conftool action : set/pooled=no; selector: name=cp4046.ulsfo.wmnet
[14:46:49] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] "Copied votes on follow-up patch sets have been updated:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041091 (owner: 10Ebrahim)
[14:46:57] <jinxer-wm>	 FIRING: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:47:03] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:1041091|errorpages: Add dark mode support]]
[14:47:06] <wikibugs>	 (03Merged) 10jenkins-bot: shellbox: Bump shellbox image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041122 (https://phabricator.wikimedia.org/T362518) (owner: 10Clément Goubert)
[14:47:56] <urandom>	 !log aqs1010: restarting cassandra to apply upgrade to Java 11 — T350567
[14:48:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:48:01] <stashbot>	 T350567: Migrate Cassandra to Java 11 - https://phabricator.wikimedia.org/T350567
[14:48:42] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-restart for nodes matching aqs1010.eqiad.wmnet: Apply update to Java 11 - eevans@cumin1002
[14:48:45] <jinxer-wm>	 RESOLVED: [3x] ProbeDown: Service ganeti2023:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[14:48:45] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:48:59] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/shellbox: apply
[14:49:35] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply
[14:49:42] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C:04-1] "Looks pretty nice, thanks!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu)
[14:50:15] <logmsgbot>	 !log swfrench@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply
[14:50:36] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox: apply
[14:50:53] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop: apply
[14:51:28] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-constraints: apply
[14:51:40] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-constraints: apply
[14:51:40] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop: apply
[14:51:45] <logmsgbot>	 !log cdobbins@cumin1002 START - Cookbook sre.hosts.reboot-single for host cp4046.ulsfo.wmnet
[14:51:47] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-media: apply
[14:51:57] <jinxer-wm>	 RESOLVED: KubernetesCalicoDown: kubestagemaster2005.codfw.wmnet is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestagemaster2005.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:52:04] <ChrisDobbins901_>	 !log sudo -i cookbook sre.hosts.reboot-single -r 'Kernel upgrade' 'P{cp4046.*}'
[14:52:05] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-media: apply
[14:52:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:16] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-syntaxhighlight: apply
[14:52:35] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[14:52:41] <moritzm>	 !log powercycling ganeti1019, stuck on reboot
[14:52:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:52:52] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] START helmfile.d/services/shellbox-timeline: apply
[14:53:13] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [staging] DONE helmfile.d/services/shellbox-timeline: apply
[14:53:31] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[14:53:56] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.drain-node for draining ganeti node ganeti2024.codfw.wmnet
[14:54:11] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and ebrahim: Backport for [[gerrit:1041091|errorpages: Add dark mode support]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:54:37] <logmsgbot>	 !log elukey@deploy1002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[14:55:06] <wikibugs>	 (03CR) 10Bking: [C:03+1] "feel free to merge once the dependent patch is merged." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041142 (https://phabricator.wikimedia.org/T331894) (owner: 10Brouberol)
[14:55:15] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and ebrahim: Continuing with sync
[14:55:16] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox: apply
[14:55:46] <jinxer-wm>	 FIRING: [2x] JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:55:49] <wikibugs>	 (PS1) Ahmon Dancy: fix-staging-perms.sh: Add missing -r to an xargs call [puppet] - https://gerrit.wikimedia.org/r/1041145 (https://phabricator.wikimedia.org/T364309)
[14:56:04] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox: apply
[14:56:09] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching aqs1010.eqiad.wmnet: Apply update to Java 11 - eevans@cumin1002
[14:56:10] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-constraints: apply
[14:56:31] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-constraints: apply
[14:56:38] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-media: apply
[14:57:13] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-media: apply
[14:57:19] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-syntaxhighlight: apply
[14:57:20] <wikibugs>	 (CR) Hnowlan: [C:+2] Adapt entrypoint-prod to bullseye + blubber path [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1041140 (https://phabricator.wikimedia.org/T355020) (owner: Hnowlan)
[14:57:30] <wikibugs>	 (PS1) Majavah: hieradata: cloudweb: Fix LVS service name [puppet] - https://gerrit.wikimedia.org/r/1041146
[14:58:05] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[14:58:11] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] START helmfile.d/services/shellbox-timeline: apply
[14:59:29] <wikibugs>	 (CR) Majavah: [V:+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2843/console" [puppet] - https://gerrit.wikimedia.org/r/1041146 (owner: Majavah)
[14:59:37] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [codfw] DONE helmfile.d/services/shellbox-timeline: apply
[14:59:41] <wikibugs>	 (CR) Majavah: [V:+1 C:+2] hieradata: cloudweb: Fix LVS service name [puppet] - https://gerrit.wikimedia.org/r/1041146 (owner: Majavah)
[14:59:50] <wikibugs>	 SRE, SRE-swift-storage, Infrastructure-Foundations, netops, Traffic: Rise in ms-fe2* TCP retransmits since 11:40 UTC today - https://phabricator.wikimedia.org/T367056#9875713 (MatthewVernon)
[14:59:59] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[15:00:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti2024.codfw.wmnet
[15:00:40] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox: apply
[15:00:45] <wikibugs>	 (Merged) jenkins-bot: Adapt entrypoint-prod to bullseye + blubber path [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/1041140 (https://phabricator.wikimedia.org/T355020) (owner: Hnowlan)
[15:01:25] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox: apply
[15:01:27] <icinga-wm_>	 PROBLEM - Host ganeti1019 is DOWN: PING CRITICAL - Packet loss = 100%
[15:01:31] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-constraints: apply
[15:01:35] <logmsgbot>	 !log cdobbins@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cp4046.ulsfo.wmnet
[15:01:36] <logmsgbot>	 !log elukey@deploy1002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:01:49] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-constraints: apply
[15:01:55] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-media: apply
[15:02:15] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-media: apply
[15:02:21] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-syntaxhighlight: apply
[15:02:43] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-syntaxhighlight: apply
[15:02:47] <icinga-wm_>	 PROBLEM - Host kubetcd2004 is DOWN: PING CRITICAL - Packet loss = 100%
[15:02:49] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] START helmfile.d/services/shellbox-timeline: apply
[15:03:19] <logmsgbot>	 !log cgoubert@deploy1002 helmfile [eqiad] DONE helmfile.d/services/shellbox-timeline: apply
[15:03:45] <jinxer-wm>	 FIRING: [3x] ProbeDown: Service ganeti1019:1811 has failed probes (tcp_ganeti_noded_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:04:19] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:1041091|errorpages: Add dark mode support]] (duration: 17m 15s)
[15:04:41] <icinga-wm_>	 PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:05:06] <logmsgbot>	 !log cdobbins@cumin1002 conftool action : set/pooled=yes; selector: name=cp4046.ulsfo.wmnet
[15:05:06] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to cassandra-staging-devs for milimetric - https://phabricator.wikimedia.org/T365074#9875754 (KOfori) a:KOfori→WDoranWMF @WDoranWMF please check this out and let me know if this has your approval before I approve.
[15:05:27] <icinga-wm_>	 RECOVERY - Host kubetcd2004 is UP: PING OK - Packet loss = 0%, RTA = 30.81 ms
[15:07:13] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti2024.codfw.wmnet
[15:07:18] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.drain-node (exit_code=0) for draining ganeti node ganeti2024.codfw.wmnet
[15:08:45] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service ganeti1019:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:09:52] <wikibugs>	 (PS1) Ladsgroup: Bump portals to HEAD [mediawiki-config] - https://gerrit.wikimedia.org/r/1041151
[15:10:57] <wikibugs>	 (CR) Ladsgroup: [C:+2] Bump portals to HEAD [mediawiki-config] - https://gerrit.wikimedia.org/r/1041151 (owner: Ladsgroup)
[15:11:37] <wikibugs>	 (Merged) jenkins-bot: Bump portals to HEAD [mediawiki-config] - https://gerrit.wikimedia.org/r/1041151 (owner: Ladsgroup)
[15:11:42] <wikibugs>	 ops-codfw, SRE, DC-Ops, Observability-Metrics: Memory upgrade request for prometheus200[56] - https://phabricator.wikimedia.org/T360895#9875773 (Jhancock.wm) I got the ram out of the servers. I can do the addition tomorrow morning at 8am CDT (1600 UTC) or another time that works for you.
[15:12:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: httpbb_hourly_appserver.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:13:45] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Juniper QFX5120 error logs on lsw1-e1 and lsw1-f1: Failed to get ifl for ifl index - https://phabricator.wikimedia.org/T325801#9875780 (cmooney) Open→Resolved We seem to have no such errors being logged any more, either from these switches or the d...
[15:14:41] <icinga-wm_>	 RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[15:16:04] <effie>	 jouncebot: now
[15:16:04] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 13 minute(s)
[15:16:13] <effie>	 jouncebot: next
[15:16:13] <jouncebot>	 In 0 hour(s) and 13 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1530)
[15:17:25] <jinxer-wm>	 RESOLVED: SystemdUnitFailed: httpbb_hourly_appserver.service on cumin2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:17:51] <wikibugs>	 (PS8) Ladsgroup: Change static footer icons to the new one [mediawiki-config] - https://gerrit.wikimedia.org/r/1027150 (https://phabricator.wikimedia.org/T256190) (owner: Jforrester)
[15:19:13] <bd808>	 effie: `jouncebot: nowandnext` is a sneaky shortcut for that set of lookups
[15:20:15] <effie>	 hahaha, I know, I think I am just used to making 2 requests, keeping the bot busy:p
[15:20:59] <wikibugs>	 (CR) MVernon: [C:+2] wmflib: add Wmflib::IP::Address::CIDR type [puppet] - https://gerrit.wikimedia.org/r/1041046 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[15:22:39] <wikibugs>	 (PS1) Filippo Giunchedi: logstash: align benthos mw-accesslog-sampler consumer group [puppet] - https://gerrit.wikimedia.org/r/1041155 (https://phabricator.wikimedia.org/T366308)
[15:22:54] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized portals/wikipedia.org/assets: (no justification provided) (duration: 10m 28s)
[15:24:02] <bd808>	 effie: fair enough. :)
[15:24:23] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071 (MoritzMuehlenhoff) NEW
[15:24:37] <wikibugs>	 SRE, Ganeti, Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9875829 (MoritzMuehlenhoff) p:Triage→Medium
[15:27:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[15:27:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[15:28:15] <icinga-wm_>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1033 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:29:00] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply
[15:29:04] <wikibugs>	 (Abandoned) Elukey: Remove response_code label from totals in Lift Wing Availability SLOs [grafana-grizzly] - https://gerrit.wikimedia.org/r/1010196 (owner: Elukey)
[15:30:04] <jouncebot>	 jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1530).
[15:30:51] <wikibugs>	 (PS1) Elukey: slo_template: update the SLO window [grafana-grizzly] - https://gerrit.wikimedia.org/r/1041159
[15:30:55] <wikibugs>	 ops-codfw, SRE, DC-Ops, Observability-Metrics: Memory upgrade request for prometheus200[56] - https://phabricator.wikimedia.org/T360895#9875848 (fgiunchedi) >>! In T360895#9875773, @Jhancock.wm wrote: > I got the ram out of the servers. I can do the addition tomorrow morning at 8am CDT (1600 UTC)...
[15:30:57] <wikibugs>	 (PS1) Hnowlan: thumbor: new version with entrypoint script fix [deployment-charts] - https://gerrit.wikimedia.org/r/1041160 (https://phabricator.wikimedia.org/T355020)
[15:31:05] <wikibugs>	 (CR) CI reject: [V:-1] thumbor: new version with entrypoint script fix [deployment-charts] - https://gerrit.wikimedia.org/r/1041160 (https://phabricator.wikimedia.org/T355020) (owner: Hnowlan)
[15:31:07] <logmsgbot>	 !log swfrench@deploy1002 helmfile [eqiad] DONE helmfile.d/services/proton: apply
[15:31:12] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:cassandra-dev
[15:32:27] <wikibugs>	 ops-codfw, SRE, DC-Ops, Observability-Metrics: Memory upgrade request for prometheus200[56] - https://phabricator.wikimedia.org/T360895#9875853 (herron) >>! In T360895#9875773, @Jhancock.wm wrote: > I got the ram out of the servers. I can do the addition tomorrow morning at 8am CDT (1600 UTC) or...
[15:32:35] <wikibugs>	 ops-codfw, SRE-OnFire, DC-Ops, serviceops, Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9875854 (kamila) @Papaul could you please let me know when would be a good time for you to do this? We don't have any specific...
[15:33:09] <wikibugs>	 (PS1) JMeybohm: flink-operator: add securityContext [deployment-charts] - https://gerrit.wikimedia.org/r/1041161 (https://phabricator.wikimedia.org/T362978)
[15:33:29] <wikibugs>	 (PS2) Hnowlan: thumbor: new version with entrypoint script fix [deployment-charts] - https://gerrit.wikimedia.org/r/1041160 (https://phabricator.wikimedia.org/T355020)
[15:34:06] <wikibugs>	 SRE, collaboration-services, Release-Engineering-Team, Traffic, Patch-For-Review: Move GitLab behind the CDN - https://phabricator.wikimedia.org/T366882#9875862 (LSobanski) p:Triage→High
[15:34:14] <logmsgbot>	 !log ladsgroup@deploy1002 Synchronized portals: (no justification provided) (duration: 11m 20s)
[15:34:39] <wikibugs>	 (CR) Kamila Součková: [C:+1] thumbor: new version with entrypoint script fix [deployment-charts] - https://gerrit.wikimedia.org/r/1041160 (https://phabricator.wikimedia.org/T355020) (owner: Hnowlan)
[15:35:07] <icinga-wm_>	 RECOVERY - Host ganeti1019 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[15:36:45] <wikibugs>	 (CR) Hnowlan: [C:+2] thumbor: new version with entrypoint script fix [deployment-charts] - https://gerrit.wikimedia.org/r/1041160 (https://phabricator.wikimedia.org/T355020) (owner: Hnowlan)
[15:37:28] <wikibugs>	 (CR) Klausman: [C:+1] slo_template: update the SLO window [grafana-grizzly] - https://gerrit.wikimedia.org/r/1041159 (owner: Elukey)
[15:37:41] <wikibugs>	 (Merged) jenkins-bot: thumbor: new version with entrypoint script fix [deployment-charts] - https://gerrit.wikimedia.org/r/1041160 (https://phabricator.wikimedia.org/T355020) (owner: Hnowlan)
[15:38:29] <wikibugs>	 (PS2) Elukey: slo_template: update the SLO window [grafana-grizzly] - https://gerrit.wikimedia.org/r/1041159
[15:38:45] <jinxer-wm>	 RESOLVED: ProbeDown: Service ganeti1019:1811 has failed probes (tcp_ganeti_noded_ip4) - https://wikitech.wikimedia.org/wiki/Ganeti - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[15:39:00] <wikibugs>	 (CR) Elukey: "Added the wrong month :( (sept instead of August)" [grafana-grizzly] - https://gerrit.wikimedia.org/r/1041159 (owner: Elukey)
[15:39:19] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073 (amastilovic) NEW
[15:39:25] <wikibugs>	 ops-codfw, SRE, DC-Ops, Observability-Metrics: Memory upgrade request for prometheus200[56] - https://phabricator.wikimedia.org/T360895#9875901 (Jhancock.wm) Yes that would work.
[15:40:18] <wikibugs>	 (PS3) Elukey: slo_template: update the SLO window [grafana-grizzly] - https://gerrit.wikimedia.org/r/1041159
[15:40:36] <wikibugs>	 (CR) Elukey: "And August has 31 days, not 30.. Good job Luca :D" [grafana-grizzly] - https://gerrit.wikimedia.org/r/1041159 (owner: Elukey)
[15:40:51] <marostegui>	 !log Drop flaggedpage_pending from s6 T365568
[15:40:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:40:56] <stashbot>	 T365568: Drop flaggedpage_pending from production - https://phabricator.wikimedia.org/T365568
[15:41:40] <marostegui>	 !log Drop flaggedpage_pending from s7 T365568
[15:41:46] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:41:51] <godog>	 !log bounce benthos@mw_accesslog_metrics.service on centrallog hosts
[15:41:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:42:45] <jinxer-wm>	 RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[15:42:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[15:42:49] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/thumbor: apply
[15:42:57] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/thumbor: apply
[15:43:13] <wikibugs>	 SRE, SRE-Access-Requests: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073#9875920 (Ottomata)
[15:43:35] <marostegui>	 !log Drop flaggedpage_pending from s2 T365568
[15:43:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:44:23] <wikibugs>	 (PS1) MVernon: cephadm: template out cephadm spec files [puppet] - https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621)
[15:44:29] <icinga-wm_>	 PROBLEM - MD RAID on ganeti1019 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 0, Spare: 0 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[15:44:30] <icinga-wm_>	 ACKNOWLEDGEMENT - MD RAID on ganeti1019 is CRITICAL: CRITICAL: State: degraded, Active: 10, Working: 10, Failed: 0, Spare: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T367075 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering
[15:44:38] <wikibugs>	 ops-eqiad, SRE, DC-Ops: Degraded RAID on ganeti1019 - https://phabricator.wikimedia.org/T367075 (ops-monitoring-bot) NEW
[15:44:43] <wikibugs>	 (CR) CI reject: [V:-1] cephadm: template out cephadm spec files [puppet] - https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[15:46:24] <marostegui>	 !log Drop flaggedpage_pending from s5 T365568
[15:46:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:46:28] <stashbot>	 T365568: Drop flaggedpage_pending from production - https://phabricator.wikimedia.org/T365568
[15:46:52] <wikibugs>	 (PS1) Clément Goubert: docker_registry_ha: Bump nginx worker_rlimit_nofile [puppet] - https://gerrit.wikimedia.org/r/1041164 (https://phabricator.wikimedia.org/T366481)
[15:47:11] <marostegui>	 !log Drop flaggedpage_pending from s3 T365568
[15:47:14] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[15:47:15] <wikibugs>	 (CR) Scott French: "Thanks, Janis! It looks like you might also need to add base.helper.restrictedSecurityContext onto the containers in `developer-portal/tem" [deployment-charts] - https://gerrit.wikimedia.org/r/1041039 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[15:52:42] <wikibugs>	 (CR) Scott French: [C:+1] linkrecommendation: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041049 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[15:53:20] <wikibugs>	 (CR) Scott French: [C:+1] machinetranslation: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041055 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[15:54:12] <wikibugs>	 (PS3) MVernon: cephadm: template out cephadm spec files [puppet] - https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621)
[15:54:28] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[15:54:33] <wikibugs>	 (CR) CI reject: [V:-1] cephadm: template out cephadm spec files [puppet] - https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[15:55:59] <wikibugs>	 (CR) Elukey: [V:+2 C:+2] slo_template: update the SLO window [grafana-grizzly] - https://gerrit.wikimedia.org/r/1041159 (owner: Elukey)
[15:57:13] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073#9876016 (Ottomata)
[15:57:27] <wikibugs>	 (CR) Scott French: [C:+1] python-webapp: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041072 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[15:57:41] <wikibugs>	 (PS1) Ottomata: data.yaml - Add amastilovic to deployment user group [puppet] - https://gerrit.wikimedia.org/r/1041165 (https://phabricator.wikimedia.org/T367073)
[15:58:15] <icinga-wm_>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1033 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:58:25] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073#9876021 (Ottomata) ^ patch to do this once approved.
[15:59:38] <wikibugs>	 (CR) Scott French: "Thanks, Janis! Looks like this might need updates to calculator-service/templates/deployment.yaml as well?" [deployment-charts] - https://gerrit.wikimedia.org/r/1041076 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[16:00:01] <logmsgbot>	 !log eevans@cumin1002 END (PASS) - Cookbook sre.cassandra.roll-reboot (exit_code=0) rolling reboot on A:cassandra-dev
[16:00:12] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[16:00:21] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073#9876024 (Ottomata) @thcipriani for group approver
[16:00:41] <wikibugs>	 SRE, MW-on-K8s, Observability-Logging, serviceops: benthos mw-accesslog-metrics kafka lag and interpolation errors - https://phabricator.wikimedia.org/T367076#9876028 (fgiunchedi)
[16:01:10] <cdanis>	 !log 💙cdanis@puppetserver2001.codfw.wmnet ~ 🕛☕ sudo systemctl restart sync-puppet-volatile
[16:01:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:05:16] <wikibugs>	 (PS4) MVernon: cephadm: template out cephadm spec files [puppet] - https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621)
[16:05:51] <cdanis>	 !log 💙cdanis@cumin1002.eqiad.wmnet ~ 🕛☕ sudo cumin -b 8 '*.codfw.wmnet and C:geoip::data::puppet%fetch_ipinfo_dbs=true' 'sha512sum /usr/share/GeoIPInfo/GeoLite2-ASN.mmdb || run-puppet-agent'
[16:05:53] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:06:28] <wikibugs>	 ops-codfw, SRE-OnFire, DC-Ops, serviceops, Sustainability (Incident Followup): codfw:(3) wikikube-ctrl NIC upgrade to 10G - https://phabricator.wikimedia.org/T366205#9876057 (Papaul) @kamila ? There is some planning that we need to do around this. We will need to relocate those servers for...
[16:09:42] <wikibugs>	 (CR) Dzahn: [C:+1] Remove iegreview module [puppet] - https://gerrit.wikimedia.org/r/1040873 (https://phabricator.wikimedia.org/T334415) (owner: Muehlenhoff)
[16:13:49] <wikibugs>	 (CR) Scott French: [C:+1] "Thanks, Janis! Just to confirm, the pod-level securityContext reverting to the chart defaults for runAsUser/Group (9999) should be a noop " [deployment-charts] - https://gerrit.wikimedia.org/r/1041161 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[16:14:45] <wikibugs>	 (CR) MVernon: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621) (owner: MVernon)
[16:20:47] <marostegui>	 !log Drop flaggedpage_pending from s1 T365568
[16:20:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:20:51] <stashbot>	 T365568: Drop flaggedpage_pending from production - https://phabricator.wikimedia.org/T365568
[16:21:18] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[16:26:51] <logmsgbot>	 !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[16:38:43] <claime>	 .28
[16:43:05] <wikibugs>	 (CR) Vgutierrez: [V:+1] "latest PS tested on WMCS and it's working as expected for several interfaces and on IPv4 only realservers as well" [puppet] - https://gerrit.wikimedia.org/r/1035724 (https://phabricator.wikimedia.org/T365689) (owner: Vgutierrez)
[16:46:26] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073#9876336 (Ahoelzl) Approved.
[16:49:38] <wikibugs>	 (CR) JMeybohm: "Yeah, correct. The more precise settings (e.g. the one on container level) win." [deployment-charts] - https://gerrit.wikimedia.org/r/1041161 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[16:49:52] <wikibugs>	 (CR) JMeybohm: deployment_server: alert on admin-ng pending changes (2 comments) [puppet] - https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894) (owner: Brouberol)
[16:58:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T364069)', diff saved to https://phabricator.wikimedia.org/P64543 and previous config saved to /var/cache/conftool/dbconfig/20240610-165806-marostegui.json
[16:58:12] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[17:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1700)
[17:00:04] <jouncebot>	 ryankemper: Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T1700). Please do the needful.
[17:00:09] <wikibugs>	 SRE, Observability-Metrics, Goal, Patch-For-Review: Fully migrate producers off statsd - https://phabricator.wikimedia.org/T205870#9876416 (elukey) @colewhite o/ I finally deployed recommendation-api, and this time it looks good. I updated also its dashboard:  https://grafana.wikimedia.org/d/Y5wk...
[17:01:36] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073#9876424 (Ottomata)
[17:01:43] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance
[17:01:57] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2200.codfw.wmnet with reason: Maintenance
[17:02:08] <wikibugs>	 (CR) JMeybohm: [C:+1] docker_registry_ha: Bump nginx worker_rlimit_nofile [puppet] - https://gerrit.wikimedia.org/r/1041164 (https://phabricator.wikimedia.org/T366481) (owner: Clément Goubert)
[17:02:32] <wikibugs>	 (CR) Clément Goubert: [C:+2] docker_registry_ha: Bump nginx worker_rlimit_nofile [puppet] - https://gerrit.wikimedia.org/r/1041164 (https://phabricator.wikimedia.org/T366481) (owner: Clément Goubert)
[17:02:42] <wikibugs>	 SRE, SRE-Access-Requests, Data-Engineering, Patch-For-Review: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073#9876421 (ttaylor) Approving in @thcipriani 's place since he is on vacation.
[17:06:25] <wikibugs>	 (CR) Ottomata: [V:+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2850/co" [puppet] - https://gerrit.wikimedia.org/r/1041165 (https://phabricator.wikimedia.org/T367073) (owner: Ottomata)
[17:07:35] <wikibugs>	 (CR) Scott French: [C:+1] mcrouter: Bump chart modules [deployment-charts] - https://gerrit.wikimedia.org/r/1038859 (https://phabricator.wikimedia.org/T362978) (owner: Alexandros Kosiaris)
[17:08:40] <wikibugs>	 (PS2) JMeybohm: developer-portal: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041039 (https://phabricator.wikimedia.org/T362978)
[17:09:14] <wikibugs>	 (CR) JMeybohm: "Absolutely, yes. Thanks for spotting this." [deployment-charts] - https://gerrit.wikimedia.org/r/1041039 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[17:12:09] <wikibugs>	 (CR) Ottomata: [V:+1 C:+2] data.yaml - Add amastilovic to deployment user group [puppet] - https://gerrit.wikimedia.org/r/1041165 (https://phabricator.wikimedia.org/T367073) (owner: Ottomata)
[17:13:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P64544 and previous config saved to /var/cache/conftool/dbconfig/20240610-171313-marostegui.json
[17:16:22] <wikibugs>	 (CR) Scott French: [C:+1] developer-portal: add securityContext to all containers [deployment-charts] - https://gerrit.wikimedia.org/r/1041039 (https://phabricator.wikimedia.org/T362978) (owner: JMeybohm)
[17:20:22] <wikibugs>	 (PS3) Cmelo: Configures the necessary user rights for CampaignEvents on swahili [mediawiki-config] - https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502)
[17:23:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[17:23:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[17:24:00] <wikibugs>	 (PS4) Cmelo: Configures the necessary user rights for CampaignEvents on swahili [mediawiki-config] - https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502)
[17:25:10] <wikibugs>	 (CR) Cmelo: Configures the necessary user rights for CampaignEvents on swahili (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502) (owner: Cmelo)
[17:25:14] <logmsgbot>	 !log dancy@deploy1002 Installing scap version "4.87.0" for 285 hosts
[17:26:26] <wikibugs>	 (03CR) 10Daimona Eaytoy: [C:04-1] Configures the necessary user rights for CampaignEvents on swahili (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502) (owner: 10Cmelo)
[17:28:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187', diff saved to https://phabricator.wikimedia.org/P64545 and previous config saved to /var/cache/conftool/dbconfig/20240610-172820-marostegui.json
[17:29:19] <logmsgbot>	 !log amastilovic@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[17:29:30] <logmsgbot>	 !log amastilovic@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[17:30:15] <logmsgbot>	 !log dancy@deploy1002 Installation of scap version "4.87.0" completed for 285 hosts
[17:34:42] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[17:36:59] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[17:37:02] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[17:37:15] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073#9876692 (10amastilovic) Merged and applied - done
[17:38:45] <jinxer-wm>	 RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[17:38:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[17:42:29] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting access to Kubernetes deployment for amastilovic - https://phabricator.wikimedia.org/T367073#9876716 (10Ottomata) 05Open→03Resolved a:03Ottomata
[17:43:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1187 (T364069)', diff saved to https://phabricator.wikimedia.org/P64546 and previous config saved to /var/cache/conftool/dbconfig/20240610-174327-marostegui.json
[17:43:30] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[17:43:32] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[17:43:43] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1201.eqiad.wmnet with reason: Maintenance
[17:43:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1201 (T364069)', diff saved to https://phabricator.wikimedia.org/P64547 and previous config saved to /var/cache/conftool/dbconfig/20240610-174349-marostegui.json
[17:46:53] <logmsgbot>	 !log amastilovic@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[17:47:06] <logmsgbot>	 !log amastilovic@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[17:50:22] <logmsgbot>	 !log amastilovic@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
[17:50:29] <logmsgbot>	 !log amastilovic@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[17:57:38] <wikibugs>	 (03CR) 10Aleksandar Mastilovic: "Deployed to eqiad and codfw. Deployed to staging too, but k8s showed no pods/resources running." [deployment-charts] - 10https://gerrit.wikimedia.org/r/1039727 (https://phabricator.wikimedia.org/T362978) (owner: 10JMeybohm)
[18:01:11] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9876800 (10BCornwall)
[18:01:33] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists, 07Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208#9876816 (10Ladsgroup) @Dzahn The issue was that the change made the config invalid, since it was invalid, it didn't restart the apache. But then later...
[18:02:12] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9876817 (10MoritzMuehlenhoff) a:03Jclark-ctr All VMs moved off the server. DC ops, can you please have a look? Not sure what "unsupported event" means, never seen that be...
[18:02:36] <wikibugs>	 10ops-eqiad, 06SRE, 06DC-Ops, 10Ganeti, 06Infrastructure-Foundations: ganeti1019 is down - https://phabricator.wikimedia.org/T367071#9876819 (10MoritzMuehlenhoff)
[18:06:24] <wikibugs>	 10ops-codfw, 06SRE, 10SRE-tools, 06DC-Ops, and 2 others: Netbox errors caused by system board replacement - https://phabricator.wikimedia.org/T358542#9876833 (10wiki_willy) Thanks @Volans, will do on the remaining Netbox errors.   >>! In T358542#9874557, @Volans wrote: > This is now completed. The new...
[18:11:44] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)
[18:11:49] <stashbot>	 T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069
[18:17:43] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists, 07Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208#9876889 (10Dzahn) Gotcha! Yea, so.. I would normally support the idea of adding an Icinga check.  Except my concern is that Icinga doesn't effectively...
[18:17:46] <wikibugs>	 (03PS2) 10Snwachukwu: Update Eventgate-Wikimedia and Eventstreams repository to Gitlab source and version [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730)
[18:17:57] <wikibugs>	 (03CR) 10Snwachukwu: "@ltoscano@wikimedia.org Which images/config are you referring to please?" [deployment-charts] - 10https://gerrit.wikimedia.org/r/1040862 (https://phabricator.wikimedia.org/T344730) (owner: 10Snwachukwu)
[18:29:07] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9876950 (10BCornwall)
[18:29:56] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9876951 (10BCornwall)
[18:30:29] <wikibugs>	 10ops-ulsfo, 06SRE, 06DC-Ops, 06Traffic: Q4: install PCIe NVMe SSDs into ulsfo text cp40(3[789]|4[01234] - https://phabricator.wikimedia.org/T364891#9876952 (10BCornwall)
[18:49:00] <wikibugs>	 (03PS11) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718)
[18:55:26] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE - https://phabricator.wikimedia.org/T366558#9877037 (10herron)
[18:55:42] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE - https://phabricator.wikimedia.org/T366558#9877038 (10herron)
[18:55:46] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:56:07] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for Ifrahkhanyaree_WMDE - https://phabricator.wikimedia.org/T366558#9877036 (10herron) Hi @Ifrahkhanyaree_WMDE I see the SSH key in the description is in use already.  Could you please generate a fresh ssh key for production use and...
[18:57:43] <wikibugs>	 06SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for HNordeen - https://phabricator.wikimedia.org/T364801#9877041 (10herron)
[18:58:17] <wikibugs>	 (03PS12) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718)
[19:02:06] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)
[19:02:12] <stashbot>	 T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069
[19:02:49] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)
[19:04:08] <wikibugs>	 06SRE, 10Wikimedia-Mailing-lists, 07Wikimedia-Incident: lists apache config change should trigger an apache reload - https://phabricator.wikimedia.org/T323208#9877057 (10Dzahn) P.S. (and when we merge apache changes and we aren't sure if a puppet refresh is enough for it to take effect, then we should do the...
[19:04:55] <wikibugs>	 (03PS13) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718)
[19:06:16] <wikibugs>	 (03PS1) 10Herron: admin: add hnordeen to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1041198 (https://phabricator.wikimedia.org/T364801)
[19:07:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: add hnordeen to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1041198 (https://phabricator.wikimedia.org/T364801) (owner: 10Herron)
[19:12:43] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T364069)', diff saved to https://phabricator.wikimedia.org/P64550 and previous config saved to /var/cache/conftool/dbconfig/20240610-191242-marostegui.json
[19:12:47] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[19:14:30] <jinxer-wm>	 FIRING: [2x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:15:00] <wikibugs>	 (03PS2) 10Herron: admin: add hnordeen to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1041198 (https://phabricator.wikimedia.org/T364801)
[19:17:22] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/1041198 (https://phabricator.wikimedia.org/T364801) (owner: 10Herron)
[19:18:29] <wikibugs>	 (03PS1) 10Herron: admin: add note/hint for no-ssh no-kereberos accounts [puppet] - 10https://gerrit.wikimedia.org/r/1041200
[19:19:30] <jinxer-wm>	 RESOLVED: [2x] ProbeDown: Service wdqs1014:443 has failed probes (http_wdqs_external_sparql_endpoint_search_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#wdqs1014:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:20:09] <wikibugs>	 (03CR) 10Herron: [C:03+2] admin: add hnordeen to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1041198 (https://phabricator.wikimedia.org/T364801) (owner: 10Herron)
[19:20:22] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] admin: add note/hint for no-ssh no-kereberos accounts [puppet] - 10https://gerrit.wikimedia.org/r/1041200 (owner: 10Herron)
[19:21:32] <wikibugs>	 (03CR) 10Herron: [C:03+2] admin: add note/hint for no-ssh no-kereberos accounts [puppet] - 10https://gerrit.wikimedia.org/r/1041200 (owner: 10Herron)
[19:22:50] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)
[19:22:54] <stashbot>	 T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069
[19:25:33] <wikibugs>	 07Puppet, 06Infrastructure-Foundations: Puppetmaster volatile data not synced to all puppet frontends for a month and a half (2024-04-27 to 2024-06-10) - https://phabricator.wikimedia.org/T367113 (10CDanis) 03NEW
[19:25:46] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users (no kerberos, no ssh) for HNordeen - https://phabricator.wikimedia.org/T364801#9877139 (10herron) 05In progress→03Resolved The patch to provision this access has been merged and will be propagated fully...
[19:27:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P64551 and previous config saved to /var/cache/conftool/dbconfig/20240610-192749-marostegui.json
[19:33:13] <icinga-wm_>	 PROBLEM - rt.wikimedia.org tls expiry on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:33:17] <icinga-wm_>	 PROBLEM - rt.wikimedia.org requires authentication on moscovium is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:33:23] <jinxer-wm>	 FIRING: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:33:29] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] create u4c.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1040144 (https://phabricator.wikimedia.org/T366649) (owner: 10Zabe)
[19:33:34] <wikibugs>	 (03PS2) 10Zabe: create u4c.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1040144 (https://phabricator.wikimedia.org/T366649)
[19:34:05] <mutante>	 i'll take care of the moscovium alerts
[19:34:09] <icinga-wm_>	 RECOVERY - rt.wikimedia.org tls expiry on moscovium is OK: OK - Certificate rt.discovery.wmnet will expire on Tue 25 Jun 2024 02:55:00 PM GMT +0000. https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:34:09] <icinga-wm_>	 RECOVERY - rt.wikimedia.org requires authentication on moscovium is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 537 bytes in 1.578 second response time https://wikitech.wikimedia.org/wiki/CAS-SSO/Administration
[19:35:28] <wikibugs>	 06SRE, 10SRE-Access-Requests, 06Data-Engineering: Requesting permissions for analytics-privatedata-users (with kerberos) for Mareike Heuer - https://phabricator.wikimedia.org/T364715#9877189 (10herron) 05In progress→03Resolved Resolving as the access looks to have been provisioned, please reopen if a...
[19:36:56] <jinxer-wm>	 FIRING: MaxConntrack: Max conntrack at 90.48% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[19:37:44] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Rae Adimer - https://phabricator.wikimedia.org/T365832#9877210 (10herron)
[19:38:23] <jinxer-wm>	 RESOLVED: ProbeDown: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:41:55] <jinxer-wm>	 RESOLVED: MaxConntrack: Max conntrack at 90.48% on krb1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_conntrack - https://grafana.wikimedia.org/d/oITUqwKIk/netfilter-connection-tracking - https://alerts.wikimedia.org/?q=alertname%3DMaxConntrack
[19:42:08] <wikibugs>	 (03PS10) 10Brouberol: deployment_server: alert on admin-ng pending changes [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894)
[19:42:43] <wikibugs>	 (03PS11) 10Brouberol: deployment_server: alert on admin-ng pending changes [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894)
[19:42:58] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201', diff saved to https://phabricator.wikimedia.org/P64552 and previous config saved to /var/cache/conftool/dbconfig/20240610-194256-marostegui.json
[19:45:24] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Tchanders - https://phabricator.wikimedia.org/T366351#9877234 (10herron)
[19:47:41] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Rae Adimer - https://phabricator.wikimedia.org/T365832#9877229 (10herron) (SSH key verification email sent)
[19:47:51] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Tchanders - https://phabricator.wikimedia.org/T366351#9877241 (10herron) a:03JayCano Hi @JayCano, assigning to you for approval. Thanks!
[19:48:43] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9877247 (10herron)
[19:48:53] <wikibugs>	 (03PS14) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718)
[19:53:52] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Rae Adimer - https://phabricator.wikimedia.org/T365832#9877256 (10herron)
[19:54:47] <wikibugs>	 (03PS12) 10Brouberol: deployment_server: alert on admin-ng pending changes [puppet] - 10https://gerrit.wikimedia.org/r/1040992 (https://phabricator.wikimedia.org/T331894)
[19:58:05] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1201 (T364069)', diff saved to https://phabricator.wikimedia.org/P64553 and previous config saved to /var/cache/conftool/dbconfig/20240610-195804-marostegui.json
[19:58:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[19:58:09] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[19:58:20] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1224.eqiad.wmnet with reason: Maintenance
[19:58:26] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1224 (T364069)', diff saved to https://phabricator.wikimedia.org/P64554 and previous config saved to /var/cache/conftool/dbconfig/20240610-195826-marostegui.json
[19:58:45] <jinxer-wm>	 FIRING: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[19:58:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[19:59:03] <wikibugs>	 (03PS2) 10Scott French: kubernetes: alert on persistent unavailable replicas [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932)
[20:00:05] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: OwO what's this, a deployment window?? UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T2000). nyaa~
[20:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:00:40] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T352010)', diff saved to https://phabricator.wikimedia.org/P64555 and previous config saved to /var/cache/conftool/dbconfig/20240610-200039-ladsgroup.json
[20:00:44] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[20:00:51] <wikibugs>	 (03CR) 10CI reject: [V:04-1] kubernetes: alert on persistent unavailable replicas [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French)
[20:03:45] <jinxer-wm>	 RESOLVED: CirrusConsumerRerenderFetchErrorRate: cirrus_streaming_updater_consumer_cloudelastic_eqiad in eqiad (k8s): ...
[20:03:45] <jinxer-wm>	 fetch error (rerenders) rate too high - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=consumer-cloudelastic - https://alerts.wikimedia.org/?q=alertname%3DCirrusConsumerRerenderFetchErrorRate
[20:03:46] <wikibugs>	 (03CR) 10Scott French: "Thanks, Janis!" [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932) (owner: 10Scott French)
[20:05:36] <wikibugs>	 (03PS3) 10Scott French: kubernetes: alert on persistent unavailable replicas [alerts] - 10https://gerrit.wikimedia.org/r/1040242 (https://phabricator.wikimedia.org/T366932)
[20:10:15] <wikibugs>	 (03PS15) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718)
[20:14:35] <wikibugs>	 (03PS5) 10JHathaway: cephadm: template out cephadm spec files [puppet] - 10https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[20:15:47] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P64556 and previous config saved to /var/cache/conftool/dbconfig/20240610-201546-ladsgroup.json
[20:16:29] <wikibugs>	 (03PS2) 10Herron: admin: add radimer to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1036593 (https://phabricator.wikimedia.org/T365832) (owner: 10Cwhite)
[20:17:16] <wikibugs>	 (03CR) 10CI reject: [V:04-1] admin: add radimer to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1036593 (https://phabricator.wikimedia.org/T365832) (owner: 10Cwhite)
[20:17:45] <wikibugs>	 07Puppet, 06Infrastructure-Foundations: Puppetmaster volatile data not synced to all puppet frontends for a month and a half (2024-04-27 to 2024-06-10) - https://phabricator.wikimedia.org/T367113#9877330 (10CDanis)
[20:18:36] <wikibugs>	 (03CR) 10Dzahn: "recheck" [dns] - 10https://gerrit.wikimedia.org/r/1040144 (https://phabricator.wikimedia.org/T366649) (owner: 10Zabe)
[20:20:27] <wikibugs>	 (03PS16) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718)
[20:21:37] <wikibugs>	 (03PS1) 10CDanis: enable monitoring+logging for puppetmaster syncs [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113)
[20:21:58] <wikibugs>	 (03CR) 10CI reject: [V:04-1] enable monitoring+logging for puppetmaster syncs [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) (owner: 10CDanis)
[20:21:59] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) (owner: 10CDanis)
[20:22:34] <wikibugs>	 (03PS1) 10Herron: admin: add radimer to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1041218 (https://phabricator.wikimedia.org/T365832)
[20:22:48] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1041111 (https://phabricator.wikimedia.org/T357747) (owner: 10Filippo Giunchedi)
[20:23:51] <wikibugs>	 (03PS2) 10CDanis: enable monitoring+logging for puppetmaster syncs [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113)
[20:23:56] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) (owner: 10CDanis)
[20:24:05] <wikibugs>	 (03Abandoned) 10Herron: admin: add radimer to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1036593 (https://phabricator.wikimedia.org/T365832) (owner: 10Cwhite)
[20:24:11] <wikibugs>	 (03CR) 10CI reject: [V:04-1] enable monitoring+logging for puppetmaster syncs [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) (owner: 10CDanis)
[20:24:15] <wikibugs>	 (03CR) 10Dzahn: [C:04-1] "the expiry_contact and expiry_date should stay in there unless the manager or so states they aren't a contractor anymore" [puppet] - 10https://gerrit.wikimedia.org/r/1041218 (https://phabricator.wikimedia.org/T365832) (owner: 10Herron)
[20:24:46] <wikibugs>	 (03CR) 10Andrea Denisse: [C:03+1] "LGTM, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/1040170 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi)
[20:25:24] <wikibugs>	 (03PS2) 10Herron: admin: add radimer to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1041218 (https://phabricator.wikimedia.org/T365832)
[20:25:39] <wikibugs>	 (03CR) 10JHathaway: "Pushed a patch with a few suggestions. One option you might want to consider is converting Puppet data structures to yaml directly, rather" [puppet] - 10https://gerrit.wikimedia.org/r/1041163 (https://phabricator.wikimedia.org/T279621) (owner: 10MVernon)
[20:25:43] <wikibugs>	 (03PS3) 10CDanis: enable monitoring+logging for puppetmaster syncs [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113)
[20:25:45] <jinxer-wm>	 FIRING: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[20:25:50] <wikibugs>	 (03CR) 10CDanis: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) (owner: 10CDanis)
[20:26:03] <wikibugs>	 (03CR) 10CI reject: [V:04-1] enable monitoring+logging for puppetmaster syncs [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) (owner: 10CDanis)
[20:26:12] <wikibugs>	 (03CR) 10Herron: "thanks! updated in ps2" [puppet] - 10https://gerrit.wikimedia.org/r/1041218 (https://phabricator.wikimedia.org/T365832) (owner: 10Herron)
[20:26:42] <wikibugs>	 (03PS1) 10Santiago Faci: Metrics Platform Instrument Configurator: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041221 (https://phabricator.wikimedia.org/T366918)
[20:26:42] <wikibugs>	 (03PS17) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718)
[20:28:56] <wikibugs>	 (03CR) 10Dzahn: "So the email address changed from -ctr to no -ctr suffix. And I see it's actually like that in LDAP. That brings up the question.. have th" [puppet] - 10https://gerrit.wikimedia.org/r/1041218 (https://phabricator.wikimedia.org/T365832) (owner: 10Herron)
[20:29:22] <wikibugs>	 (03CR) 10Clare Ming: [C:03+2] Metrics Platform Instrument Configurator: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041221 (https://phabricator.wikimedia.org/T366918) (owner: 10Santiago Faci)
[20:30:02] <logmsgbot>	 !log ryankemper@cumin2002 START - Cookbook sre.wdqs.data-reload reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)
[20:30:07] <wikibugs>	 (03Merged) 10jenkins-bot: Metrics Platform Instrument Configurator: Deploying to staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041221 (https://phabricator.wikimedia.org/T366918) (owner: 10Santiago Faci)
[20:30:07] <stashbot>	 T349069: Design and implement a WDQS data-reload mechanism that sources its data from HDFS instead of the snapshot servers - https://phabricator.wikimedia.org/T349069
[20:30:25] <logmsgbot>	 !log ryankemper@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) reloading wikidata_full on wdqs2023.codfw.wmnet from DumpsSource.HDFS (hdfs:///wmf/discovery/wdqs-reload-cookbook-test-T349069/ using stat1009.eqiad.wmnet)
[20:30:54] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203', diff saved to https://phabricator.wikimedia.org/P64557 and previous config saved to /var/cache/conftool/dbconfig/20240610-203053-ladsgroup.json
[20:31:32] <wikibugs>	 (03PS4) 10CDanis: enable monitoring+logging for puppetmaster syncs [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113)
[20:36:10] <logmsgbot>	 !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/mpic-next: apply
[20:36:27] <logmsgbot>	 !log sfaci@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/mpic-next: apply
[20:36:52] <wikibugs>	 (03PS5) 10Cmelo: Configures the necessary user rights for CampaignEvents on swahili [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502)
[20:37:01] <wikibugs>	 (03PS5) 10CDanis: enable monitoring+logging for puppetmaster syncs [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113)
[20:37:14] <wikibugs>	 (03CR) 10Cmelo: Configures the necessary user rights for CampaignEvents on swahili (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041094 (https://phabricator.wikimedia.org/T366502) (owner: 10Cmelo)
[20:41:11] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9877425 (10Dzahn) 05Stalled→03Open
[20:43:31] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Rae Adimer - https://phabricator.wikimedia.org/T365832#9877420 (10herron) >>! In T365832#9830059, @elappen-WMF wrote: > Approving access from my end.  Hi @LMccabe @elappen-WMF we noticed when writing the pa...
[20:46:01] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2203 (T352010)', diff saved to https://phabricator.wikimedia.org/P64558 and previous config saved to /var/cache/conftool/dbconfig/20240610-204600-ladsgroup.json
[20:46:03] <logmsgbot>	 !log ladsgroup@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2216.codfw.wmnet with reason: Maintenance
[20:46:05] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[20:46:16] <logmsgbot>	 !log ladsgroup@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2216.codfw.wmnet with reason: Maintenance
[20:46:23] <logmsgbot>	 !log ladsgroup@cumin1002 dbctl commit (dc=all): 'Depooling db2216 (T352010)', diff saved to https://phabricator.wikimedia.org/P64559 and previous config saved to /var/cache/conftool/dbconfig/20240610-204622-ladsgroup.json
[20:46:26] <wikibugs>	 (03CR) 10Herron: [C:03+1] k8s: send logs to per-cluster kafka topics [puppet] - 10https://gerrit.wikimedia.org/r/1040170 (https://phabricator.wikimedia.org/T366710) (owner: 10Filippo Giunchedi)
[20:48:08] <wikibugs>	 (03PS18) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718)
[20:49:23] <wikibugs>	 (03CR) 10Herron: [C:03+1] titan: trim 5m retention to 70w [puppet] - 10https://gerrit.wikimedia.org/r/1041111 (https://phabricator.wikimedia.org/T357747) (owner: 10Filippo Giunchedi)
[20:56:24] <wikibugs>	 07Puppet, 06Infrastructure-Foundations: Install a default timeout for systemd::timer::jobs - https://phabricator.wikimedia.org/T367119 (10CDanis) 03NEW
[20:57:41] <wikibugs>	 (03PS1) 10Dzahn: admin: add rickijay to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/1041227 (https://phabricator.wikimedia.org/T365574)
[20:59:24] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting access to analytics-privatedata-users for rickijay - https://phabricator.wikimedia.org/T365574#9877476 (Dzahn) Open→In progress
[21:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: #bothumor I � Unicode. All rise for Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T2100).
[21:04:13] <wikibugs>	 (03PS19) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718)
[21:05:06] <wikibugs>	 (03PS1) 10Dzahn: remote iegreview.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1041229 (https://phabricator.wikimedia.org/T367011)
[21:05:21] <wikibugs>	 (03PS2) 10Dzahn: remove iegreview.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1041229 (https://phabricator.wikimedia.org/T367011)
[21:06:12] <wikibugs>	 (03PS3) 10Dzahn: remove iegreview.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1041229 (https://phabricator.wikimedia.org/T367011)
[21:09:21] <wikibugs>	 (03CR) 10Dzahn: [C:03+2] remove iegreview.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1041229 (https://phabricator.wikimedia.org/T367011) (owner: 10Dzahn)
[21:11:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T364069)', diff saved to https://phabricator.wikimedia.org/P64560 and previous config saved to /var/cache/conftool/dbconfig/20240610-211101-marostegui.json
[21:11:05] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[21:13:02] <wikibugs>	 SRE, DNS, Traffic, Patch-For-Review: Remove iegreview.wikimedia.org from DNS - https://phabricator.wikimedia.org/T367011#9877497 (Dzahn) Open→Resolved a:Dzahn thanks for reporting. removed.   Host iegreview.wikimedia.org not found: 3(NXDOMAIN)
[21:13:10] <wikibugs>	 (03PS1) 10Ahmon Dancy: Testing Gerrit.  Please Disregard [puppet] - 10https://gerrit.wikimedia.org/r/1041230
[21:17:38] <wikibugs>	 (03PS1) 10Eevans: sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1041231
[21:20:01] <wikibugs>	 (CR) Muehlenhoff: sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1041231 (owner: Eevans)
[21:20:31] <wikibugs>	 (03PS20) 10CDobbins: varnish: WIP [puppet] - 10https://gerrit.wikimedia.org/r/1040262 (https://phabricator.wikimedia.org/T354718)
[21:21:17] <wikibugs>	 (03CR) 10CI reject: [V:04-1] sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1041231 (owner: 10Eevans)
[21:21:29] <wikibugs>	 (03PS8) 10Jdlrobson: Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212)
[21:23:52] <wikibugs>	 (03PS1) 10EoghanGaffney: lists: Remove quickdatacopy and use our own rsyncd and systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706)
[21:23:54] <wikibugs>	 (03PS2) 10Eevans: sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1041231
[21:24:12] <wikibugs>	 (03CR) 10CI reject: [V:04-1] lists: Remove quickdatacopy and use our own rsyncd and systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[21:24:16] <wikibugs>	 (CR) Eevans: sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1041231 (owner: Eevans)
[21:25:22] <wikibugs>	 06SRE, 10SRE-Access-Requests, 13Patch-For-Review: Requesting access to analytics-privatedata-users for Rae Adimer - https://phabricator.wikimedia.org/T365832#9877582 (10elappen-WMF) Hello! Yes I can confirm the email change is correct and yes you can remove the expiry. Also if needed confirming the change in...
[21:25:23] <wikibugs>	 (03PS2) 10EoghanGaffney: lists: Remove quickdatacopy and use our own rsyncd and systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706)
[21:26:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P64561 and previous config saved to /var/cache/conftool/dbconfig/20240610-212608-marostegui.json
[21:27:40] <logmsgbot>	 !log eevans@cumin1002 START - Cookbook sre.cassandra.roll-reboot rolling reboot on A:restbase-eqiad
[21:28:42] <wikibugs>	 (03CR) 10Stoyofuku-wmf: [C:03+1] Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson)
[21:30:43] <wikibugs>	 (03CR) 10VolkerE: [C:03+1] Drop unused config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1031459 (https://phabricator.wikimedia.org/T301212) (owner: 10Jdlrobson)
[21:30:46] <jinxer-wm>	 FIRING: ProbeDown: Service restbase1028-c:9042 has failed probes (tcp_cassandra_c_cql_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#restbase1028-c:9042 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:33:45] <jinxer-wm>	 FIRING: [6x] ProbeDown: Service restbase1028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:34:42] <jinxer-wm>	 FIRING: [2x] RoutinatorRsyncErrors: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors
[21:34:57] <wikibugs>	 (03PS3) 10EoghanGaffney: lists: Remove quickdatacopy and use our own rsyncd and systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706)
[21:35:18] <wikibugs>	 (03CR) 10CI reject: [V:04-1] lists: Remove quickdatacopy and use our own rsyncd and systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[21:35:46] <jinxer-wm>	 RESOLVED: [6x] ProbeDown: Service restbase1028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:36:04] <wikibugs>	 06SRE, 10DNS, 06Traffic: Cleanup unused DNS subdomains - https://phabricator.wikimedia.org/T367012#9877639 (10Dzahn) cache.wikimedia.org goes so far back in history that I reached 2012 when using git blame and the change before that was made by root and isn't in gerrit anymore.   langcom.wikimedia.org - same...
[21:37:03] <wikibugs>	 (03PS4) 10EoghanGaffney: lists: Remove quickdatacopy and use our own rsyncd and systemd timer [puppet] - 10https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706)
[21:38:18] <wikibugs>	 (03CR) 10EoghanGaffney: [V:03+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/2868/co" [puppet] - 10https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) (owner: 10EoghanGaffney)
[21:39:30] <wikibugs>	 06SRE, 10DNS, 06Traffic: Cleanup unused DNS subdomains - https://phabricator.wikimedia.org/T367012#9877666 (10taavi) >>! In T367012#9877639, @Dzahn wrote: > langcom.wikimedia.org - same. It was already there in an initial import in 2011.  Apparently there once was a `langcomwiki` which was [[ https://gerrit....
[21:40:49] <wikibugs>	 (03PS1) 10Scott French: aqs-http-gateway: allow cross-DC Cassandra client connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851)
[21:41:16] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224', diff saved to https://phabricator.wikimedia.org/P64562 and previous config saved to /var/cache/conftool/dbconfig/20240610-214115-marostegui.json
[21:41:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C:03+1] "Looks good!" [cookbooks] - 10https://gerrit.wikimedia.org/r/1041231 (owner: 10Eevans)
[21:41:51] <wikibugs>	 06SRE, 10DNS, 06Traffic: Cleanup unused DNS subdomains - https://phabricator.wikimedia.org/T367012#9877668 (10Dzahn) pk.wikimedia.org was added in 2013 in https://gerrit.wikimedia.org/r/c/operations/dns/+/86650 to add a redirect but in 2023 the redirect was removed in https://gerrit.wikimedia.org/r/c/operati...
[21:43:45] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:44:29] <wikibugs>	 (CR) Volans: sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1041231 (owner: Eevans)
[21:48:23] <Reedy>	 jouncebot: nowandnext
[21:48:23] <jouncebot>	 For the next 1 hour(s) and 11 minute(s): Weekly Security deployment window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240610T2100)
[21:48:23] <jouncebot>	 In 4 hour(s) and 11 minute(s): Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240611T0200)
[21:48:45] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase1028-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:49:22] <wikibugs>	 (03PS1) 10Reedy: langlist-labs: Add bn and fr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041236 (https://phabricator.wikimedia.org/T367126)
[21:49:49] <wikibugs>	 (03PS2) 10Reedy: interwiki(-labs).php: De-duplicate and update from meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1040766 (https://phabricator.wikimedia.org/T365679)
[21:49:55] <wikibugs>	 (03CR) 10Reedy: [C:03+2] interwiki(-labs).php: De-duplicate and update from meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1040766 (https://phabricator.wikimedia.org/T365679) (owner: 10Reedy)
[21:50:41] <wikibugs>	 (03PS1) 10Dzahn: delete langcom.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1041237 (https://phabricator.wikimedia.org/T367012)
[21:50:42] <wikibugs>	 (CR) Volans: sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1041231 (owner: Eevans)
[21:53:17] <wikibugs>	 (03Merged) 10jenkins-bot: interwiki(-labs).php: De-duplicate and update from meta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1040766 (https://phabricator.wikimedia.org/T365679) (owner: 10Reedy)
[21:53:29] <wikibugs>	 (03PS2) 10Reedy: langlist-labs: Add bn and fr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041236 (https://phabricator.wikimedia.org/T367126)
[21:53:33] <wikibugs>	 (03CR) 10Reedy: [C:03+2] langlist-labs: Add bn and fr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041236 (https://phabricator.wikimedia.org/T367126) (owner: 10Reedy)
[21:55:14] <wikibugs>	 (03Merged) 10jenkins-bot: langlist-labs: Add bn and fr [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041236 (https://phabricator.wikimedia.org/T367126) (owner: 10Reedy)
[21:55:46] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1029-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:56:20] <wikibugs>	 (03PS1) 10Reedy: interwiki-labs.php: Update as per langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041238
[21:56:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1224 (T364069)', diff saved to https://phabricator.wikimedia.org/P64563 and previous config saved to /var/cache/conftool/dbconfig/20240610-215622-marostegui.json
[21:56:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[21:56:28] <stashbot>	 T364069: Rebuild pagelinks tables - https://phabricator.wikimedia.org/T364069
[21:56:38] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[21:57:41] <wikibugs>	 (03CR) 10Reedy: [C:03+2] interwiki-labs.php: Update as per langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041238 (owner: 10Reedy)
[21:58:22] <wikibugs>	 (03Merged) 10jenkins-bot: interwiki-labs.php: Update as per langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041238 (owner: 10Reedy)
[22:00:46] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase1029-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:04:05] <wikibugs>	 (03PS2) 10Scott French: aqs-http-gateway: allow cross-DC Cassandra client connection [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851)
[22:07:47] <wikibugs>	 (03PS1) 10Zabe: Add Apache configuration for u4c.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/1041240 (https://phabricator.wikimedia.org/T366649)
[22:08:45] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:10:33] <wikibugs>	 (03PS1) 10Zabe: Add u4cwiki to private_wikis [puppet] - 10https://gerrit.wikimedia.org/r/1041242 (https://phabricator.wikimedia.org/T366649)
[22:11:29] <wikibugs>	 (03CR) 10Scott French: "After discussion on T366851 and chatting with @brouberol@wikimedia.org earlier today, I think we're on the same page that this seems like " [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041235 (https://phabricator.wikimedia.org/T366851) (owner: 10Scott French)
[22:13:45] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase1030-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:14:10] <logmsgbot>	 !log reedy@deploy1002 Synchronized langlist-labs: Add fr and bn (duration: 14m 29s)
[22:18:36] <wikibugs>	 (03PS1) 10Hashar: wm-zuul-status: fix reload button [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1041243 (https://phabricator.wikimedia.org/T360550)
[22:19:27] <wikibugs>	 (03PS1) 10Dzahn: delete pk.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/1041245 (https://phabricator.wikimedia.org/T367012)
[22:19:59] <wikibugs>	 (03CR) 10Dzahn: "Good catch, taavi" [dns] - 10https://gerrit.wikimedia.org/r/1041237 (https://phabricator.wikimedia.org/T367012) (owner: 10Dzahn)
[22:20:07] <wikibugs>	 (03CR) 10Hashar: "I have tried it by copy pasting in the the browser console:" [software/gerrit] (deploy/wmf/stable-3.9) - 10https://gerrit.wikimedia.org/r/1041243 (https://phabricator.wikimedia.org/T360550) (owner: 10Hashar)
[22:20:46] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:21:16] <wikibugs>	 (03Abandoned) 10Zabe: trafficserver: Move test-commons to mw-on-k8s [puppet] - 10https://gerrit.wikimedia.org/r/1034106 (owner: 10Zabe)
[22:21:19] <wikibugs>	 (03CR) 10Dzahn: [C:03+1] "lgtm, we traced this back to when we converted cron jobs to systemd timers. I don't fully remember but I think we just didn't turn on moni" [puppet] - 10https://gerrit.wikimedia.org/r/1041217 (https://phabricator.wikimedia.org/T367113) (owner: 10CDanis)
[22:23:46] <wikibugs>	 (03PS1) 10DCausse: cirrus-streaming-updater: use new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041246
[22:24:51] <wikibugs>	 (03CR) 10DCausse: [C:03+2] cirrus-streaming-updater: use new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041246 (owner: 10DCausse)
[22:25:14] <logmsgbot>	 !log reedy@deploy1002 Synchronized wmf-config/: sync interwiki lists (duration: 10m 07s)
[22:25:46] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:25:51] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus-streaming-updater: use new image [deployment-charts] - 10https://gerrit.wikimedia.org/r/1041246 (owner: 10DCausse)
[22:27:25] <jinxer-wm>	 FIRING: SystemdUnitFailed: php7.4-fpm_check_restart.service on mw1489:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[22:27:45] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[22:28:03] <logmsgbot>	 !log dcausse@deploy1002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:28:45] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase1031-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:30:37] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] START helmfile.d/services/cirrus-streaming-updater: apply
[22:30:51] <logmsgbot>	 !log dcausse@deploy1002 helmfile [eqiad] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:35:46] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:36:10] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] START helmfile.d/services/cirrus-streaming-updater: apply
[22:36:19] <logmsgbot>	 !log dcausse@deploy1002 helmfile [codfw] DONE helmfile.d/services/cirrus-streaming-updater: apply
[22:38:45] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase1032-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:40:45] <jinxer-wm>	 RESOLVED: CirrusStreamingUpdaterFlinkJobUnstable: cirrus_streaming_updater_producer_eqiad in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Search#Streaming_Updater - https://grafana.wikimedia.org/d/K9x0c4aVk/flink-app?var-datasource=eqiad+prometheus%2Fk8s&var-namespace=cirrus-streaming-updater&var-helm_release=producer - https://alerts.wikimedia.org/?q=alertname%3DCirrusStreamingUpdaterFlinkJobUnstable
[22:46:53] <wikibugs>	 (CR) Dzahn: lists: Remove quickdatacopy and use our own rsyncd and systemd timer (1 comment) [puppet] - https://gerrit.wikimedia.org/r/1041232 (https://phabricator.wikimedia.org/T331706) (owner: EoghanGaffney)
[22:48:45] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:53:45] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase1033-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:55:31] <wikibugs>	 06SRE, 10Language-Technical Support, 06serviceops, 10Wikimedia-Site-requests, 13Patch-For-Review: Change $wgMaxArticleSize limit from byte-based to character-based - https://phabricator.wikimedia.org/T275319#9877825 (10stjn) While discussing performance issues on Discord, I looked at https://he.wikisourc...
[22:55:46] <jinxer-wm>	 FIRING: JobUnavailable: Reduced availability for job ganeti in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[23:00:46] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:03:25] <wikibugs>	 (03PS1) 10Pppery: MediaWiki.org: restrict unfuzzy rights to autoconfirmed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041249 (https://phabricator.wikimedia.org/T366994)
[23:05:46] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:08:45] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase1034-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:12:01] <wikibugs>	 (03PS1) 10Jdlrobson: Enable dark mode on data table pages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041250 (https://phabricator.wikimedia.org/T366373)
[23:13:45] <jinxer-wm>	 FIRING: [11x] ProbeDown: Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:15:46] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:20:46] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase1035-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:23:45] <jinxer-wm>	 FIRING: SystemdUnitFailed: generate_vrts_aliases.service on mx2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:28:45] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:31:27] <wikibugs>	 (03PS3) 10Eevans: sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases [cookbooks] - 10https://gerrit.wikimedia.org/r/1041231
[23:32:05] <wikibugs>	 (CR) Eevans: sre.cassandra.roll-reboot: Add missing Cassandra cluster aliases (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/1041231 (owner: Eevans)
[23:33:45] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase1036-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:38:16] <wikibugs>	 (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1041254
[23:38:16] <wikibugs>	 (03CR) 10TrainBranchBot: [C:03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/1041254 (owner: 10TrainBranchBot)
[23:40:46] <jinxer-wm>	 FIRING: [7x] ProbeDown: Service restbase1037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:41:56] <wikibugs>	 (03PS1) 10Reedy: Remove old wgAbuseFilterActorTableSchemaMigrationStage config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041255 (https://phabricator.wikimedia.org/T188180)
[23:42:42] <wikibugs>	 (03CR) 10Reedy: [C:04-2] "Not yet; T188180#9877744 and I86ec2b816eed17b62bf02bfd085570f132011b3e to ride the train and become stable" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/1041255 (https://phabricator.wikimedia.org/T188180) (owner: 10Reedy)
[23:43:45] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:48:45] <jinxer-wm>	 RESOLVED: [12x] ProbeDown: Service restbase1037-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[23:52:14] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[23:52:27] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1009.eqiad.wmnet with reason: Maintenance
[23:55:46] <jinxer-wm>	 FIRING: [12x] ProbeDown: Service restbase1038-a:7000 has failed probes (tcp_cassandra_a_ssl_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown