[00:09:25] <logmsgbot>	 !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1006.eqiad.wmnet
[00:16:11] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:28] <logmsgbot>	 !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1006.eqiad.wmnet
[00:20:35] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:20] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/979450
[00:38:26] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/979450 (owner: TrainBranchBot)
[00:40:00] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[00:49:24] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:58:49] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/979450 (owner: TrainBranchBot)
[01:23:52] <wikibugs>	 SRE, ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[02:01:05] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:39:04] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:49:49] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[02:55:45] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[03:00:18] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:00:44] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[03:07:37] <icinga-wm>	 PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[03:09:04] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:28:53] <wikibugs>	 (CR) KartikMistry: [C: +2] Update MinT to 2023-11-21-115852-production [deployment-charts] - https://gerrit.wikimedia.org/r/978727 (owner: KartikMistry)
[03:30:00] <wikibugs>	 (Merged) jenkins-bot: Update MinT to 2023-11-21-115852-production [deployment-charts] - https://gerrit.wikimedia.org/r/978727 (owner: KartikMistry)
[03:30:01] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:34:11] <icinga-wm>	 RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[03:36:49] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:16:07] <icinga-wm>	 PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:26:25] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:29:15] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:40:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:43:37] <ryankemper>	 !log [WDQS] Clearing `BlazegraphFreeAllocatorsDecreasingRapidly` -> `ryankemper@wdqs1007:~$ sudo systemctl restart wdqs-blazegraph`
[04:43:39] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:24] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:50:01] <jinxer-wm>	 (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:37:15] <icinga-wm>	 PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - No response from remote host 185.15.59.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:54:33] <wikibugs>	 (PS1) KartikMistry: Update cxserver to 2023-12-04-055024-production [deployment-charts] - https://gerrit.wikimedia.org/r/979487 (https://phabricator.wikimedia.org/T270060)
[05:55:56] <kart_>	 ^ Deploying cxserver..
[05:56:43] <wikibugs>	 (CR) KartikMistry: [C: +2] Update cxserver to 2023-12-04-055024-production [deployment-charts] - https://gerrit.wikimedia.org/r/979487 (https://phabricator.wikimedia.org/T270060) (owner: KartikMistry)
[05:57:46] <wikibugs>	 (Merged) jenkins-bot: Update cxserver to 2023-12-04-055024-production [deployment-charts] - https://gerrit.wikimedia.org/r/979487 (https://phabricator.wikimedia.org/T270060) (owner: KartikMistry)
[05:58:46] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:59:21] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:02:51] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:03:30] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:05:43] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:06:14] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:08:03] <kart_>	 !log Updated cxserver to 2023-12-04-055024-production (T270060, T350773, T352620)
[06:08:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:10] <stashbot>	 T270060: Package apertium-fra-frp (French-Arpitan) - https://phabricator.wikimedia.org/T270060
[06:08:10] <stashbot>	 T350773: Remove preq and use node fetch - https://phabricator.wikimedia.org/T350773
[06:08:10] <stashbot>	 T352620: Failure to start new translations - https://phabricator.wikimedia.org/T352620
[06:11:58] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[06:12:10] <kart_>	 Minor deployment for MinT too ^^
[06:14:09] <wikibugs>	 (PS1) Zoranzoki21: Revert "throttle.php: Cleanup old rules, add new one" [mediawiki-config] - https://gerrit.wikimedia.org/r/979224 (https://phabricator.wikimedia.org/T352569)
[06:14:53] <logmsgbot>	 !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[06:15:37] <wikibugs>	 (PS1) Giuseppe Lavagetto: Add throttle rule for editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/979488 (https://phabricator.wikimedia.org/T352569)
[06:15:45] <wikibugs>	 (PS2) Zoranzoki21: Revert "throttle.php: Cleanup old rules, add new one" [mediawiki-config] - https://gerrit.wikimedia.org/r/979224 (https://phabricator.wikimedia.org/T352569)
[06:16:07] <wikibugs>	 (Abandoned) Zoranzoki21: Revert "throttle.php: Cleanup old rules, add new one" [mediawiki-config] - https://gerrit.wikimedia.org/r/979224 (https://phabricator.wikimedia.org/T352569) (owner: Zoranzoki21)
[06:17:53] <wikibugs>	 (CR) Zoranzoki21: [C: -1] Add throttle rule for editathon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/979488 (https://phabricator.wikimedia.org/T352569) (owner: Giuseppe Lavagetto)
[06:22:59] <jinxer-wm>	 (PuppetZeroResources) firing: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:28:08] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[06:31:04] <wikibugs>	 (CR) Anzx: [C: -1] Add throttle rule for editathon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/979488 (https://phabricator.wikimedia.org/T352569) (owner: Giuseppe Lavagetto)
[06:33:49] <logmsgbot>	 !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[06:35:51] <wikibugs>	 (CR) Giuseppe Lavagetto: Add throttle rule for editathon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/979488 (https://phabricator.wikimedia.org/T352569) (owner: Giuseppe Lavagetto)
[06:37:59] <jinxer-wm>	 (PuppetZeroResources) resolved: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:38:20] <wikibugs>	 (PS2) Giuseppe Lavagetto: Add throttle rule for editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/979488 (https://phabricator.wikimedia.org/T352569)
[06:42:00] <wikibugs>	 (CR) Zoranzoki21: [C: +1] Add throttle rule for editathon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/979488 (https://phabricator.wikimedia.org/T352569) (owner: Giuseppe Lavagetto)
[06:44:51] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[06:46:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2135,2160].codfw.wmnet,db[1119,1176,1217].eqiad.wmnet with reason: m5 master switch T352505
[06:47:00] <stashbot>	 T352505: Switchover m5 master db1176 -> db1119 - https://phabricator.wikimedia.org/T352505
[06:47:15] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2135,2160].codfw.wmnet,db[1119,1176,1217].eqiad.wmnet with reason: m5 master switch T352505
[06:49:00] <wikibugs>	 (PS1) Marostegui: mariadb: Promote db1119 to m5 master [puppet] - https://gerrit.wikimedia.org/r/979489 (https://phabricator.wikimedia.org/T352505)
[06:49:49] <logmsgbot>	 !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[06:50:14] <wikibugs>	 (PS4) Marostegui: parsercachepurging.pp: Increase retention back to 30 days [puppet] - https://gerrit.wikimedia.org/r/877205 (https://phabricator.wikimedia.org/T280604)
[06:51:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:52:42] <wikibugs>	 (CR) Krinkle: "should pc4 use the same expiry?" [puppet] - https://gerrit.wikimedia.org/r/877205 (https://phabricator.wikimedia.org/T280604) (owner: Marostegui)
[06:53:02] <wikibugs>	 (CR) Marostegui: [C: +2] mariadb: Promote db1119 to m5 master [puppet] - https://gerrit.wikimedia.org/r/979489 (https://phabricator.wikimedia.org/T352505) (owner: Marostegui)
[06:53:11] <chlod>	 multiple people on the Help desk reporting that they're getting Rdbms errors when editing https://en.wikipedia.org/wiki/Wikipedia:Help_desk
[06:55:51] <wikibugs>	 (CR) Marostegui: "Good point Timo! It will!" [puppet] - https://gerrit.wikimedia.org/r/877205 (https://phabricator.wikimedia.org/T280604) (owner: Marostegui)
[06:56:07] <chlod>	 something about the write duration exceeding a 3 second limit, someone opened a task at https://phabricator.wikimedia.org/T352628
[06:56:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:57:07] <marostegui>	 chlod: I guess a write taking more than 3 seconds
[06:57:20] <marostegui>	 !log Failover m5 from db1176 to db1119 - T332155
[06:57:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:24] <stashbot>	 T332155: Switchover m5 master (db1106 -> db1176) - https://phabricator.wikimedia.org/T332155
[07:00:02] <wikibugs>	 (PS5) Marostegui: parsercachepurging.pp: Increase retention back to 30 days [puppet] - https://gerrit.wikimedia.org/r/877205 (https://phabricator.wikimedia.org/T280604)
[07:00:18] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:00:44] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[07:01:14] <wikibugs>	 (PS1) Marostegui: db1176: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/979490 (https://phabricator.wikimedia.org/T352361)
[07:02:01] <wikibugs>	 (CR) Marostegui: [C: +2] db1176: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/979490 (https://phabricator.wikimedia.org/T352361) (owner: Marostegui)
[07:03:27] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1176.eqiad.wmnet with OS bookworm
[07:07:33] <kart_>	 !log Updated MinT to 2023-11-21-115852-production
[07:07:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:40] <kart_>	 Forgot to log earlier ^^
[07:10:05] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:15:20] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1176.eqiad.wmnet with reason: host reimage
[07:16:21] <wikibugs>	 (PS1) Marostegui: Revert "db1176: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/979225
[07:16:28] <wikibugs>	 (CR) Marostegui: [C: -2] "Not yet" [puppet] - https://gerrit.wikimedia.org/r/979225 (owner: Marostegui)
[07:18:23] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1176.eqiad.wmnet with reason: host reimage
[07:19:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:24:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:31:32] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1176.eqiad.wmnet with OS bookworm
[07:32:14] <wikibugs>	 (CR) Marostegui: Revert "db1176: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/979225 (owner: Marostegui)
[07:32:17] <wikibugs>	 (CR) Marostegui: [C: +2] Revert "db1176: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/979225 (owner: Marostegui)
[07:33:29] <wikibugs>	 (CR) Marostegui: [C: +2] parsercachepurging.pp: Increase retention back to 30 days [puppet] - https://gerrit.wikimedia.org/r/877205 (https://phabricator.wikimedia.org/T280604) (owner: Marostegui)
[07:39:26] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[07:39:51] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[07:39:58] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T348183)', diff saved to https://phabricator.wikimedia.org/P54062 and previous config saved to /var/cache/conftool/dbconfig/20231204-073957-arnaudb.json
[07:40:02] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[07:42:36] <icinga-wm>	 PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:42:39] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T348183)', diff saved to https://phabricator.wikimedia.org/P54063 and previous config saved to /var/cache/conftool/dbconfig/20231204-074238-arnaudb.json
[07:53:34] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:07] <wikibugs>	 (PS1) Marostegui: dbproxy1022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/979676 (https://phabricator.wikimedia.org/T351864)
[07:54:37] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bookworm
[07:54:40] <wikibugs>	 (CR) Marostegui: [C: +2] dbproxy1022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/979676 (https://phabricator.wikimedia.org/T351864) (owner: Marostegui)
[07:55:42] <jinxer-wm>	 (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:57:46] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P54064 and previous config saved to /var/cache/conftool/dbconfig/20231204-075745-arnaudb.json
[08:00:04] <jouncebot>	 Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T0800).
[08:00:05] <jouncebot>	 _joe_ and aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:12] <wikibugs>	 (PS4) Anzx: hewikivoyage: add tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/979686 (https://phabricator.wikimedia.org/T351981)
[08:00:15] <wikibugs>	 (PS2) Anzx: azwiki: Enable $wgMinervaEnableSiteNotice [mediawiki-config] - https://gerrit.wikimedia.org/r/979223 (https://phabricator.wikimedia.org/T352621)
[08:00:17] <wikibugs>	 (PS3) Anzx: trwikivoyage: update wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/978522 (https://phabricator.wikimedia.org/T352329)
[08:02:54] <_joe_>	 o/
[08:03:15] <_joe_>	 I'm happy to merge my own patch, I can't be the general deployer though
[08:04:10] <_joe_>	 urbanecm: around?
[08:04:18] <urbanecm>	 yes
[08:04:19] <_joe_>	 or Amir1
[08:04:21] <_joe_>	 ack
[08:04:23] <urbanecm>	 'morning everyone
[08:04:30] <_joe_>	 good morning :)
[08:04:43] <_joe_>	 I'll go and merge this change for the naughty editathoners
[08:04:51] <anzx>	 o/
[08:05:01] <_joe_>	 who opened the throttle request on late friday evening for monday :P
[08:05:26] <urbanecm>	 heh
[08:05:43] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/979488 (https://phabricator.wikimedia.org/T352569) (owner: Giuseppe Lavagetto)
[08:05:46] <wikibugs>	 (PS1) Muehlenhoff: ganeti: Switch eqiad to PKI [puppet] - https://gerrit.wikimedia.org/r/979838 (https://phabricator.wikimedia.org/T350686)
[08:06:04] <_joe_>	 urbanecm: I'm trying to be lenient but yeah...
[08:06:30] <wikibugs>	 (Merged) jenkins-bot: Add throttle rule for editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/979488 (https://phabricator.wikimedia.org/T352569) (owner: Giuseppe Lavagetto)
[08:07:27] <logmsgbot>	 !log oblivian@deploy2002 Started scap: Backport for [[gerrit:979488|Add throttle rule for editathon (T352569)]]
[08:07:31] <stashbot>	 T352569: Lift IP cap on 2023-12-04 for Editathon for commonswiki and eswiki - https://phabricator.wikimedia.org/T352569
[08:08:26] <urbanecm>	 _joe_: fwiw, the official guidelines (https://meta.wikimedia.org/wiki/Mass_account_creation#Requesting_temporary_lift_of_IP_cap) say "two weeks in advance" :-/
[08:08:49] <_joe_>	 urbanecm: https://phabricator.wikimedia.org/T352569#9377671
[08:08:51] <_joe_>	 :P
[08:09:14] <_joe_>	 urbanecm: I even have to run a script, ofc
[08:09:24] <_joe_>	 because we can't ever have nice things
[08:09:39] <icinga-wm>	 PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:10:07] <urbanecm>	 yeah... and hope the IP info is okay.
[08:10:44] <jinxer-wm>	 (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:10:52] <logmsgbot>	 !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1022.eqiad.wmnet with OS bookworm
[08:11:26] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bookworm
[08:12:52] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P54065 and previous config saved to /var/cache/conftool/dbconfig/20231204-081251-arnaudb.json
[08:17:27] <logmsgbot>	 !log oblivian@deploy2002 oblivian: Backport for [[gerrit:979488|Add throttle rule for editathon (T352569)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:17:30] <stashbot>	 T352569: Lift IP cap on 2023-12-04 for Editathon for commonswiki and eswiki - https://phabricator.wikimedia.org/T352569
[08:17:31] <wikibugs>	 (PS7) Elukey: changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950)
[08:18:33] <logmsgbot>	 !log oblivian@deploy2002 oblivian: Continuing with sync
[08:19:27] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:19:52] <wikibugs>	 (CR) Muehlenhoff: [C: +2] ganeti: Switch eqiad to PKI [puppet] - https://gerrit.wikimedia.org/r/979838 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff)
[08:21:41] <_joe_>	 urbanecm: I'm almost done; building the image took a long time but almost everything is unusually slow
[08:23:01] <urbanecm>	 Ack
[08:23:12] <jinxer-wm>	 (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:23:15] <_joe_>	 !log clearing throttle cache for T352569
[08:23:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:19] <stashbot>	 T352569: Lift IP cap on 2023-12-04 for Editathon for commonswiki and eswiki - https://phabricator.wikimedia.org/T352569
[08:24:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM moscovium.eqiad.wmnet
[08:25:32] <logmsgbot>	 !log oblivian@deploy2002 Finished scap: Backport for [[gerrit:979488|Add throttle rule for editathon (T352569)]] (duration: 18m 04s)
[08:26:59] <_joe_>	 urbanecm: I'm done
[08:27:05] <urbanecm>	 ack
[08:27:12] <urbanecm>	 anzx: hi, still around? :)
[08:27:17] <anzx>	 Ues
[08:27:20] <anzx>	 Yes
[08:27:30] <wikibugs>	 (PS5) Urbanecm: hewikivoyage: add tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/979686 (https://phabricator.wikimedia.org/T351981) (owner: Anzx)
[08:27:35] <wikibugs>	 (PS3) Urbanecm: azwiki: Enable $wgMinervaEnableSiteNotice [mediawiki-config] - https://gerrit.wikimedia.org/r/979223 (https://phabricator.wikimedia.org/T352621) (owner: Anzx)
[08:27:39] <wikibugs>	 (CR) Urbanecm: [C: +2] hewikivoyage: add tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/979686 (https://phabricator.wikimedia.org/T351981) (owner: Anzx)
[08:27:44] <wikibugs>	 (CR) Urbanecm: [C: +2] azwiki: Enable $wgMinervaEnableSiteNotice [mediawiki-config] - https://gerrit.wikimedia.org/r/979223 (https://phabricator.wikimedia.org/T352621) (owner: Anzx)
[08:27:50] <wikibugs>	 (PS4) Urbanecm: trwikivoyage: update wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/978522 (https://phabricator.wikimedia.org/T352329) (owner: Anzx)
[08:27:54] <wikibugs>	 (CR) Urbanecm: [C: +2] trwikivoyage: update wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/978522 (https://phabricator.wikimedia.org/T352329) (owner: Anzx)
[08:27:59] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T348183)', diff saved to https://phabricator.wikimedia.org/P54066 and previous config saved to /var/cache/conftool/dbconfig/20231204-082758-arnaudb.json
[08:28:01] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[08:28:03] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[08:28:15] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[08:28:34] <wikibugs>	 (Merged) jenkins-bot: hewikivoyage: add tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/979686 (https://phabricator.wikimedia.org/T351981) (owner: Anzx)
[08:28:38] <wikibugs>	 (Merged) jenkins-bot: azwiki: Enable $wgMinervaEnableSiteNotice [mediawiki-config] - https://gerrit.wikimedia.org/r/979223 (https://phabricator.wikimedia.org/T352621) (owner: Anzx)
[08:28:46] <wikibugs>	 (Merged) jenkins-bot: trwikivoyage: update wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/978522 (https://phabricator.wikimedia.org/T352329) (owner: Anzx)
[08:28:53] <urbanecm>	 let's send it out
[08:29:32] <wikibugs>	 (PS1) David Caro: codfw1dev: add smart_hosts [puppet] - https://gerrit.wikimedia.org/r/979888 (https://phabricator.wikimedia.org/T350008)
[08:29:44] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:979686|hewikivoyage: add tagline (T351981)]], [[gerrit:979223|azwiki: Enable $wgMinervaEnableSiteNotice (T352621)]], [[gerrit:978522|trwikivoyage: update wordmark (T352329)]]
[08:29:50] <stashbot>	 T351981: Change Hebrew Wikivoyage wordmark logo - https://phabricator.wikimedia.org/T351981
[08:29:51] <stashbot>	 T352621: Enable $wgMinervaEnableSiteNotice for azwiki - https://phabricator.wikimedia.org/T352621
[08:29:51] <stashbot>	 T352329: Remove logo from Turkish Wikivoyage wordmark - https://phabricator.wikimedia.org/T352329
[08:30:42] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[08:30:43] <wikibugs>	 (03PS2) 10David Caro: codfw1dev: add smart_hosts [puppet] - 10https://gerrit.wikimedia.org/r/979888 (https://phabricator.wikimedia.org/T350008)
[08:30:56] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[08:31:03] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54067 and previous config saved to /var/cache/conftool/dbconfig/20231204-083102-arnaudb.json
[08:31:05] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm and anzx: Backport for [[gerrit:979686|hewikivoyage: add tagline (T351981)]], [[gerrit:979223|azwiki: Enable $wgMinervaEnableSiteNotice (T352621)]], [[gerrit:978522|trwikivoyage: update wordmark (T352329)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:31:12] <anzx>	 Checking
[08:31:13] <urbanecm>	 anzx: please test at the debug servers
[08:31:14] <urbanecm>	 ty
[08:32:26] <wikibugs>	 (03PS8) 10Elukey: changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950)
[08:32:52] <anzx>	 urbanecm: looks good 
[08:33:23] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm and anzx: Continuing with sync
[08:35:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54068 and previous config saved to /var/cache/conftool/dbconfig/20231204-083534-arnaudb.json
[08:35:39] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[08:38:12] <wikibugs>	 (03CR) 10Elukey: "Hugh/Joe: Tried to refactor another time the charts, lemme know if you like it or not :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[08:39:17] <wikibugs>	 (03PS3) 10David Caro: codfw1dev: add smart_hosts [puppet] - 10https://gerrit.wikimedia.org/r/979888 (https://phabricator.wikimedia.org/T350008)
[08:39:34] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:979686|hewikivoyage: add tagline (T351981)]], [[gerrit:979223|azwiki: Enable $wgMinervaEnableSiteNotice (T352621)]], [[gerrit:978522|trwikivoyage: update wordmark (T352329)]] (duration: 09m 49s)
[08:39:37] <urbanecm>	 anzx: done
[08:39:39] <stashbot>	 T351981: Change Hebrew Wikivoyage wordmark logo - https://phabricator.wikimedia.org/T351981
[08:39:40] <stashbot>	 T352621: Enable $wgMinervaEnableSiteNotice for azwiki - https://phabricator.wikimedia.org/T352621
[08:39:40] <stashbot>	 T352329: Remove logo from Turkish Wikivoyage wordmark - https://phabricator.wikimedia.org/T352329
[08:41:11] <wikibugs>	 (03CR) 10David Caro: "Tested on codfw by cherry-picking on the local puppetmaster and running on a VM (etcd-discovery-2.cloudinfra-codfw1dev.codfw1dev.wikimedia" [puppet] - 10https://gerrit.wikimedia.org/r/979888 (https://phabricator.wikimedia.org/T350008) (owner: 10David Caro)
[08:43:00] <elukey>	 !log upgrade istio (buster -> bullseye) on dse-k8s-eqiad - T351933
[08:43:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:10] <stashbot>	 T351933: Bump istio Docker images to Bookworm - https://phabricator.wikimedia.org/T351933
[08:43:12] <jinxer-wm>	 (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:43:43] <anzx>	 urbanecm: thanks logo seems to appear
[08:44:15] <anzx>	 correctly
[08:44:41] <icinga-wm>	 PROBLEM - Host moscovium is DOWN: PING CRITICAL - Packet loss = 100%
[08:44:59] <urbanecm>	 yay
[08:45:49] <logmsgbot>	 !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1022.eqiad.wmnet with OS bookworm
[08:46:19] <icinga-wm>	 RECOVERY - Host moscovium is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms
[08:48:09] <elukey>	 !log upgrade istio (buster -> bullseye) on aux-k8s-eqiad - T351933
[08:48:12] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:48:14] <stashbot>	 T351933: Bump istio Docker images to Bookworm - https://phabricator.wikimedia.org/T351933
[08:49:04] <jinxer-wm>	 (ProbeDown) resolved: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[08:49:05] <icinga-wm>	 RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:49:23] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:49:24] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[08:50:09] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM moscovium.eqiad.wmnet
[08:50:34] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bookworm
[08:50:41] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P54069 and previous config saved to /var/cache/conftool/dbconfig/20231204-085041-arnaudb.json
[08:53:13] <icinga-wm>	 PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:55:05] <wikibugs>	 10sre-alert-triage, 10Infrastructure-Foundations: Alert triage: overdue alert [warning] Systemd units failing on debmonitor2003 - https://phabricator.wikimedia.org/T343897 (10LSobanski) Updating as this alert came up on the overdue list again.
[08:58:40] <elukey>	 !log upgrade istio (buster -> bullseye) on ml-serve-eqiad - T351933
[08:58:43] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:44] <stashbot>	 T351933: Bump istio Docker images to Bookworm - https://phabricator.wikimedia.org/T351933
[09:00:07] <icinga-wm>	 RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:02:23] <wikibugs>	 (03PS1) 10Muehlenhoff: ganeti: Configure eqiad/test for PKI [puppet] - 10https://gerrit.wikimedia.org/r/979890 (https://phabricator.wikimedia.org/T350686)
[09:05:48] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P54070 and previous config saved to /var/cache/conftool/dbconfig/20231204-090547-arnaudb.json
[09:07:31] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:09:42] <jinxer-wm>	 (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:14:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:17:47] <icinga-wm>	 PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:19:42] <jinxer-wm>	 (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:20:08] <wikibugs>	 (03PS1) 10Brouberol: Define a DNS A record for the dse k8s ingress gateway [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639)
[09:20:10] <wikibugs>	 (03PS1) 10Brouberol: Enable ingress for the spark-history server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/979892 (https://phabricator.wikimedia.org/T352639)
[09:20:54] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54072 and previous config saved to /var/cache/conftool/dbconfig/20231204-092054-arnaudb.json
[09:20:58] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[09:20:58] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[09:21:11] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[09:21:12] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[09:21:30] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[09:21:37] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T348183)', diff saved to https://phabricator.wikimedia.org/P54073 and previous config saved to /var/cache/conftool/dbconfig/20231204-092136-arnaudb.json
[09:22:57] <wikibugs>	 (03PS1) 10MVernon: Swift: Set new-style storage for ms-be1076-89,ms-be2080-9 [puppet] - 10https://gerrit.wikimedia.org/r/979893 (https://phabricator.wikimedia.org/T349840)
[09:23:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] ganeti: Configure eqiad/test for PKI [puppet] - 10https://gerrit.wikimedia.org/r/979890 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff)
[09:24:52] <wikibugs>	 (03CR) 10Elukey: [C: 04-1] "Please check what was done for other k8s ingress services (for example k8s-ingress-aux). This needs to be a new LVS service:" [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[09:26:01] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T348183)', diff saved to https://phabricator.wikimedia.org/P54074 and previous config saved to /var/cache/conftool/dbconfig/20231204-092600-arnaudb.json
[09:26:05] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[09:26:53] <wikibugs>	 (03CR) 10Arnaudb: [V: 03+1 C: 03+1] Swift: Set new-style storage for ms-be1076-89,ms-be2080-9 [puppet] - 10https://gerrit.wikimedia.org/r/979893 (https://phabricator.wikimedia.org/T349840) (owner: 10MVernon)
[09:28:43] <wikibugs>	 (03CR) 10Brouberol: "Ah, I see. We define an LVS-ed service with a reserved IP, and that's the IP being resolved by the DNS record, not the k8s service DNS. Th" [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[09:29:57] <wikibugs>	 (03CR) 10MVernon: [C: 03+2] Swift: Set new-style storage for ms-be1076-89,ms-be2080-9 [puppet] - 10https://gerrit.wikimedia.org/r/979893 (https://phabricator.wikimedia.org/T349840) (owner: 10MVernon)
[09:34:42] <jinxer-wm>	 (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:35:29] <icinga-wm>	 RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:35:36] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10MatthewVernon) @Jclark-ctr sorry, there are some puppet changes that have to be made before new ms-be* nodes will install cleanly, which is why those nodes failed on Friday....
[09:35:55] <icinga-wm>	 RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[09:36:14] <elukey>	 !log upgrade istio (buster -> bullseye) on ml-serve-codfw - T351933
[09:36:17] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:18] <stashbot>	 T351933: Bump istio Docker images to Bookworm - https://phabricator.wikimedia.org/T351933
[09:41:08] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P54075 and previous config saved to /var/cache/conftool/dbconfig/20231204-094107-arnaudb.json
[09:44:42] <wikibugs>	 (03PS1) 10Muehlenhoff: ganeti: Remove non-PKI code for RAPI access [puppet] - 10https://gerrit.wikimedia.org/r/979897 (https://phabricator.wikimedia.org/T350686)
[09:47:09] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] k8s: allow setting prometheus retention in cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/977687 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi)
[09:49:20] <logmsgbot>	 !log volans@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1022.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[09:50:47] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979897 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff)
[09:52:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: set 850GB retention for prometheus@k8s [puppet] - 10https://gerrit.wikimedia.org/r/977688 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi)
[09:56:14] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P54076 and previous config saved to /var/cache/conftool/dbconfig/20231204-095614-arnaudb.json
[09:57:39] <godog>	 !log roll-restart prometheus/k8s to apply size-based retention - T351179
[09:57:40] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+1] add wmf-debci image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/979355 (https://phabricator.wikimedia.org/T352003) (owner: 10Jelto)
[09:57:42] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:57:43] <stashbot>	 T351179: LVM vg0 close to getting full on prometheus eqiad - https://phabricator.wikimedia.org/T351179
[09:58:27] <logmsgbot>	 !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy1022.mgmt.eqiad.wmnet with reboot policy GRACEFUL
[09:59:05] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1022.eqiad.wmnet with reason: host reimage
[09:59:54] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10mfossati) >>! In T350020#9376060, @jcrespo wrote: >>>! In T350020#9375684, @mfossati wrote: >> @jcrespo , would it be possible...
[10:00:29] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10mfossati) Also CC @fkaelin .
[10:02:56] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1022.eqiad.wmnet with reason: host reimage
[10:08:53] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: adjust prometheus k8s retention to current utilization [puppet] - 10https://gerrit.wikimedia.org/r/979898 (https://phabricator.wikimedia.org/T351179)
[10:11:21] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T348183)', diff saved to https://phabricator.wikimedia.org/P54077 and previous config saved to /var/cache/conftool/dbconfig/20231204-101120-arnaudb.json
[10:11:23] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[10:11:32] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[10:11:37] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[10:11:44] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54078 and previous config saved to /var/cache/conftool/dbconfig/20231204-101143-arnaudb.json
[10:16:15] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54079 and previous config saved to /var/cache/conftool/dbconfig/20231204-101615-arnaudb.json
[10:17:01] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 138997
[10:17:37] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 138997
[10:17:51] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1022.eqiad.wmnet with OS bookworm
[10:19:22] <wikibugs>	 (03PS1) 10Marostegui: Revert "dbproxy1022: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/979689
[10:20:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 35 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP
[10:20:37] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1022: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/979689 (owner: 10Marostegui)
[10:20:53] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 35 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP
[10:21:54] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/977734 (https://phabricator.wikimedia.org/T351936) (owner: 10Filippo Giunchedi)
[10:23:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:26:09] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to archiva-deployers for pfischer - https://phabricator.wikimedia.org/T352475 (10Gehel) Approved!
[10:26:26] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: adjust prometheus k8s retention to current utilization [puppet] - 10https://gerrit.wikimedia.org/r/979898 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi)
[10:28:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:28:22] <jayme>	 !log upgrade istio (buster -> bullseye) on wikikube eqiad - T351933
[10:28:25] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:26] <stashbot>	 T351933: Bump istio Docker images to Bookworm - https://phabricator.wikimedia.org/T351933
[10:29:52] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 237
[10:29:56] <wikibugs>	 (03PS9) 10Vgutierrez: lvs::realserver::ipip: Check that TCP MSS clamping is working [puppet] - 10https://gerrit.wikimedia.org/r/977696 (https://phabricator.wikimedia.org/T351069)
[10:30:29] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 237
[10:30:32] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 19165
[10:31:22] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P54080 and previous config saved to /var/cache/conftool/dbconfig/20231204-103121-arnaudb.json
[10:31:27] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 19165
[10:32:14] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 15305
[10:32:43] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 15305
[10:32:57] <jayme>	 !log upgrade istio (buster -> bullseye) on wikikube codfw - T351933
[10:33:00] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 398446
[10:33:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:33:13] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398446
[10:33:32] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 142505
[10:33:54] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 142505
[10:34:14] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 33604
[10:34:17] <wikibugs>	 (03PS1) 10David Caro: openstack,trove: increase api response alert to 3s [alerts] - 10https://gerrit.wikimedia.org/r/979899
[10:35:11] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 33604
[10:35:14] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4800
[10:35:53] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4800
[10:36:05] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 44592
[10:36:34] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 44592
[10:36:38] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 58952
[10:37:26] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 58952
[10:37:30] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 31898
[10:37:54] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: "I think the code can be made slightly more readable, see my suggestion. Otherwise LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey)
[10:38:16] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 31898
[10:38:20] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 63927
[10:38:46] <wikibugs>	 (03PS1) 10Urbanecm: User impact: sort datestring keys to ascending alphanumeric order [extensions/GrowthExperiments] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979690 (https://phabricator.wikimedia.org/T352349)
[10:39:12] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63927
[10:39:19] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 23856
[10:39:44] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 23856
[10:40:08] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove ganeti RAPI dummy certs [labs/private] - 10https://gerrit.wikimedia.org/r/979901 (https://phabricator.wikimedia.org/T350686)
[10:42:37] <wikibugs>	 (03PS1) 10Slyngshede: C:prometheus::node_exporter allow CPU flags collection [puppet] - 10https://gerrit.wikimedia.org/r/979902 (https://phabricator.wikimedia.org/T350694)
[10:42:59] <wikibugs>	 (03PS1) 10Vgutierrez: lvs::realserver::ipip: Clamp on lo too [puppet] - 10https://gerrit.wikimedia.org/r/979903 (https://phabricator.wikimedia.org/T351069)
[10:43:25] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Add 4x10G breakout cable to cr2-esams - https://phabricator.wikimedia.org/T347323 (10ayounsi) 05Open→03Resolved Ports freed up in T347403
[10:44:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:44:43] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/805/con" [puppet] - 10https://gerrit.wikimedia.org/r/979903 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[10:44:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/977733 (https://phabricator.wikimedia.org/T351936) (owner: 10Filippo Giunchedi)
[10:45:18] <wikibugs>	 (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Remove ganeti RAPI dummy certs [labs/private] - 10https://gerrit.wikimedia.org/r/979901 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff)
[10:46:21] <wikibugs>	 (03PS1) 10Muehlenhoff: Remove obsolete dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/979905
[10:46:28] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P54081 and previous config saved to /var/cache/conftool/dbconfig/20231204-104628-arnaudb.json
[10:46:34] <wikibugs>	 (03PS1) 10JMeybohm: Drop remaining k8s master cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/979906 (https://phabricator.wikimedia.org/T329826)
[10:47:01] <wikibugs>	 (03PS2) 10Vgutierrez: lvs::realserver::ipip: Clamp on lo too [puppet] - 10https://gerrit.wikimedia.org/r/979903 (https://phabricator.wikimedia.org/T351069)
[10:47:03] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Drop remaining k8s master cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/979906 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[10:47:55] <wikibugs>	 (03PS2) 10JMeybohm: Drop remaining k8s master cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/979906 (https://phabricator.wikimedia.org/T329826)
[10:48:11] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.dns.netbox
[10:48:18] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/806/con" [puppet] - 10https://gerrit.wikimedia.org/r/979903 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[10:48:38] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/807/console" [puppet] - 10https://gerrit.wikimedia.org/r/979906 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[10:49:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:49:51] <wikibugs>	 (03PS1) 10Elukey: ml-services: remove mlstaging ingress settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/979907
[10:50:29] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add service records for the k8s-ingress-dse endpoints - btullis@cumin1001"
[10:51:18] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add service records for the k8s-ingress-dse endpoints - btullis@cumin1001"
[10:51:18] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[10:52:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:53:58] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/808/con" [puppet] - 10https://gerrit.wikimedia.org/r/979902 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[10:54:39] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: eventschemas::service
[10:56:50] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] Remove obsolete dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/979905 (owner: 10Muehlenhoff)
[10:57:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[10:57:34] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/979906 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[10:58:44] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch eventschems::service to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/979908 (https://phabricator.wikimedia.org/T349619)
[11:00:04] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T1100)
[11:00:19] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[11:00:44] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[11:00:49] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+2] Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/977659 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková)
[11:00:59] <wikibugs>	 (03CR) 10Kamila Součková: [C: 03+2] Move mw api servers to kubernetes workers [homer/public] - 10https://gerrit.wikimedia.org/r/977660 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková)
[11:01:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54082 and previous config saved to /var/cache/conftool/dbconfig/20231204-110134-arnaudb.json
[11:01:37] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[11:01:45] <wikibugs>	 (03Merged) 10jenkins-bot: Move mw api servers to kubernetes workers [homer/public] - 10https://gerrit.wikimedia.org/r/977660 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková)
[11:01:47] <wikibugs>	 (03PS2) 10Kamila Součková: Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/977659 (https://phabricator.wikimedia.org/T351074)
[11:01:50] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[11:01:50] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[11:01:57] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T348183)', diff saved to https://phabricator.wikimedia.org/P54083 and previous config saved to /var/cache/conftool/dbconfig/20231204-110156-arnaudb.json
[11:03:46] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch eventschems::service to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/979908 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[11:04:41] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 04-1] "Python part LGTM, the Puppet part won't work as-is" [puppet] - 10https://gerrit.wikimedia.org/r/977696 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[11:05:37] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:05:54] <wikibugs>	 (03PS1) 10Brouberol: Add an entry related to the dse k8s cluster ingress gateway to conftool [puppet] - 10https://gerrit.wikimedia.org/r/979910 (https://phabricator.wikimedia.org/T352639)
[11:06:00] <wikibugs>	 (03PS1) 10Brouberol: Add the k8s-ingress-dse LVS service to the service list [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639)
[11:06:36] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T348183)', diff saved to https://phabricator.wikimedia.org/P54084 and previous config saved to /var/cache/conftool/dbconfig/20231204-110635-arnaudb.json
[11:06:51] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[11:07:07] <wikibugs>	 (03PS10) 10Vgutierrez: lvs::realserver::ipip: Check that TCP MSS clamping is working [puppet] - 10https://gerrit.wikimedia.org/r/977696 (https://phabricator.wikimedia.org/T351069)
[11:07:11] <wikibugs>	 (03PS2) 10Brouberol: Add the k8s-ingress-dse LVS service to the service list [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639)
[11:07:28] <wikibugs>	 (03CR) 10Vgutierrez: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/977696 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[11:08:37] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: eventschemas::service
[11:10:05] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[11:10:19] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[11:11:03] <wikibugs>	 (03PS2) 10Brouberol: Define a DNS A record for the dse k8s ingress gateway [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639)
[11:11:05] <wikibugs>	 (03PS2) 10Brouberol: Enable ingress for the spark-history server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/979892 (https://phabricator.wikimedia.org/T352639)
[11:11:07] <wikibugs>	 (03PS3) 10Elukey: cert-manager: bump appVersion [deployment-charts] - 10https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933)
[11:11:32] <wikibugs>	 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff)
[11:12:27] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10Xqt) a:05Xqt→03None @Dzahn:  > @Xqt Would you like us to keep your real name out of public repos or you don't mind? I propose not to publish my real name if possible.  >...
[11:13:28] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] openstack,trove: increase api response alert to 3s [alerts] - 10https://gerrit.wikimedia.org/r/979899 (owner: 10David Caro)
[11:14:30] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] lvs::realserver::ipip: Check that TCP MSS clamping is working [puppet] - 10https://gerrit.wikimedia.org/r/977696 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[11:15:37] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw2422.codfw.wmnet with OS bullseye
[11:16:24] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/979903 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[11:17:12] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw1462.eqiad.wmnet with OS bullseye
[11:19:37] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1038 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:20:37] <wikibugs>	 (03PS4) 10Elukey: cert-manager: bump version in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933)
[11:21:16] <wikibugs>	 (03PS5) 10Elukey: cert-manager: bump version in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933)
[11:21:28] <wikibugs>	 (03CR) 10Elukey: cert-manager: bump version in staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey)
[11:21:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P54085 and previous config saved to /var/cache/conftool/dbconfig/20231204-112141-arnaudb.json
[11:22:33] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:24:44] <wikibugs>	 (03PS1) 10Dreamy Jazz: Enable read new for event tables migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829)
[11:26:28] <wikibugs>	 (03PS2) 10EoghanGaffney: [apt-staging] Add script to pull artifacts from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/979912
[11:28:44] <wikibugs>	 (03PS1) 10Klausman: hiera: clean up more ORES leftovers [labs/private] - 10https://gerrit.wikimedia.org/r/979915 (https://phabricator.wikimedia.org/T347278)
[11:29:04] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/978466 (owner: 10Muehlenhoff)
[11:29:06] <wikibugs>	 (03PS6) 10Elukey: cert-manager: bump version in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933)
[11:29:31] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/977088 (owner: 10Muehlenhoff)
[11:29:57] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/977181 (owner: 10Muehlenhoff)
[11:30:18] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1462.eqiad.wmnet with reason: host reimage
[11:30:40] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Nice, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/978539 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah)
[11:30:46] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] hiera: clean up more ORES leftovers [labs/private] - 10https://gerrit.wikimedia.org/r/979915 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman)
[11:31:00] <wikibugs>	 (03PS1) 10Klausman: profiles: Remove more ORES leftovers [puppet] - 10https://gerrit.wikimedia.org/r/979916 (https://phabricator.wikimedia.org/T347278)
[11:31:30] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Remove analytics_cluster::hadoop::client role [puppet] - 10https://gerrit.wikimedia.org/r/979338 (owner: 10Muehlenhoff)
[11:32:19] <wikibugs>	 (03CR) 10Elukey: profiles: Remove more ORES leftovers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979916 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman)
[11:32:28] <wikibugs>	 (03CR) 10Btullis: "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/979333 (https://phabricator.wikimedia.org/T352193) (owner: 10Muehlenhoff)
[11:32:33] <logmsgbot>	 !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2422.codfw.wmnet with reason: host reimage
[11:32:49] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] deployment_server: add mcrouter service 1 [puppet] - 10https://gerrit.wikimedia.org/r/979339 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[11:32:59] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Add namespace for mcrouter service 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/979340 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[11:33:34] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1462.eqiad.wmnet with reason: host reimage
[11:34:17] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] cert-manager: bump version in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey)
[11:34:19] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Drop remaining k8s master cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/979906 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm)
[11:35:22] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] cert-manager: bump version in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey)
[11:36:19] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2422.codfw.wmnet with reason: host reimage
[11:36:49] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P54086 and previous config saved to /var/cache/conftool/dbconfig/20231204-113648-arnaudb.json
[11:37:41] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Rewrite metrics sent by Airflow [puppet] - 10https://gerrit.wikimedia.org/r/979118 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu)
[11:38:11] <wikibugs>	 (03PS1) 10Marostegui: dbproxy1027: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/979917 (https://phabricator.wikimedia.org/T351864)
[11:38:49] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] dbproxy1027: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/979917 (https://phabricator.wikimedia.org/T351864) (owner: 10Marostegui)
[11:39:25] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1027.eqiad.wmnet with OS bookworm
[11:39:28] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[11:39:35] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] "I can deploy this whenever it's convenient for you. I was wondering whether you need to coordinate it with an airflow-dags deployment of t" [puppet] - 10https://gerrit.wikimedia.org/r/979118 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu)
[11:39:55] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] lvs::realserver::ipip: Clamp on lo too [puppet] - 10https://gerrit.wikimedia.org/r/979903 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[11:40:01] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[11:40:29] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T352653 (10ArthurTaylor) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expectations: Please note that public tasks i...
[11:40:49] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T352653 (10ArthurTaylor)
[11:41:24] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "This looks good to me now. It matches the address in https://netbox.wikimedia.org/ipam/ip-addresses/15582/" [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[11:42:02] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Add an entry related to the dse k8s cluster ingress gateway to conftool [puppet] - 10https://gerrit.wikimedia.org/r/979910 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[11:42:08] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'sync'.
[11:42:21] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 44592
[11:42:40] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'.
[11:42:58] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 44592
[11:43:18] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'.
[11:43:58] <logmsgbot>	 !log elukey@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'.
[11:44:09] <wikibugs>	 (03PS4) 10KartikMistry: Update cxserver to 2023-12-04-083437-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982)
[11:45:15] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:url_downloader add blackbox exporter. [puppet] - 10https://gerrit.wikimedia.org/r/973780 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede)
[11:47:08] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "This also looks good to me, but I would also recommend a second opinion from someone else who knows the service catalog well." [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[11:47:42] <wikibugs>	 (03PS1) 10Jcrespo: Implement batch deletion, restoration and query of files [software/mediabackups] - 10https://gerrit.wikimedia.org/r/979919 (https://phabricator.wikimedia.org/T352655)
[11:48:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Implement batch deletion, restoration and query of files [software/mediabackups] - 10https://gerrit.wikimedia.org/r/979919 (https://phabricator.wikimedia.org/T352655) (owner: 10Jcrespo)
[11:51:08] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1462.eqiad.wmnet with OS bullseye
[11:51:55] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T348183)', diff saved to https://phabricator.wikimedia.org/P54087 and previous config saved to /var/cache/conftool/dbconfig/20231204-115154-arnaudb.json
[11:51:57] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[11:51:59] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[11:52:10] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[11:52:17] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T348183)', diff saved to https://phabricator.wikimedia.org/P54088 and previous config saved to /var/cache/conftool/dbconfig/20231204-115217-arnaudb.json
[11:52:27] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the production Swift cluster - https://phabricator.wikimedia.org/T350020 (10jcrespo)
[11:52:42] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the production Swift cluster - https://phabricator.wikimedia.org/T350020 (10jcrespo) Updating title to reflect current request.
[11:53:15] <wikibugs>	 (03CR) 10Brouberol: "Claime, as you added the k8s-ingress-aux service, could I ask for a review for something similar? Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[11:53:57] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1027.eqiad.wmnet with reason: host reimage
[11:54:38] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[11:54:46] <logmsgbot>	 !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2422.codfw.wmnet with OS bullseye
[11:54:56] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T348183)', diff saved to https://phabricator.wikimedia.org/P54089 and previous config saved to /var/cache/conftool/dbconfig/20231204-115455-arnaudb.json
[11:56:50] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1027.eqiad.wmnet with reason: host reimage
[12:00:55] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host druid1011.eqiad.wmnet
[12:01:15] <wikibugs>	 (03CR) 10Jelto: "It seems this change introduced some problems with Puppet runs on contint (role::ci) hosts. They fail with" [puppet] - 10https://gerrit.wikimedia.org/r/977687 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi)
[12:01:40] <wikibugs>	 (03PS1) 10Ladsgroup: Bump ParserCache TTL back to 30 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604)
[12:01:59] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch druid1011 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/979921 (https://phabricator.wikimedia.org/T349619)
[12:04:14] <urbanecm>	 jouncebot: nowandnext
[12:04:14] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 55 minute(s)
[12:04:14] <jouncebot>	 In 1 hour(s) and 55 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T1400)
[12:04:31] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] User impact: sort datestring keys to ascending alphanumeric order [extensions/GrowthExperiments] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979690 (https://phabricator.wikimedia.org/T352349) (owner: 10Urbanecm)
[12:05:00] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch druid1011 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/979921 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:07:40] <wikibugs>	 (03PS2) 10Klausman: profiles: Remove more ORES leftovers [puppet] - 10https://gerrit.wikimedia.org/r/979916 (https://phabricator.wikimedia.org/T347278)
[12:08:07] <wikibugs>	 (03CR) 10Klausman: "I want to wait for tavvi's answer regarding the generated file before submitting this." [puppet] - 10https://gerrit.wikimedia.org/r/979916 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman)
[12:08:35] <wikibugs>	 (03CR) 10Tacsipacsi: Bump ParserCache TTL back to 30 days (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604) (owner: 10Ladsgroup)
[12:09:58] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host druid1011.eqiad.wmnet
[12:09:58] <wikibugs>	 (03PS1) 10Marostegui: Revert "dbproxy1027: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/979691
[12:10:02] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P54090 and previous config saved to /var/cache/conftool/dbconfig/20231204-121002-arnaudb.json
[12:11:16] <wikibugs>	 (03CR) 10Clément Goubert: "The overall code and logic LGTM, but some of the changes should in my opinion be spun off into new module versions." [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[12:11:30] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1027: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/979691 (owner: 10Marostegui)
[12:12:46] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Deploy kube-state-metrics to the dse-k8s cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/978504 (https://phabricator.wikimedia.org/T264625) (owner: 10Btullis)
[12:13:02] <wikibugs>	 (03CR) 10Kosta Harlan: [C: 03+1] Enable read new for event tables migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz)
[12:15:16] <wikibugs>	 (03Merged) 10jenkins-bot: Deploy kube-state-metrics to the dse-k8s cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/978504 (https://phabricator.wikimedia.org/T264625) (owner: 10Btullis)
[12:15:56] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1027.eqiad.wmnet with OS bookworm
[12:18:17] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[12:19:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host an-druid1005.eqiad.wmnet
[12:19:47] <logmsgbot>	 !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[12:21:31] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch an-druid1005 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/979924 (https://phabricator.wikimedia.org/T349619)
[12:22:40] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch an-druid1005 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/979924 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[12:25:09] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P54091 and previous config saved to /var/cache/conftool/dbconfig/20231204-122508-arnaudb.json
[12:25:22] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979690 (https://phabricator.wikimedia.org/T352349) (owner: 10Urbanecm)
[12:25:31] <wikibugs>	 (03Merged) 10jenkins-bot: User impact: sort datestring keys to ascending alphanumeric order [extensions/GrowthExperiments] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979690 (https://phabricator.wikimedia.org/T352349) (owner: 10Urbanecm)
[12:25:44] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:979690|User impact: sort datestring keys to ascending alphanumeric order (T352349 T351898)]]
[12:25:49] <stashbot>	 T352349: Impact Module: Views on articles you've edited graph - https://phabricator.wikimedia.org/T352349
[12:25:49] <stashbot>	 T351898: Reduce size of growthexperiments_user_impact.geui_data json blobs - https://phabricator.wikimedia.org/T351898
[12:25:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/979926 (owner: 10L10n-bot)
[12:27:47] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host an-druid1005.eqiad.wmnet
[12:28:19] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:979690|User impact: sort datestring keys to ascending alphanumeric order (T352349 T351898)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[12:28:23] <wikibugs>	 (03PS1) 10Elukey: admin_ng: deploy kube-state-metrics on all ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/979930 (https://phabricator.wikimedia.org/T264625)
[12:29:11] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm: Continuing with sync
[12:29:47] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/979465
[12:33:39] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Missing the reverse PTR for both eqiad and codfw (as a commented line reserved for)" [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[12:35:21] <wikibugs>	 (03CR) 10Clément Goubert: [C: 04-1] Add the k8s-ingress-dse LVS service to the service list (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[12:35:28] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:979690|User impact: sort datestring keys to ascending alphanumeric order (T352349 T351898)]] (duration: 09m 43s)
[12:35:36] <stashbot>	 T352349: Impact Module: Views on articles you've edited graph - https://phabricator.wikimedia.org/T352349
[12:35:36] <stashbot>	 T351898: Reduce size of growthexperiments_user_impact.geui_data json blobs - https://phabricator.wikimedia.org/T351898
[12:40:15] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T348183)', diff saved to https://phabricator.wikimedia.org/P54092 and previous config saved to /var/cache/conftool/dbconfig/20231204-124015-arnaudb.json
[12:40:17] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[12:40:20] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[12:40:31] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[12:40:38] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T348183)', diff saved to https://phabricator.wikimedia.org/P54093 and previous config saved to /var/cache/conftool/dbconfig/20231204-124037-arnaudb.json
[12:43:17] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T348183)', diff saved to https://phabricator.wikimedia.org/P54094 and previous config saved to /var/cache/conftool/dbconfig/20231204-124316-arnaudb.json
[12:44:15] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] firewall: Remove special case handling for flerovium [puppet] - 10https://gerrit.wikimedia.org/r/979333 (https://phabricator.wikimedia.org/T352193) (owner: 10Muehlenhoff)
[12:47:38] <wikibugs>	 (03PS3) 10Brouberol: Define a DNS A record for the dse k8s ingress gateway [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639)
[12:47:40] <wikibugs>	 (03PS3) 10Brouberol: Enable ingress for the spark-history server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/979892 (https://phabricator.wikimedia.org/T352639)
[12:48:54] <icinga-wm>	 RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:49:24] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[12:49:28] <wikibugs>	 (03PS7) 10MdsShakil: Create new namespaces and namespace aliases for bd.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977196 (https://phabricator.wikimedia.org/T351903)
[12:49:53] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Remove analytics_cluster::hadoop::client role [puppet] - 10https://gerrit.wikimedia.org/r/979338 (owner: 10Muehlenhoff)
[12:51:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] archiva: Update outdated comments [puppet] - 10https://gerrit.wikimedia.org/r/978466 (owner: 10Muehlenhoff)
[12:52:14] <wikibugs>	 10SRE-tools, 10Dumps-Generation, 10Infrastructure-Foundations, 10serviceops, and 2 others: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (10akosiaris) @Volans   All of these (which can be grouped in 2 just 2 categores, **mw** and **mc**, have be...
[12:52:59] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] statistics::web: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/977088 (owner: 10Muehlenhoff)
[12:53:35] <wikibugs>	 (03CR) 10Clément Goubert: [C: 04-1] Add the k8s-ingress-dse LVS service to the service list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[12:54:15] <wikibugs>	 (03CR) 10Brouberol: Add the k8s-ingress-dse LVS service to the service list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[12:54:27] <wikibugs>	 (03PS3) 10Brouberol: Add the k8s-ingress-dse LVS service to the service list [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639)
[12:55:01] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add the k8s-ingress-dse LVS service to the service list [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[12:55:44] <wikibugs>	 (03PS4) 10Brouberol: Add the k8s-ingress-dse LVS service to the service list [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639)
[12:56:20] <wikibugs>	 10SRE-tools, 10Dumps-Generation, 10Infrastructure-Foundations, 10serviceops, and 2 others: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (10Volans) @akosiaris sure, and having a cluster deemed as *not* IPv6 ready is totally ok. The problem arise...
[12:56:42] <wikibugs>	 (03CR) 10Brouberol: Define a DNS A record for the dse k8s ingress gateway (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[12:57:56] <wikibugs>	 (03PS4) 10Brouberol: Enable ingress for the spark-history server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/979892 (https://phabricator.wikimedia.org/T352639)
[12:58:23] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P54095 and previous config saved to /var/cache/conftool/dbconfig/20231204-125823-arnaudb.json
[12:59:58] <wikibugs>	 (03PS1) 10Hnowlan: jobqueue: reduce ThumbnailRender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/979942 (https://phabricator.wikimedia.org/T337649)
[13:00:29] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] analytics::postgresql: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/977181 (owner: 10Muehlenhoff)
[13:03:12] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] jobqueue: reduce ThumbnailRender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/979942 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan)
[13:03:25] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] jobqueue: reduce ThumbnailRender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/979942 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan)
[13:04:28] <wikibugs>	 (03Merged) 10jenkins-bot: jobqueue: reduce ThumbnailRender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/979942 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan)
[13:04:40] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[13:04:51] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+1] Add the k8s-ingress-dse LVS service to the service list [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[13:05:01] <wikibugs>	 (03PS1) 10Filippo Giunchedi: hieradata: update kubernetes::clusters in CI [puppet] - 10https://gerrit.wikimedia.org/r/979943 (https://phabricator.wikimedia.org/T351179)
[13:05:04] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[13:05:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] hieradata: update kubernetes::clusters in CI [puppet] - 10https://gerrit.wikimedia.org/r/979943 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi)
[13:05:46] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[13:05:48] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] k8s: allow setting prometheus retention in cluster definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977687 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi)
[13:06:08] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[13:06:21] <TheresNoTime>	 hej, just bubbling T352628 and T352659 up — `Wikimedia\Rdbms\DBQueryError` but maybe an issue with the jobqueue?
[13:06:22] <stashbot>	 T352628: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628
[13:06:22] <stashbot>	 T352659: [13f3f15c-98c2-4126-8e87-6d6d81706e13] 2023-12-04 12:39:58: Fatal exception of type "Wikimedia\Rdbms\DBQueryError" - https://phabricator.wikimedia.org/T352659
[13:06:31] <wikibugs>	 (03PS2) 10Filippo Giunchedi: hieradata: update kubernetes::clusters in CI [puppet] - 10https://gerrit.wikimedia.org/r/979943 (https://phabricator.wikimedia.org/T351179)
[13:06:47] <TheresNoTime>	 hnowlan: sorry for the ping, seeing that you're touching jobqueue at the moment?
[13:07:55] <wikibugs>	 (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979943 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi)
[13:08:30] <TheresNoTime>	 ^ T352663
[13:08:30] <stashbot>	 T352663: JobQueueError: Could not enqueue jobs - https://phabricator.wikimedia.org/T352663
[13:09:45] <wikibugs>	 (03PS2) 10Jcrespo: Implement batch deletion, restoration and query of files [software/mediabackups] - 10https://gerrit.wikimedia.org/r/979919 (https://phabricator.wikimedia.org/T352655)
[13:10:58] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me. I" [puppet] - 10https://gerrit.wikimedia.org/r/979912 (owner: 10EoghanGaffney)
[13:10:59] <hnowlan>	 TheresNoTime: good shout, it's not related to that change but it is most likely related to something I've been doing recently 
[13:11:15] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/811/console" [puppet] - 10https://gerrit.wikimedia.org/r/977733 (https://phabricator.wikimedia.org/T351936) (owner: 10Filippo Giunchedi)
[13:12:07] <wikibugs>	 (03CR) 10Filippo Giunchedi: "PCC failed though only in 'prod' which is expected" [puppet] - 10https://gerrit.wikimedia.org/r/979943 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi)
[13:13:30] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P54096 and previous config saved to /var/cache/conftool/dbconfig/20231204-131329-arnaudb.json
[13:14:34] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T352653 (10Aklapper) Hi and welcome! Unrelated: Could you please also [connect your WMDE SUL account on mediawiki.org](https://phabricator.wikimedia.org/settings/panel/external/) to your Phab account?...
[13:15:01] <wikibugs>	 (03PS10) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366)
[13:15:42] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[13:16:22] <icinga-wm>	 PROBLEM - Check systemd state on ml-serve1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:16:36] <icinga-wm>	 PROBLEM - Check systemd state on kubernetes1027 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:17:03] <wikibugs>	 (03PS11) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366)
[13:17:17] <wikibugs>	 (03PS1) 10Arnaudb: mariadb: add db2194 to multiinstance pool [puppet] - 10https://gerrit.wikimedia.org/r/979946 (https://phabricator.wikimedia.org/T343674)
[13:17:49] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[13:19:07] <wikibugs>	 (03PS3) 10Jcrespo: Implement batch deletion, restoration and query of files [software/mediabackups] - 10https://gerrit.wikimedia.org/r/979919 (https://phabricator.wikimedia.org/T352655)
[13:20:00] <wikibugs>	 (03PS12) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366)
[13:20:42] <hnowlan>	 TheresNoTime: still looking but it seems unlikely to be related to my work - we're migrating jobs to the k8s jobrunners, but that job hasn't been touched yet
[13:20:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[13:22:13] <wikibugs>	 (03CR) 10David Caro: [C: 03+2] openstack,trove: increase api response alert to 3s [alerts] - 10https://gerrit.wikimedia.org/r/979899 (owner: 10David Caro)
[13:22:28] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[13:22:43] <wikibugs>	 (03PS2) 10Brouberol: Add an entry related to the dse k8s cluster ingress gateway to conftool [puppet] - 10https://gerrit.wikimedia.org/r/979910 (https://phabricator.wikimedia.org/T352639)
[13:22:45] <wikibugs>	 (03PS5) 10Brouberol: Add the k8s-ingress-dse LVS service to the service list [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639)
[13:22:51] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[13:22:58] <moritzm>	 !log installing libde265 security updates
[13:23:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:23:00] <TheresNoTime>	 hnowlan: hm, ack — thanks for looking.. the `JobQueueError`s do seem to have died down a little (started at around 12:45 UTC and finished(?) at 13:16 UTC, does that match anything changing that you know of?)
[13:23:17] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[13:23:30] <wikibugs>	 (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979910 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[13:24:54] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1027 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:25:14] <wikibugs>	 (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: re-introduce distro-specific node-exporter arguments [puppet] - 10https://gerrit.wikimedia.org/r/977733 (https://phabricator.wikimedia.org/T351936) (owner: 10Filippo Giunchedi)
[13:25:44] <icinga-wm>	 RECOVERY - Check systemd state on kubernetes1027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:25:59] <hnowlan>	 TheresNoTime: no, the only jobs that would be changing were related to thumbor and my deploy started/finished within that window but not in any way that aligned :( 
[13:26:36] <hnowlan>	 that error looks pretty clearly pointing to the queries being run which the jobrunner migration wouldn't affect at all
[13:26:59] <wikibugs>	 (03PS6) 10Brouberol: Add the k8s-ingress-dse LVS service to the service list [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639)
[13:27:32] <icinga-wm>	 PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:27:53] <wikibugs>	 (03PS13) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366)
[13:27:55] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T352653 (10ArthurTaylor) Done!
[13:28:32] <wikibugs>	 (03CR) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[13:28:34] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[13:28:37] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T348183)', diff saved to https://phabricator.wikimedia.org/P54097 and previous config saved to /var/cache/conftool/dbconfig/20231204-132836-arnaudb.json
[13:28:38] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[13:28:41] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[13:28:48] <wikibugs>	 (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/812/console" [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[13:28:53] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[13:28:59] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1222 (T348183)', diff saved to https://phabricator.wikimedia.org/P54098 and previous config saved to /var/cache/conftool/dbconfig/20231204-132859-arnaudb.json
[13:30:07] <wikibugs>	 (03CR) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[13:30:51] <wikibugs>	 (03CR) 10JMeybohm: mcrouter: add chart (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[13:30:53] <moritzm>	 !log installing dbus security updates on buster
[13:30:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:33:23] <wikibugs>	 (03CR) 10JMeybohm: "Do we plan to just run one mcrouter deployment per cluster? If not, mcrouter is a too generic name IMHO. Does it maybe make sense to run a" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[13:33:29] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T348183)', diff saved to https://phabricator.wikimedia.org/P54099 and previous config saved to /var/cache/conftool/dbconfig/20231204-133328-arnaudb.json
[13:34:16] <wikibugs>	 (03CR) 10Marostegui: "Just a brief comment here" [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: 10Ladsgroup)
[13:35:39] <wikibugs>	 (03PS14) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366)
[13:36:26] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[13:36:56] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: exclude timer units from systemd collector [puppet] - 10https://gerrit.wikimedia.org/r/977734 (https://phabricator.wikimedia.org/T351936) (owner: 10Filippo Giunchedi)
[13:37:03] <wikibugs>	 (03PS3) 10Filippo Giunchedi: prometheus: exclude timer units from systemd collector [puppet] - 10https://gerrit.wikimedia.org/r/977734 (https://phabricator.wikimedia.org/T351936)
[13:37:54] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T352653 (10WMDE-leszek) I support this request from WMDE's side.
[13:38:06] <wikibugs>	 (03CR) 10Marostegui: mariadb: add db2194 to multiinstance pool (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979946 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb)
[13:39:13] <wikibugs>	 (03PS2) 10Arnaudb: mariadb: add db2194 to multiinstance pool [puppet] - 10https://gerrit.wikimedia.org/r/979946 (https://phabricator.wikimedia.org/T343674)
[13:39:31] <wikibugs>	 (03CR) 10Arnaudb: "this has been fixed!" [puppet] - 10https://gerrit.wikimedia.org/r/979946 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb)
[13:39:33] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] mariadb: add db2194 to multiinstance pool [puppet] - 10https://gerrit.wikimedia.org/r/979946 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb)
[13:39:35] <wikibugs>	 (03CR) 10Brouberol: [C: 03+2] Add an entry related to the dse k8s cluster ingress gateway to conftool [puppet] - 10https://gerrit.wikimedia.org/r/979910 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol)
[13:39:53] <wikibugs>	 (03CR) 10Arnaudb: [C: 03+2] mariadb: add db2194 to multiinstance pool [puppet] - 10https://gerrit.wikimedia.org/r/979946 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb)
[13:40:34] <wikibugs>	 (03CR) 10Jcrespo: [C: 03+1] mariadb: add db2194 to multiinstance pool [puppet] - 10https://gerrit.wikimedia.org/r/979946 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb)
[13:42:19] <wikibugs>	 (03PS15) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366)
[13:42:59] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[13:43:00] <logmsgbot>	 !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons.
[13:43:45] <wikibugs>	 (03PS16) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366)
[13:44:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[13:45:36] <wikibugs>	 (03CR) 10Atieno: [C: 03+1] Bump ParserCache TTL back to 30 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604) (owner: 10Ladsgroup)
[13:45:39] <wikibugs>	 (03PS2) 10Ladsgroup: Bump ParserCache TTL back to 30 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604)
[13:46:21] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: deploy kube-state-metrics on all ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/979930 (https://phabricator.wikimedia.org/T264625) (owner: 10Elukey)
[13:46:41] <wikibugs>	 10SRE, 10HyperSwitch, 10Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (10TheresNoTime)
[13:46:46] <wikibugs>	 (03CR) 10Ladsgroup: Bump ParserCache TTL back to 30 days (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604) (owner: 10Ladsgroup)
[13:46:53] <Amir1>	 jouncebot: nowandnext
[13:46:54] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 13 minute(s)
[13:46:54] <jouncebot>	 In 0 hour(s) and 13 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T1400)
[13:48:35] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P54100 and previous config saved to /var/cache/conftool/dbconfig/20231204-134835-arnaudb.json
[13:52:02] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[13:52:02] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] Bring an-coord1003 into service as a hadoop coordinator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis)
[13:52:09] <wikibugs>	 (03CR) 10Btullis: [V: 03+1 C: 03+2] Bring an-coord1003 into service as a hadoop coordinator [puppet] - 10https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis)
[13:52:25] <wikibugs>	 (03PS17) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366)
[13:52:28] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[13:53:17] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[13:55:22] <icinga-wm>	 RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1027 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[13:55:24] <wikibugs>	 (03PS18) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366)
[13:56:04] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[13:56:32] <moritzm>	 !log installing postgresql-13 security updates
[13:56:34] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:56:43] <wikibugs>	 (03PS1) 10Kosta Harlan: MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979969
[13:57:24] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979969 (owner: 10Kosta Harlan)
[13:57:34] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[13:57:49] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Enable read new for event tables migration on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz)
[13:58:19] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[13:59:21] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[13:59:57] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[14:00:05] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T1400).
[14:00:05] <jouncebot>	 James_F, Dreamy_Jazz, and MdsShakil: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[14:00:17] <Lucas_WMDE>	 o/
[14:00:18] <Dreamy_Jazz>	 \o
[14:00:24] <James_F>	 \o/
[14:00:30] <James_F>	 Now we've got a complete set.
[14:00:32] <MdsShakil>	 Hello
[14:00:55] <TheresNoTime>	 Lucas_WMDE: FYI T352628, don't think it's a deploy stopper
[14:00:55] <stashbot>	 T352628: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628
[14:01:33] <Lucas_WMDE>	 *nods*
[14:01:46] <Lucas_WMDE>	 James_F: I assume you’ll self-service?
[14:01:54] * Lucas_WMDE has no idea what to do about that transaction size error unfortunately
[14:02:13] <wikibugs>	 (03PS19) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366)
[14:02:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[14:03:10] <wikibugs>	 (03PS2) 10Dreamy Jazz: Enable read new for event tables migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829)
[14:03:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P54101 and previous config saved to /var/cache/conftool/dbconfig/20231204-140341-arnaudb.json
[14:03:48] <wikibugs>	 (03CR) 10Dreamy Jazz: Enable read new for event tables migration on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz)
[14:04:02] <wikibugs>	 (03PS1) 10Peter Fischer: enable page_rerender for commonswiki, frwiki, itwiki, and wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979970
[14:04:04] <wikibugs>	 (03PS6) 10Hnowlan: rest-gateway: add device-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/970823
[14:04:25] <James_F>	 Lucas_WMDE: Oh, sure.
[14:04:44] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979362 (https://phabricator.wikimedia.org/T352532) (owner: 10Jforrester)
[14:04:46] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495) (owner: 10Terasail)
[14:05:32] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctionswiki: Disable thumbnail in Vector search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979362 (https://phabricator.wikimedia.org/T352532) (owner: 10Jforrester)
[14:05:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wikifunctionswiki: Add ability for sysops to manage Functioneer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495) (owner: 10Terasail)
[14:06:03] <wikibugs>	 (03CR) 10Dreamy Jazz: Enable read new for event tables migration on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz)
[14:06:51] <James_F>	 Meh.
[14:06:55] <wikibugs>	 (03PS1) 10Arnaudb: homedir: add tmux.conf [puppet] - 10https://gerrit.wikimedia.org/r/979947 (https://phabricator.wikimedia.org/T348183)
[14:07:05] <wikibugs>	 (03PS5) 10Jforrester: wikifunctionswiki: Add ability for sysops to manage Functioneer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495) (owner: 10Terasail)
[14:07:09] <wikibugs>	 (03CR) 10TrainBranchBot: "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495) (owner: 10Terasail)
[14:07:27] <James_F>	 Dear CI, please don't flake when I'm deploying, kthxbai.
[14:07:51] <wikibugs>	 (03Merged) 10jenkins-bot: wikifunctionswiki: Add ability for sysops to manage Functioneer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495) (owner: 10Terasail)
[14:08:07] <Dreamy_Jazz>	 To be able to test my config change I would need to be given the checkuser group on testwiki.
[14:08:08] <logmsgbot>	 !log jforrester@deploy2002 Started scap: Backport for [[gerrit:979362|wikifunctionswiki: Disable thumbnail in Vector search (T352532)]], [[gerrit:979180|wikifunctionswiki: Add ability for sysops to manage Functioneer (T352495)]]
[14:08:13] <stashbot>	 T352532: Disable Vector 2022 search thumbnails on Wikifunctions - https://phabricator.wikimedia.org/T352532
[14:08:14] <stashbot>	 T352495: Add ability for administrators to add and remove functioneer - https://phabricator.wikimedia.org/T352495
[14:08:53] <Lucas_WMDE>	 hm, not sure if it’s okay to hand out that group tbh :/
[14:08:56] <Lucas_WMDE>	 even temporarily and on testwiki
[14:09:00] <Lucas_WMDE>	 it’s still real IP addresses…
[14:09:07] <Dreamy_Jazz>	 It's been done before.
[14:09:09] <Lucas_WMDE>	 but I don’t know the usual process to gain that right
[14:09:09] <Lucas_WMDE>	 ok
[14:09:25] <logmsgbot>	 !log jforrester@deploy2002 jforrester and terasail: Backport for [[gerrit:979362|wikifunctionswiki: Disable thumbnail in Vector search (T352532)]], [[gerrit:979180|wikifunctionswiki: Add ability for sysops to manage Functioneer (T352495)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:09:32] <Lucas_WMDE>	 do you have a link to some chat archive or log entry where it happened?
[14:09:38] <James_F>	 Lucas_WMDE: Because CU isn't available on beta cluster it gets more use on testwiki than it should.
[14:10:03] <logmsgbot>	 !log jforrester@deploy2002 jforrester and terasail: Continuing with sync
[14:10:43] <wikibugs>	 (03PS20) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366)
[14:10:50] <Lucas_WMDE>	 T337126 confirms NDA, at least
[14:10:50] <stashbot>	 T337126: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126
[14:10:52] <wikibugs>	 10SRE, 10HyperSwitch, 10Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (10Ladsgroup) I don't know how restbase or hyperswitch ended up in critical path of saving edits, that is a rather important issue we need to check....
[14:11:06] <Dreamy_Jazz>	 See https://test.wikipedia.org/wiki/Special:UserRights/Dreamy_Jazz
[14:11:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01)
[14:11:35] <Lucas_WMDE>	 ack
[14:12:04] <wikibugs>	 10SRE, 10HyperSwitch, 10RESTBase, 10Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (10TheresNoTime)
[14:12:28] <Lucas_WMDE>	 would probably be good to have a steward around to give you the right
[14:12:32] <Lucas_WMDE>	 IIRC createAndPromote.php isn’t logged as well
[14:12:33] <wikibugs>	 (CR) Marostegui: [C: +1] homedir: add tmux.conf [puppet] - https://gerrit.wikimedia.org/r/979947 (https://phabricator.wikimedia.org/T348183) (owner: Arnaudb)
[14:12:35] <wikibugs>	 (PS21) D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366)
[14:12:39] <Lucas_WMDE>	 (though it would be an option)
[14:12:58] <wikibugs>	 (CR) Arnaudb: [C: +2] homedir: add tmux.conf [puppet] - https://gerrit.wikimedia.org/r/979947 (https://phabricator.wikimedia.org/T348183) (owner: Arnaudb)
[14:13:03] <Dreamy_Jazz>	 Perhaps Urbanecm could?
[14:13:30] <Dreamy_Jazz>	 (listed as being on this window)
[14:13:41] <wikibugs>	 (CR) CI reject: [V: -1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: D3r1ck01)
[14:14:04] <Dreamy_Jazz>	 Otherwise I'm happy to delay and coordinate with them to make the change.
[14:14:14] <Dreamy_Jazz>	 in a later window.
[14:15:45] <James_F>	 OK, PHP-restarts are finally finishing, over to Lucas_WMDE, sorry for the slowness of scap.
[14:15:50] <logmsgbot>	 !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:979362|wikifunctionswiki: Disable thumbnail in Vector search (T352532)]], [[gerrit:979180|wikifunctionswiki: Add ability for sysops to manage Functioneer (T352495)]] (duration: 07m 41s)
[14:15:57] <stashbot>	 T352532: Disable Vector 2022 search thumbnails on Wikifunctions - https://phabricator.wikimedia.org/T352532
[14:15:58] <stashbot>	 T352495: Add ability for administrators to add and remove functioneer - https://phabricator.wikimedia.org/T352495
[14:15:58] <Lucas_WMDE>	 alright
[14:16:31] <wikibugs>	 (PS22) D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366)
[14:16:40] <wikibugs>	 (CR) Hnowlan: [C: +2] rest-gateway: add device-analytics [deployment-charts] - https://gerrit.wikimedia.org/r/970823 (owner: Hnowlan)
[14:16:53] * Lucas_WMDE digs up yubikey
[14:17:12] <wikibugs>	 (CR) CI reject: [V: -1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: D3r1ck01)
[14:17:20] <urbanecm>	 Dreamy_Jazz: hey, i saw your slack ping
[14:17:26] <Dreamy_Jazz>	 Thanks.
[14:17:30] <wikibugs>	 (Merged) jenkins-bot: rest-gateway: add device-analytics [deployment-charts] - https://gerrit.wikimedia.org/r/970823 (owner: Hnowlan)
[14:17:36] <Dreamy_Jazz>	 The change is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/979914/
[14:17:44] <urbanecm>	 Dreamy_Jazz: you just need the testwiki cu flag, right? or am i supposed to deploy sth as well?
[14:17:47] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Enable read new for event tables migration on testwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829) (owner: Dreamy Jazz)
[14:17:59] <Lucas_WMDE>	 urbanecm: I can deploy, unless you want to :)
[14:18:13] <Lucas_WMDE>	 but I can’t give out the right
[14:18:16] <urbanecm>	 i'd prefer someone else to deploy if possible
[14:18:30] <Lucas_WMDE>	 happy to do it then
[14:18:36] <wikibugs>	 SRE, HyperSwitch, RESTBase, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Joe) @Ladsgroup I think the log linked by @TheresNoTime is a typical example of a distributed transaction going wrong:  * We start...
[14:18:48] <urbanecm>	 Dreamy_Jazz: volunteer / staff acc?
[14:18:49] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T348183)', diff saved to https://phabricator.wikimedia.org/P54102 and previous config saved to /var/cache/conftool/dbconfig/20231204-141848-arnaudb.json
[14:18:50] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[14:18:53] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[14:18:58] <wikibugs>	 (CR) AOkoth: [C: +1] vrts: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/979364 (owner: Muehlenhoff)
[14:19:03] <urbanecm>	 or either is fine?
[14:19:04] <Dreamy_Jazz>	 Volunteer probably best just as I'll have recent actions for that account.
[14:19:05] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[14:19:08] <Dreamy_Jazz>	 Either is fine though.
[14:19:31] <wikibugs>	 (PS3) Lucas Werkmeister (WMDE): Enable read new for event tables migration on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829) (owner: Dreamy Jazz)
[14:19:38] <wikibugs>	 (CR) Tacsipacsi: Bump ParserCache TTL back to 30 days (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604) (owner: Ladsgroup)
[14:19:42] <wikibugs>	 SRE-tools, Dumps-Generation, Infrastructure-Foundations, serviceops, and 2 others: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (akosiaris) >>! In T271142#9378778, @Volans wrote: > @akosiaris sure, and having a cluster deemed as *not*...
[14:19:46] <urbanecm>	 Dreamy_Jazz: granted for an hour
[14:19:49] <Dreamy_Jazz>	 Thanks!
[14:19:57] * Lucas_WMDE looks at diffConfig
[14:20:43] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829) (owner: Dreamy Jazz)
[14:21:09] <wikibugs>	 (PS1) Btullis: Prevent removal of python2 on hadoop coordinators [puppet] - https://gerrit.wikimedia.org/r/979973 (https://phabricator.wikimedia.org/T336045)
[14:21:23] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] centralserver: reintroduce tls-remedy for centralserver [puppet] - https://gerrit.wikimedia.org/r/979108 (https://phabricator.wikimedia.org/T351710) (owner: Filippo Giunchedi)
[14:21:25] <wikibugs>	 (Merged) jenkins-bot: Enable read new for event tables migration on testwiki [mediawiki-config] - https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829) (owner: Dreamy Jazz)
[14:21:35] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[14:21:41] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:979914|Enable read new for event tables migration on testwiki (T341829)]]
[14:21:47] <stashbot>	 T341829: Enable read new for the event table migration - https://phabricator.wikimedia.org/T341829
[14:21:49] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[14:21:49] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (MatthewVernon)
[14:22:22] <wikibugs>	 (PS2) Btullis: Prevent removal of python2 on hadoop coordinators [puppet] - https://gerrit.wikimedia.org/r/979973 (https://phabricator.wikimedia.org/T336045)
[14:22:58] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 dreamyjazz and lucaswerkmeister-wmde: Backport for [[gerrit:979914|Enable read new for event tables migration on testwiki (T341829)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:23:09] <wikibugs>	 SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Ladsgroup) Yup, after looking at logs properly, it's clear.
[14:23:18] <wikibugs>	 (CR) Nikerabbit: [V: +2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - https://gerrit.wikimedia.org/r/979926 (owner: L10n-bot)
[14:23:20] <Dreamy_Jazz>	 Testing now.
[14:23:44] <Lucas_WMDE>	 ok
[14:23:56] <wikibugs>	 (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/813/console" [puppet] - https://gerrit.wikimedia.org/r/979973 (https://phabricator.wikimedia.org/T336045) (owner: Btullis)
[14:24:15] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[14:24:29] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance
[14:24:30] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Prevent removal of python2 on hadoop coordinators [puppet] - https://gerrit.wikimedia.org/r/979973 (https://phabricator.wikimedia.org/T336045) (owner: Btullis)
[14:24:50] <Dreamy_Jazz>	 Test complete and successful.
[14:25:13] <Dreamy_Jazz>	 Ran a few checks on my own account.
[14:25:24] <Lucas_WMDE>	 alright
[14:25:25] <Lucas_WMDE>	 thanks!
[14:25:26] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 dreamyjazz and lucaswerkmeister-wmde: Continuing with sync
[14:26:25] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): [C: +1] Create new namespaces and namespace aliases for bd.wikimedia.org (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/977196 (https://phabricator.wikimedia.org/T351903) (owner: MdsShakil)
[14:26:40] <Lucas_WMDE>	 MdsShakil: ^ left a suggestion on your change
[14:26:51] <Lucas_WMDE>	 but otherwise it should be okay to deploy once this backport is done
[14:26:59] <Lucas_WMDE>	 *config change
[14:27:33] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance
[14:27:42] <MdsShakil>	 I think it's not necessary, since it's already mentioned on the current task
[14:27:47] <MdsShakil>	 Lucas_WMDE
[14:27:48] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance
[14:27:55] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T348183)', diff saved to https://phabricator.wikimedia.org/P54103 and previous config saved to /var/cache/conftool/dbconfig/20231204-142754-arnaudb.json
[14:27:58] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[14:27:59] <Lucas_WMDE>	 sure, not necessary
[14:28:02] <Lucas_WMDE>	 but still nice imho :)
[14:28:24] <Lucas_WMDE>	 if I want to know when the Photowalk namespace was established, it would be nice to have the older task ID there directly
[14:28:51] <Lucas_WMDE>	 but if you don’t want to add it I can live with that ^^
[14:29:25] <MdsShakil>	 Lucas_WMDE you can do it :)
[14:29:34] <Lucas_WMDE>	 hm, ok ^^
[14:29:37] * Lucas_WMDE downloads the change
[14:29:52] <wikibugs>	 (CR) Ladsgroup: Bump ParserCache TTL back to 30 days (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604) (owner: Ladsgroup)
[14:30:30] <wikibugs>	 (PS8) Lucas Werkmeister (WMDE): Create new namespaces and namespace aliases for bd.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/977196 (https://phabricator.wikimedia.org/T351903) (owner: MdsShakil)
[14:30:39] <wikibugs>	 (CR) Lucas Werkmeister (WMDE): Create new namespaces and namespace aliases for bd.wikimedia.org (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/977196 (https://phabricator.wikimedia.org/T351903) (owner: MdsShakil)
[14:31:26] <wikibugs>	 SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Ladsgroup) I have to go to a meeting, if someone is willing to reproduce the issue in mwdebug while verbose log (there is an option for it in x-debug) is enabled...
[14:32:03] <wikibugs>	 (CR) Klausman: [V: +2 C: +2] hiera: clean up more ORES leftovers [labs/private] - https://gerrit.wikimedia.org/r/979915 (https://phabricator.wikimedia.org/T347278) (owner: Klausman)
[14:32:08] <logmsgbot>	 !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons.
[14:32:23] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:979914|Enable read new for event tables migration on testwiki (T341829)]] (duration: 10m 42s)
[14:32:27] <stashbot>	 T341829: Enable read new for the event table migration - https://phabricator.wikimedia.org/T341829
[14:33:07] <wikibugs>	 (CR) Ssingh: [C: +2] wikimedia.org: add 1Password site verification [dns] - https://gerrit.wikimedia.org/r/979421 (https://phabricator.wikimedia.org/T352579) (owner: Ssingh)
[14:33:28] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/977196 (https://phabricator.wikimedia.org/T351903) (owner: MdsShakil)
[14:33:37] <sukhe>	 !log running authdns-update for T352579
[14:33:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:42] <stashbot>	 T352579: Update DNS records for 1Password - https://phabricator.wikimedia.org/T352579
[14:34:11] <wikibugs>	 (Merged) jenkins-bot: Create new namespaces and namespace aliases for bd.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/977196 (https://phabricator.wikimedia.org/T351903) (owner: MdsShakil)
[14:34:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:34:25] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:977196|Create new namespaces and namespace aliases for bd.wikimedia.org (T351903)]]
[14:34:29] <stashbot>	 T351903: Create new namespaces and namespace aliases for bd.wikimedia.org - https://phabricator.wikimedia.org/T351903
[14:36:10] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and mdsshakil: Backport for [[gerrit:977196|Create new namespaces and namespace aliases for bd.wikimedia.org (T351903)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:36:31] <Lucas_WMDE>	 MdsShakil: the change should be live on one of the mwdebug servers, can you test it there?
[14:36:50] <MdsShakil>	 Lucas_WMDE yah, testing 
[14:36:52] <wikibugs>	 SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Wbm1058) I've gotten this error twice, when trying to make the same simple edit to a page  A database query error has occurred. This may indicate a bug in the sof...
[14:37:06] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm, at least that should be what puppet agent is missing on contint hosts" [puppet] - https://gerrit.wikimedia.org/r/979943 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[14:37:35] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4038.ulsfo.wmnet
[14:38:39] <wikibugs>	 (PS1) Muehlenhoff: Switch cp4038 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/979975 (https://phabricator.wikimedia.org/T349619)
[14:39:04] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:39:18] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch cp4038 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/979975 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[14:40:07] <MdsShakil>	 Lucas_WMDE looks good to me 
[14:40:14] <Lucas_WMDE>	 cool, thanks!
[14:40:16] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and mdsshakil: Continuing with sync
[14:41:47] <wikibugs>	 (PS1) Ssingh: wikimedia.org: remove already verified jamf TXT record [dns] - https://gerrit.wikimedia.org/r/979976 (https://phabricator.wikimedia.org/T349665)
[14:43:07] <wikibugs>	 (CR) Ssingh: [C: +2] wikimedia.org: remove already verified jamf TXT record [dns] - https://gerrit.wikimedia.org/r/979976 (https://phabricator.wikimedia.org/T349665) (owner: Ssingh)
[14:43:30] <sukhe>	 !log running authdns-update for CR 979976 [revert of T349665]
[14:43:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:34] <stashbot>	 T349665: Update DNS for Jamf account SSO - https://phabricator.wikimedia.org/T349665
[14:43:37] <wikibugs>	 (CR) Klausman: [C: +2] profiles: Remove more ORES leftovers [puppet] - https://gerrit.wikimedia.org/r/979916 (https://phabricator.wikimedia.org/T347278) (owner: Klausman)
[14:44:06] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4038.ulsfo.wmnet
[14:46:14] <logmsgbot>	 !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:977196|Create new namespaces and namespace aliases for bd.wikimedia.org (T351903)]] (duration: 11m 48s)
[14:46:18] <stashbot>	 T351903: Create new namespaces and namespace aliases for bd.wikimedia.org - https://phabricator.wikimedia.org/T351903
[14:46:43] <Lucas_WMDE>	 !log UTC afternoon backport+config window done
[14:46:44] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4046.ulsfo.wmnet
[14:46:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:10] <MdsShakil>	 Lucas_WMDE namespaceDupes?
[14:47:32] <Lucas_WMDE>	 ah
[14:47:33] <Lucas_WMDE>	 good point
[14:47:42] <wikibugs>	 (PS2) Hnowlan: jobqueue: switch a medium weight job [deployment-charts] - https://gerrit.wikimedia.org/r/979395 (https://phabricator.wikimedia.org/T349796)
[14:47:44] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1076.eqiad.wmnet with OS bullseye
[14:47:50] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1076.eqiad.wmnet with OS bullseye
[14:47:59] <wikibugs>	 (PS1) Muehlenhoff: Switch cp4046 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/979978 (https://phabricator.wikimedia.org/T349619)
[14:48:05] <Lucas_WMDE>	 ah. “Unsafe to run at this time. See: T350443”
[14:48:05] <stashbot>	 T350443: namespaceDupes.php doesn't have limit on write queries - https://phabricator.wikimedia.org/T350443
[14:48:56] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch cp4046 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/979978 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[14:48:58] <MdsShakil>	 Task seems resolved 
[14:49:20] <Lucas_WMDE>	 yeah, which is unfortunate
[14:49:32] <Lucas_WMDE>	 given that the revert reenabling the script won’t be deployed for another week
[14:49:34] <Lucas_WMDE>	 (no train this week)
[14:50:23] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1077
[14:50:24] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1077
[14:50:32] <Lucas_WMDE>	 would be nice if I could at least dry-run the script
[14:50:37] <Lucas_WMDE>	 but it was disabled too forcefully for that
[14:51:03] <vgutierrez>	 !log upload tcp-mss-clamper 0.4 to apt.wm.o (bookworm)
[14:51:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:06] <icinga-wm>	 PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:52:08] <MdsShakil>	 Lucas_WMDE so we need to wait until it's fully resolved
[14:52:17] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] hieradata: update kubernetes::clusters in CI [puppet] - https://gerrit.wikimedia.org/r/979943 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[14:53:03] <Lucas_WMDE>	 I’m trying to see if there’s any way to run the SELECT queries without the script, at least
[14:53:19] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4046.ulsfo.wmnet
[14:54:04] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:54:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:54:24] <wikibugs>	 (PS4) Brouberol: Define a DNS A record for the dse k8s ingress gateway [dns] - https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639)
[14:54:55] <Lucas_WMDE>	 hmph, 62 rows
[14:55:37] <godog>	 jelto: we're back re: contint, puppet runs
[14:59:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:59:29] <Lucas_WMDE>	 MdsShakil: I dumped the titles on the task, not much more that can be done at the moment I think
[14:59:38] <Lucas_WMDE>	 unless you want to revert the config change
[15:00:19] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:00:45] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[15:01:14] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1076']
[15:02:01] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1077']
[15:02:15] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1078']
[15:02:25] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1077']
[15:02:29] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1077']
[15:03:29] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1079']
[15:04:02] <MdsShakil>	 Lucas_WMDE I think we can keep the patch and fix the dupes issue later
[15:04:11] <Lucas_WMDE>	 alright
[15:06:29] <wikibugs>	 SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (TheresNoTime) Got a verbose log for `[e7bc3819-b052-43a3-a9e2-438ae9d4b38f] 2023-12-04 15:01:09: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"`, on artic...
[15:08:09] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be1078']
[15:08:16] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be1077']
[15:09:24] <jelto>	 godog: yes puppet is happy again, thanks!
[15:09:35] <godog>	 sure np
[15:11:23] <wikibugs>	 (CR) David Caro: [C: +2] quarry: use github remote [puppet] - https://gerrit.wikimedia.org/r/965514 (https://phabricator.wikimedia.org/T348748) (owner: Vivian Rook)
[15:12:30] <wikibugs>	 (PS2) Jelto: add wmf-debci image [docker-images/production-images] - https://gerrit.wikimedia.org/r/979355 (https://phabricator.wikimedia.org/T352003)
[15:12:53] <wikibugs>	 (CR) Herron: [C: +2] thanos-query: enable auto-downsampling [puppet] - https://gerrit.wikimedia.org/r/979163 (owner: Herron)
[15:12:53] <wikibugs>	 SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (A455bcd9) I got 3 different error messages multiple times today while editing:     - "Server returned error: HTTP 500."   - "[XXXX-XXX-XXX-XXX-XXX] Caught excepti...
[15:13:16] <wikibugs>	 (CR) Clément Goubert: [C: +1] jobqueue: switch a medium weight job [deployment-charts] - https://gerrit.wikimedia.org/r/979395 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[15:13:20] <wikibugs>	 (PS5) Brouberol: Enable ingress for the spark-history server services via the dse ingress gw [dns] - https://gerrit.wikimedia.org/r/979892 (https://phabricator.wikimedia.org/T352639)
[15:16:52] <wikibugs>	 (CR) Vgutierrez: [C: +2] lvs::realserver::ipip: Check that TCP MSS clamping is working [puppet] - https://gerrit.wikimedia.org/r/977696 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[15:18:27] <wikibugs>	 (CR) Jelto: [V: +2 C: +2] add wmf-debci image [docker-images/production-images] - https://gerrit.wikimedia.org/r/979355 (https://phabricator.wikimedia.org/T352003) (owner: Jelto)
[15:20:28] <wikibugs>	 (PS1) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[15:20:58] <wikibugs>	 (CR) CI reject: [V: -1] wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[15:20:58] <wikibugs>	 SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (TheresNoTime)
[15:21:08] <wikibugs>	 SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (TheresNoTime)
[15:21:17] <wikibugs>	 (CR) Hnowlan: [C: +2] jobqueue: switch a medium weight job [deployment-charts] - https://gerrit.wikimedia.org/r/979395 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[15:22:06] <wikibugs>	 (Merged) jenkins-bot: jobqueue: switch a medium weight job [deployment-charts] - https://gerrit.wikimedia.org/r/979395 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[15:22:47] <wikibugs>	 (PS2) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[15:26:02] <wikibugs>	 SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Yann) It happened again  `[80340636-5581-4e19-a4ce-a0a6b2a7215e] 2023-12-04 15:23:08: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"`
[15:28:27] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T348183)', diff saved to https://phabricator.wikimedia.org/P54104 and previous config saved to /var/cache/conftool/dbconfig/20231204-152826-arnaudb.json
[15:28:31] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[15:29:13] <wikibugs>	 (PS3) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[15:29:22] <wikibugs>	 (PS4) Jcrespo: Implement batch deletion, restoration and query of files [software/mediabackups] - https://gerrit.wikimedia.org/r/979919 (https://phabricator.wikimedia.org/T352655)
[15:29:25] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[15:30:37] <wikibugs>	 SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Ladsgroup) At least for TVB, I can't reproduce it anymore: https://en.wikipedia.org/w/index.php?title=TVB_(disambiguation)&action=history Can someone give me a re...
[15:32:03] <wikibugs>	 SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Ladsgroup) nvm got it.
[15:32:25] <wikibugs>	 (PS5) Jcrespo: Implement batch deletion, restoration and query of files [software/mediabackups] - https://gerrit.wikimedia.org/r/979919 (https://phabricator.wikimedia.org/T352655)
[15:34:25] <wikibugs>	 SRE, Cloud-VPS, observability, Patch-For-Review, and 2 others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (fgiunchedi) Current situation:  * We have a separate `rsyslog-receiver` unit/instance with only the receiver bits on centrallog hosts * The fleet is runni...
[15:35:18] <wikibugs>	 (PS1) Vgutierrez: hiera: Disable rp_filter on ncredir@eqsin [puppet] - https://gerrit.wikimedia.org/r/979984 (https://phabricator.wikimedia.org/T351069)
[15:35:20] <wikibugs>	 (PS1) Vgutierrez: hiera: Enable IPIP encapsulation on ncredir@eqsin [puppet] - https://gerrit.wikimedia.org/r/979985 (https://phabricator.wikimedia.org/T351069)
[15:36:56] <wikibugs>	 (PS1) Dreamy Jazz: Enable read new on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/979986 (https://phabricator.wikimedia.org/T341829)
[15:38:25] <wikibugs>	 (PS23) D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366)
[15:38:51] <wikibugs>	 (PS6) Jcrespo: Implement batch deletion, restoration and query of files [software/mediabackups] - https://gerrit.wikimedia.org/r/979919 (https://phabricator.wikimedia.org/T352655)
[15:39:14] <wikibugs>	 SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Ladsgroup) `  Expectation (writeQueryTime <= 1) by MediaWiki::main not met (actual: 7.6661319732666) in trx #1701ce9c66: role-primary: SELECT page_latest FROM `pa...
[15:40:18] <wikibugs>	 SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (NightWolf1223) This is also happening on https://en.wikipedia.org/wiki/CDDA with the following error: ` [0458d586-c21c-4c1b-bc95-35edbaabe49d] 2023-12-04 15:33:32...
[15:41:24] <wikibugs>	 (PS4) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[15:43:33] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P54105 and previous config saved to /var/cache/conftool/dbconfig/20231204-154333-arnaudb.json
[15:45:15] <wikibugs>	 10SRE, 10Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (10Ladsgroup) Hi, We get a log error for each one of these, we see them and I'm investigating. No need to paste them here anymore. Thanks!
[15:45:47] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[15:46:05] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:46:33] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[15:47:14] <wikibugs>	 (03PS5) 10Bking: wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[15:47:18] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[15:47:35] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[15:47:44] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:48:15] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[15:48:40] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:48:52] <wikibugs>	 10SRE, 10Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (10matmarex) The errors increased sharply around 6:30 UTC today: (searching for `exception.class` `Wikimedia\Rdbms\DBTransactionSizeError`)  https://logstash.wikimed...
[15:49:09] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/814/con" [puppet] - 10https://gerrit.wikimedia.org/r/979984 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[15:49:27] <wikibugs>	 (03CR) 10Vgutierrez: hiera: Disable rp_filter on ncredir@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/979984 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[15:50:46] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/815/con" [puppet] - 10https://gerrit.wikimedia.org/r/979985 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[15:50:52] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] prometheus::sysctl: Support configurable sysctls [puppet] - 10https://gerrit.wikimedia.org/r/979297 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[15:51:36] <wikibugs>	 10SRE, 10Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (10Ladsgroup) The underlying issue is that locking any row in page table is extremely slow now, this one took 7 seconds: https://logstash.wikimedia.org/app/discover#...
[15:52:54] <wikibugs>	 (03PS6) 10Bking: wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[15:52:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:53:40] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[15:54:59] <wikibugs>	 (03PS7) 10Bking: wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[15:55:28] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[15:55:31] <wikibugs>	 (03PS1) 10Awight: [beta] Enable FileImporter Codex mode on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979988
[15:56:14] <wikibugs>	 (03PS8) 10Bking: wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[15:56:56] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1076.eqiad.wmnet with OS bullseye
[15:57:34] <wikibugs>	 10SRE, 10Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (10Ladsgroup) innodb_lock_row_wait on master of s1 has skyrocketed but unlike spacex rockets is not going down: https://grafana.wikimedia.org/d/000000273/mysql?orgId...
[15:57:35] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmf for mcastro-wmf - https://phabricator.wikimedia.org/T352684 (10Mcastro)
[15:58:40] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P54107 and previous config saved to /var/cache/conftool/dbconfig/20231204-155840-arnaudb.json
[15:58:46] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] hiera: Enable IPIP encapsulation on ncredir@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/979985 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[16:02:06] <wikibugs>	 10SRE, 10Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (10Ladsgroup) a:03Ladsgroup We made a lot of progress.
[16:02:48] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[16:02:59] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:03:39] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "Looks coherent with I24cf4fce8ba2f6517dfe343ea2c127cd26195712" [puppet] - 10https://gerrit.wikimedia.org/r/979985 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[16:03:59] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "Looks coherent with I6720e89360c9026ea26a77601d5f490d347a6cba" [puppet] - 10https://gerrit.wikimedia.org/r/979984 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[16:04:44] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Hide the client IP address in the SMTP Received header for authenticated relay clients - https://phabricator.wikimedia.org/T317574 (10jhathaway) 05Open→03Resolved
[16:04:48] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (10jhathaway)
[16:05:09] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] hiera: Disable rp_filter on ncredir@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/979984 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[16:05:48] <wikibugs>	 (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979990 (https://phabricator.wikimedia.org/T128546)
[16:07:41] <wikibugs>	 (03CR) 10Phuedx: Define the corresponding stream for scroll (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: 10Kimberly Sarabia)
[16:07:53] <wikibugs>	 (03CR) 10Phuedx: [C: 03+1] Define the corresponding stream for scroll [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: 10Kimberly Sarabia)
[16:11:36] <wikibugs>	 (03CR) 10Svantje Lilienthal: [C: 03+1] [beta] Enable FileImporter Codex mode on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979988 (owner: 10Awight)
[16:12:22] <wikibugs>	 (03PS2) 10Svantje Lilienthal: [beta] Enable FileImporter Codex mode on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979988 (https://phabricator.wikimedia.org/T348759) (owner: 10Awight)
[16:13:47] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T348183)', diff saved to https://phabricator.wikimedia.org/P54108 and previous config saved to /var/cache/conftool/dbconfig/20231204-161346-arnaudb.json
[16:13:49] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance
[16:13:53] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[16:14:02] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance
[16:14:09] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T348183)', diff saved to https://phabricator.wikimedia.org/P54109 and previous config saved to /var/cache/conftool/dbconfig/20231204-161408-arnaudb.json
[16:14:46] <wikibugs>	 (03Abandoned) 10Peter Fischer: enable page_rerender for commonswiki, frwiki, itwiki, and wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979970 (owner: 10Peter Fischer)
[16:15:05] <wikibugs>	 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Make Spicerack cookbook to resize ganeti VM - https://phabricator.wikimedia.org/T219454 (10MoritzMuehlenhoff) 05Open→03Declined This is a rare operation and basically only requires running a straightforward CLI command (followed by running sre.ganeti.r...
[16:15:07] <wikibugs>	 10SRE-tools, 10Ganeti, 10Infrastructure-Foundations: Cookbooks for Ganeti maintenance tasks - https://phabricator.wikimedia.org/T283319 (10MoritzMuehlenhoff)
[16:16:55] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] hiera: Enable IPIP encapsulation on ncredir@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/979985 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[16:17:02] <wikibugs>	 (03PS2) 10Vgutierrez: hiera: Enable IPIP encapsulation on ncredir@eqsin [puppet] - 10https://gerrit.wikimedia.org/r/979985 (https://phabricator.wikimedia.org/T351069)
[16:17:27] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10homer: Add Homer support to Cookbooks - https://phabricator.wikimedia.org/T265342 (10ayounsi) 05Open→03Invalid Hello past me, not needed anymore.
[16:19:45] <wikibugs>	 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations: Tracking task for DCOps privileged commands - https://phabricator.wikimedia.org/T233685 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This was handled in various other tasks.
[16:19:53] <wikibugs>	 (03PS1) 10Vgutierrez: hiera: Enable IPIP on eqsin text|secondary LVS [puppet] - 10https://gerrit.wikimedia.org/r/979994 (https://phabricator.wikimedia.org/T351069)
[16:20:05] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T348183)', diff saved to https://phabricator.wikimedia.org/P54110 and previous config saved to /var/cache/conftool/dbconfig/20231204-162005-arnaudb.json
[16:20:12] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[16:20:29] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Python3-Porting: Puppet: forbid new Python2 code - https://phabricator.wikimedia.org/T197804 (10joanna_borun) 05Open→03Invalid
[16:20:54] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (10Vgutierrez)
[16:21:28] <wikibugs>	 (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - 10https://gerrit.wikimedia.org/r/979994 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[16:22:08] <wikibugs>	 (03CR) 10Peter Fischer: [C: 03+1] "The changes to the kafka topic won't be applied, see https://phabricator.wikimedia.org/T351503" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979155 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson)
[16:22:11] <wikibugs>	 (03PS2) 10Vgutierrez: hiera: Enable IPIP on eqsin text|secondary LVS [puppet] - 10https://gerrit.wikimedia.org/r/979994 (https://phabricator.wikimedia.org/T351069)
[16:22:28] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Python3-Porting: Python2: track Py2 softwares - https://phabricator.wikimedia.org/T197803 (10MoritzMuehlenhoff) 05Open→03Declined Bookworm no longer includes Python 2 at all and in Bullseye Python gets uninstalled unless one sets an explicit Hiera flag to keep...
[16:22:53] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Python3-Porting: Puppet: forbid new Python2 code - https://phabricator.wikimedia.org/T197804 (10MoritzMuehlenhoff) Bookworm no longer includes Python 2 at all and in Bullseye Python gets uninstalled unless one sets an explicit Hiera flag to keep it (pybal e.g.), w...
[16:24:44] <wikibugs>	 (03PS3) 10Svantje Lilienthal: [beta] Enable FileImporter Codex mode on the beta cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979988 (https://phabricator.wikimedia.org/T347453) (owner: 10Awight)
[16:25:10] <wikibugs>	 (03CR) 10Ssingh: [C: 03+1] hiera: Enable IPIP on eqsin text|secondary LVS [puppet] - 10https://gerrit.wikimedia.org/r/979994 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez)
[16:29:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:29:58] <wikibugs>	 (03PS9) 10Bking: wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[16:30:04] <jouncebot>	 jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T1630).
[16:30:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[16:34:07] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations, 10Orchestrator: Add database host removal from Orchestrator to sre.hosts.decommission cookbook - https://phabricator.wikimedia.org/T287954 (10Volans) p:05Triage→03Low @Marostegui is this request still valid/needed? If we are going to add these steps I would need...
[16:34:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:35:07] <wikibugs>	 (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979990 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[16:35:12] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P54111 and previous config saved to /var/cache/conftool/dbconfig/20231204-163511-arnaudb.json
[16:35:20] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Long timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (10Volans) It seems that the current defaults are generally working fine. @fgiunchedi have you encountered any specific issue in the last ~2y that still requ...
[16:35:30] <wikibugs>	 10SRE-tools, 10Infrastructure-Foundations: Long timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (10Volans) p:05Triage→03Low
[16:35:51] <wikibugs>	 (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979990 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak)
[16:39:02] <wikibugs>	 (03CR) 10Clare Ming: [C: 03+1] Define the corresponding stream for scroll (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: 10Kimberly Sarabia)
[16:39:44] <wikibugs>	 (03PS10) 10Bking: wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[16:40:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[16:41:50] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:42:30] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:42:38] <wikibugs>	 (03PS2) 10Dreamy Jazz: MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979969 (owner: 10Kosta Harlan)
[16:43:25] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979969 (owner: 10Kosta Harlan)
[16:43:56] <wikibugs>	 (03CR) 10Dreamy Jazz: MediaModeration: Set MediaModerationDeveloperMode to false (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979969 (owner: 10Kosta Harlan)
[16:44:52] <logmsgbot>	 !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:979990| Bumping portals to master (T128546)]] (duration: 06m 40s)
[16:44:55] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[16:46:37] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 33604
[16:47:50] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 33604
[16:48:20] <icinga-wm>	 PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:49:10] <wikibugs>	 (03PS1) 10Elukey: slo_template: update SLO sliding window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/980000
[16:49:25] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[16:50:19] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P54112 and previous config saved to /var/cache/conftool/dbconfig/20231204-165018-arnaudb.json
[16:52:38] <logmsgbot>	 !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:979990| Bumping portals to master (T128546)]] (duration: 07m 45s)
[16:52:41] <stashbot>	 T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546
[16:53:52] <wikibugs>	 (03PS3) 10Kosta Harlan: MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979969
[16:54:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979969 (owner: 10Kosta Harlan)
[16:54:50] <wikibugs>	 (03PS1) 10Ilias Sarantopoulos: ml-services: fix rest gateway endpoint creation in article descriptions [deployment-charts] - 10https://gerrit.wikimedia.org/r/980002 (https://phabricator.wikimedia.org/T351940)
[16:54:53] <wikibugs>	 (03CR) 10Kosta Harlan: MediaModeration: Set MediaModerationDeveloperMode to false (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979969 (owner: 10Kosta Harlan)
[16:55:20] <wikibugs>	 (03CR) 10Herron: [C: 03+1] "thanks!" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/980000 (owner: 10Elukey)
[16:55:40] <wikibugs>	 (03PS11) 10Bking: wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[16:55:55] <wikibugs>	 (03PS4) 10Kosta Harlan: MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979969
[16:56:00] <wikibugs>	 (03PS5) 10Kosta Harlan: MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979969
[16:56:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:56:17] <wikibugs>	 (03PS1) 10Elukey: slo_definitions: restrict Lift Wing metrics with one extr label [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/980004 (https://phabricator.wikimedia.org/T351390)
[16:56:50] <wikibugs>	 (03CR) 10Elukey: [V: 03+2 C: 03+2] slo_template: update SLO sliding window [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/980000 (owner: 10Elukey)
[16:57:21] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] ml-services: fix rest gateway endpoint creation in article descriptions [deployment-charts] - 10https://gerrit.wikimedia.org/r/980002 (https://phabricator.wikimedia.org/T351940) (owner: 10Ilias Sarantopoulos)
[16:58:37] <wikibugs>	 (03CR) 10Bking: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[16:59:27] <wikibugs>	 (03PS12) 10Bking: wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[17:01:55] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[17:05:25] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T348183)', diff saved to https://phabricator.wikimedia.org/P54113 and previous config saved to /var/cache/conftool/dbconfig/20231204-170525-arnaudb.json
[17:05:28] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance
[17:05:30] <wikibugs>	 (03CR) 10Kevin Bazira: [C: 03+1] ml-services: fix rest gateway endpoint creation in article descriptions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980002 (https://phabricator.wikimedia.org/T351940) (owner: 10Ilias Sarantopoulos)
[17:05:34] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[17:05:42] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance
[17:05:43] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[17:05:58] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[17:06:05] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T348183)', diff saved to https://phabricator.wikimedia.org/P54114 and previous config saved to /var/cache/conftool/dbconfig/20231204-170604-arnaudb.json
[17:07:24] <icinga-wm>	 RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:07:32] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.297 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:08:18] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51007 bytes in 0.265 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[17:08:29] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1077.eqiad.wmnet with OS bullseye
[17:08:35] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1077.eqiad.wmnet with OS bullseye
[17:09:02] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1078.eqiad.wmnet with OS bullseye
[17:09:07] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T348183)', diff saved to https://phabricator.wikimedia.org/P54115 and previous config saved to /var/cache/conftool/dbconfig/20231204-170906-arnaudb.json
[17:09:08] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1078.eqiad.wmnet with OS bullseye
[17:09:11] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1076']
[17:09:30] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1076']
[17:09:40] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1076']
[17:11:11] <wikibugs>	 (03Abandoned) 10Elukey: slo_definitions: restrict Lift Wing metrics with one extr label [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/980004 (https://phabricator.wikimedia.org/T351390) (owner: 10Elukey)
[17:11:43] <wikibugs>	 (03PS1) 10Jforrester: nlwikivoyage: Drop Listings extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980009 (https://phabricator.wikimedia.org/T352696)
[17:12:13] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1076']
[17:12:48] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be1076']
[17:13:48] <wikibugs>	 (03CR) 10Dreamy Jazz: [C: 03+1] MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979969 (owner: 10Kosta Harlan)
[17:14:01] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be1079']
[17:14:17] <wikibugs>	 (03PS1) 10Ladsgroup: Category: Stop locking thousands of rows [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979692 (https://phabricator.wikimedia.org/T352628)
[17:15:04] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Category: Stop locking thousands of rows [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979692 (https://phabricator.wikimedia.org/T352628) (owner: 10Ladsgroup)
[17:15:21] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1079']
[17:15:23] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10Dzahn) Thanks for responding @Xqt. Yes, it's possible to not publish the real name. We will just use "known to legal" in the realname field in the repo. Thanks for confirmin...
[17:15:39] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1079']
[17:15:44] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1079']
[17:15:52] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1079']
[17:16:05] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1079']
[17:16:08] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1079']
[17:18:06] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1079']
[17:18:30] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: ml-services: fix rest gateway endpoint creation in article descriptions (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/980002 (https://phabricator.wikimedia.org/T351940) (owner: 10Ilias Sarantopoulos)
[17:18:33] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be1079']
[17:18:53] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1076']
[17:19:11] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979692 (https://phabricator.wikimedia.org/T352628) (owner: 10Ladsgroup)
[17:19:15] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be1076']
[17:20:50] <Amir1>	 jouncebot: nowandnext
[17:20:50] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 39 minute(s)
[17:20:50] <jouncebot>	 In 0 hour(s) and 39 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T1800)
[17:20:50] <jouncebot>	 In 0 hour(s) and 39 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T1800)
[17:21:05] <wikibugs>	 (PS1) Hnowlan: mw-jobrunner: increase replicas [deployment-charts] - https://gerrit.wikimedia.org/r/980011 (https://phabricator.wikimedia.org/T349796)
[17:24:13] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P54116 and previous config saved to /var/cache/conftool/dbconfig/20231204-172413-arnaudb.json
[17:25:30] <wikibugs>	 SRE, Patch-For-Review, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (aaron) One thing to also fix here is that things like SELECT FOR UPDATE, SELECT GET_LOCK()...any SELECT really...should be exempted from the...
[17:26:09] <wikibugs>	 SRE, Patch-For-Review, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (aaron) One thing to fix here is that SELECT FOR UPDATE should be except from the transaction size check in approvePrimaryChanges(). There is...
[17:26:35] <wikibugs>	 (PS1) Dzahn: admin: add user xqt to ldap_only admins, volunteer NDA [puppet] - https://gerrit.wikimedia.org/r/980013 (https://phabricator.wikimedia.org/T348520)
[17:27:11] <wikibugs>	 (CR) CI reject: [V: -1] admin: add user xqt to ldap_only admins, volunteer NDA [puppet] - https://gerrit.wikimedia.org/r/980013 (https://phabricator.wikimedia.org/T348520) (owner: Dzahn)
[17:28:06] <wikibugs>	 (PS2) Dzahn: admin: add user xqt to ldap_only admins, volunteer NDA [puppet] - https://gerrit.wikimedia.org/r/980013 (https://phabricator.wikimedia.org/T348520)
[17:33:07] <wikibugs>	 (Merged) jenkins-bot: Category: Stop locking thousands of rows [core] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979692 (https://phabricator.wikimedia.org/T352628) (owner: Ladsgroup)
[17:33:18] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:979692|Category: Stop locking thousands of rows (T352628)]]
[17:33:21] <stashbot>	 T352628: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628
[17:34:48] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:979692|Category: Stop locking thousands of rows (T352628)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[17:34:58] <wikibugs>	 (CR) Kevin Bazira: [C: +1] ml-services: fix rest gateway endpoint creation in article descriptions (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/980002 (https://phabricator.wikimedia.org/T351940) (owner: Ilias Sarantopoulos)
[17:35:06] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests, Patch-For-Review: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (Dzahn) a:Dzahn
[17:35:33] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Continuing with sync
[17:39:20] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P54117 and previous config saved to /var/cache/conftool/dbconfig/20231204-173919-arnaudb.json
[17:39:36] <wikibugs>	 (CR) Ssingh: [C: +1] "Verified the task linked and with dzahn." [puppet] - https://gerrit.wikimedia.org/r/980013 (https://phabricator.wikimedia.org/T348520) (owner: Dzahn)
[17:41:16] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:41:25] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:979692|Category: Stop locking thousands of rows (T352628)]] (duration: 08m 07s)
[17:41:28] <stashbot>	 T352628: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628
[17:46:16] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[17:46:42] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests, Patch-For-Review: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (Dzahn) >>! In T348520#9335183, @KFrancis wrote: > Hi all, I was finally granted access to see the signature confirmation page.  I can confirm https://p...
[17:46:48] <wikibugs>	 SRE, Patch-For-Review, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Ladsgroup) Right after the patch deployment, contention went to basically zero {F41560311}  https://grafana.wikimedia.org/d/000000273/mysql?...
[17:54:26] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T348183)', diff saved to https://phabricator.wikimedia.org/P54118 and previous config saved to /var/cache/conftool/dbconfig/20231204-175426-arnaudb.json
[17:54:28] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance
[17:54:31] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[17:54:42] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance
[17:54:44] <wikibugs>	 SRE, LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T352653 (Dzahn) Hi @ArthurTaylor could you please send an email to @KFrancis https://meta.wikimedia.org/wiki/User:KFrancis_(WMF) ofthe Legal department to proceed with the NDA signing? Just so she go...
[17:54:49] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54119 and previous config saved to /var/cache/conftool/dbconfig/20231204-175448-arnaudb.json
[17:55:15] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1078.eqiad.wmnet with OS bullseye
[17:55:24] <wikibugs>	 SRE, LDAP-Access-Requests: Request to be added to the ldap/wmde group for ArthurTaylor - https://phabricator.wikimedia.org/T352653 (Dzahn)
[17:58:01] <wikibugs>	 (PS1) Brion VIBBER: Always load transcode state from db when opting in to primary db [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979693 (https://phabricator.wikimedia.org/T200939)
[17:59:15] <wikibugs>	 (PS2) Brion VIBBER: Always load transcode state from db when opting in to primary db [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979693 (https://phabricator.wikimedia.org/T200939)
[17:59:56] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1076.eqiad.wmnet with OS bullseye
[18:00:03] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1076.eqiad.wmnet with OS bullseye
[18:00:04] <jouncebot>	 Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T1800)
[18:00:05] <jouncebot>	 ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T1800).
[18:00:48] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54120 and previous config saved to /var/cache/conftool/dbconfig/20231204-180047-arnaudb.json
[18:00:57] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[18:01:56] <wikibugs>	 (PS1) Brion VIBBER: Encoding cleanup with remuxing support [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979694 (https://phabricator.wikimedia.org/T68722)
[18:02:08] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1077.eqiad.wmnet with OS bullseye
[18:04:33] <wikibugs>	 (CR) Effie Mouzeli: mcrouter: add helmfile (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli)
[18:06:25] <wikibugs>	 (CR) Volans: [C: -1] "I think I found some smaller issues, please see inline questions/comments" [puppet] - https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) (owner: Andrea Denisse)
[18:09:21] <wikibugs>	 ops-codfw, DC-Ops: Q2:rack/setup/install test R760xd host - https://phabricator.wikimedia.org/T352703 (RobH)
[18:15:54] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P54121 and previous config saved to /var/cache/conftool/dbconfig/20231204-181554-arnaudb.json
[18:18:30] <wikibugs>	 (CR) Dzahn: [V: -1] "waiting for addition to google doc by legal" [puppet] - https://gerrit.wikimedia.org/r/980013 (https://phabricator.wikimedia.org/T348520) (owner: Dzahn)
[18:24:32] <wikibugs>	 (PS1) Dzahn: etherpad: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980018
[18:25:00] <wikibugs>	 (CR) CI reject: [V: -1] etherpad: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980018 (owner: Dzahn)
[18:25:58] <wikibugs>	 (PS3) Jforrester: wikifunctions: Drop beta monitoring [puppet] - https://gerrit.wikimedia.org/r/952488 (https://phabricator.wikimedia.org/T321099)
[18:27:21] <wikibugs>	 (Abandoned) Jforrester: wikifunctions: Add production alerting alongside beta [puppet] - https://gerrit.wikimedia.org/r/952486 (owner: Jforrester)
[18:27:55] <wikibugs>	 (PS13) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[18:28:11] <wikibugs>	 (PS1) Dzahn: peopleweb: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980020
[18:28:24] <wikibugs>	 (CR) CI reject: [V: -1] wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[18:28:53] <wikibugs>	 (PS2) Dzahn: etherpad: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980018
[18:29:26] <wikibugs>	 (PS14) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[18:29:58] <wikibugs>	 (CR) CI reject: [V: -1] wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[18:31:01] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P54122 and previous config saved to /var/cache/conftool/dbconfig/20231204-183100-arnaudb.json
[18:31:06] <wikibugs>	 (PS15) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[18:31:35] <wikibugs>	 (CR) CI reject: [V: -1] wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[18:33:00] <wikibugs>	 (PS16) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[18:33:30] <wikibugs>	 (CR) CI reject: [V: -1] wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[18:37:19] <wikibugs>	 (PS17) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[18:38:48] <wikibugs>	 (CR) Muehlenhoff: etherpad: replace ferm::service with firewall::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/980018 (owner: Dzahn)
[18:39:49] <wikibugs>	 (PS1) Andrew Bogott: Horizon: allow image uploading via horizon for users with glance admin [puppet] - https://gerrit.wikimedia.org/r/980021 (https://phabricator.wikimedia.org/T326818)
[18:40:07] <wikibugs>	 (CR) Muehlenhoff: peopleweb: replace ferm::service with firewall::service (3 comments) [puppet] - https://gerrit.wikimedia.org/r/980020 (owner: Dzahn)
[18:41:06] <wikibugs>	 (PS1) Dzahn: firewall::service: spelling fixes, add missing parameter comments [puppet] - https://gerrit.wikimedia.org/r/980022
[18:41:54] <wikibugs>	 (PS3) Dzahn: etherpad: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980018
[18:41:56] <wikibugs>	 (CR) Dzahn: etherpad: replace ferm::service with firewall::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/980018 (owner: Dzahn)
[18:43:13] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[18:43:29] <wikibugs>	 (PS2) Dzahn: peopleweb: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980020
[18:43:43] <wikibugs>	 (CR) Dzahn: peopleweb: replace ferm::service with firewall::service (3 comments) [puppet] - https://gerrit.wikimedia.org/r/980020 (owner: Dzahn)
[18:44:00] <wikibugs>	 (CR) CI reject: [V: -1] peopleweb: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980020 (owner: Dzahn)
[18:44:49] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM (sans aligning issue making CI fail)" [puppet] - https://gerrit.wikimedia.org/r/980020 (owner: Dzahn)
[18:45:08] <wikibugs>	 (PS3) Dzahn: peopleweb: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980020
[18:45:25] <wikibugs>	 (CR) Muehlenhoff: [C: +1] etherpad: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980018 (owner: Dzahn)
[18:46:08] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54123 and previous config saved to /var/cache/conftool/dbconfig/20231204-184607-arnaudb.json
[18:46:09] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance
[18:46:11] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[18:46:24] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance
[18:46:30] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T348183)', diff saved to https://phabricator.wikimedia.org/P54124 and previous config saved to /var/cache/conftool/dbconfig/20231204-184630-arnaudb.json
[18:46:55] <wikibugs>	 (CR) Dzahn: [C: +2] vrts: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/979364 (owner: Muehlenhoff)
[18:47:31] <wikibugs>	 (CR) Muehlenhoff: firewall::service: spelling fixes, add missing parameter comments (1 comment) [puppet] - https://gerrit.wikimedia.org/r/980022 (owner: Dzahn)
[18:47:40] <wikibugs>	 (PS18) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[18:50:39] <wikibugs>	 (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[18:50:53] <wikibugs>	 SRE, ops-eqiad, DBA: Degraded RAID on db1199 - https://phabricator.wikimedia.org/T352238 (VRiley-WMF) This drive have been replaced. Shipping out faulty drive back as per requested.  Completed
[18:51:14] <wikibugs>	 SRE, ops-eqiad, DBA: Degraded RAID on db1199 - https://phabricator.wikimedia.org/T352238 (VRiley-WMF) Open→Resolved
[18:51:52] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1079.eqiad.wmnet with OS bullseye
[18:51:54] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1078.eqiad.wmnet with OS bullseye
[18:51:56] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1077.eqiad.wmnet with OS bullseye
[18:51:58] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1079.eqiad.wmnet with OS bullseye
[18:52:00] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1078.eqiad.wmnet with OS bullseye
[18:52:02] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1077.eqiad.wmnet with OS bullseye
[18:52:40] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1076.eqiad.wmnet with OS bullseye
[18:52:54] <wikibugs>	 (CR) Dzahn: [C: +2] "confirmed noop in prod" [puppet] - https://gerrit.wikimedia.org/r/979364 (owner: Muehlenhoff)
[18:54:03] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/980020 (owner: Dzahn)
[18:55:07] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:55:20] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T348183)', diff saved to https://phabricator.wikimedia.org/P54125 and previous config saved to /var/cache/conftool/dbconfig/20231204-185519-arnaudb.json
[18:55:24] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[18:58:51] <wikibugs>	 SRE, ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[19:00:19] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[19:00:45] <jinxer-wm>	 (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[19:04:30] <wikibugs>	 (PS2) Dzahn: firewall::service: spelling fixes, add missing parameter comments [puppet] - https://gerrit.wikimedia.org/r/980022
[19:05:24] <wikibugs>	 (CR) Dzahn: firewall::service: spelling fixes, add missing parameter comments (1 comment) [puppet] - https://gerrit.wikimedia.org/r/980022 (owner: Dzahn)
[19:06:14] <wikibugs>	 (CR) Dzahn: [C: +2] etherpad: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980018 (owner: Dzahn)
[19:08:58] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1077.eqiad.wmnet with OS bullseye
[19:09:05] <wikibugs>	 (CR) Dzahn: [C: +2] "noop confirmed" [puppet] - https://gerrit.wikimedia.org/r/980018 (owner: Dzahn)
[19:09:06] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1078.eqiad.wmnet with OS bullseye
[19:09:38] <wikibugs>	 (CR) Dzahn: peopleweb: replace ferm::service with firewall::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/980020 (owner: Dzahn)
[19:10:03] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1079.eqiad.wmnet with OS bullseye
[19:10:06] <wikibugs>	 (CR) Muehlenhoff: firewall::service: spelling fixes, add missing parameter comments (1 comment) [puppet] - https://gerrit.wikimedia.org/r/980022 (owner: Dzahn)
[19:10:26] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P54126 and previous config saved to /var/cache/conftool/dbconfig/20231204-191026-arnaudb.json
[19:18:56] <wikibugs>	 (PS1) Ebernhardson: cirrus updater: Update deployed image version [deployment-charts] - https://gerrit.wikimedia.org/r/980024
[19:20:56] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1077.eqiad.wmnet with OS bullseye
[19:21:04] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1077.eqiad.wmnet with OS bullseye
[19:21:06] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1079.eqiad.wmnet with OS bullseye
[19:21:12] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1079.eqiad.wmnet with OS bullseye
[19:21:15] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1078.eqiad.wmnet with OS bullseye
[19:21:21] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1078.eqiad.wmnet with OS bullseye
[19:21:23] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1076.eqiad.wmnet with OS bullseye
[19:21:29] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1076.eqiad.wmnet with OS bullseye
[19:21:56] <wikibugs>	 (CR) Dzahn: [C: +2] peopleweb: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980020 (owner: Dzahn)
[19:22:31] <wikibugs>	 (PS19) Ryan Kemper: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[19:23:37] <wikibugs>	 (CR) Dzahn: [C: +2] "typo "firwall" and didn't replace ferm::service in second example. 'doh :)" [puppet] - https://gerrit.wikimedia.org/r/980020 (owner: Dzahn)
[19:25:15] <wikibugs>	 (CR) Gehel: "Minor comments inline, otherwise LGTM" [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[19:25:33] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P54128 and previous config saved to /var/cache/conftool/dbconfig/20231204-192532-arnaudb.json
[19:25:46] <wikibugs>	 (PS1) Dzahn: peoplweb: fix typo after ferm->firewall change [puppet] - https://gerrit.wikimedia.org/r/980025
[19:26:13] <wikibugs>	 (CR) Dzahn: [C: +2] peoplweb: fix typo after ferm->firewall change [puppet] - https://gerrit.wikimedia.org/r/980025 (owner: Dzahn)
[19:26:25] <wikibugs>	 (PS2) Dzahn: peopleweb: fix typo after ferm->firewall change [puppet] - https://gerrit.wikimedia.org/r/980025
[19:31:55] <wikibugs>	 (CR) Ebernhardson: [C: +2] cirrus updater: Update deployed image version [deployment-charts] - https://gerrit.wikimedia.org/r/980024 (owner: Ebernhardson)
[19:32:45] <wikibugs>	 (Merged) jenkins-bot: cirrus updater: Update deployed image version [deployment-charts] - https://gerrit.wikimedia.org/r/980024 (owner: Ebernhardson)
[19:32:47] <wikibugs>	 (CR) Dzahn: [C: +2] "ok after follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/980025" [puppet] - https://gerrit.wikimedia.org/r/980020 (owner: Dzahn)
[19:35:32] <wikibugs>	 (PS3) Dzahn: firewall::service: spelling fixes, add missing parameter comments [puppet] - https://gerrit.wikimedia.org/r/980022
[19:35:35] <wikibugs>	 (CR) Dzahn: firewall::service: spelling fixes, add missing parameter comments (1 comment) [puppet] - https://gerrit.wikimedia.org/r/980022 (owner: Dzahn)
[19:37:14] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[19:37:25] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[19:40:39] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T348183)', diff saved to https://phabricator.wikimedia.org/P54129 and previous config saved to /var/cache/conftool/dbconfig/20231204-194039-arnaudb.json
[19:40:42] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance
[19:40:44] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[19:40:57] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance
[19:41:03] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54130 and previous config saved to /var/cache/conftool/dbconfig/20231204-194103-arnaudb.json
[19:42:37] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1077.eqiad.wmnet with OS bullseye
[19:42:44] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1078.eqiad.wmnet with OS bullseye
[19:42:45] <wikibugs>	 (CR) Gehel: wdqs: Monitor LDF endpoint (1 comment) [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[19:42:51] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1076.eqiad.wmnet with OS bullseye
[19:43:03] <logmsgbot>	 !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1079.eqiad.wmnet with OS bullseye
[19:55:29] <wikibugs>	 (PS1) Bernard Wang: Deploy VectorClientPreferences to beta and pl,fr,ca wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/980028
[19:56:19] <wikibugs>	 (PS2) Bernard Wang: Deploy VectorClientPreferences to pl,fr,ca wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/980028
[19:56:46] <wikibugs>	 (PS3) Bernard Wang: Deploy VectorClientPreferences to pl,fr,ca wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339)
[19:56:50] <wikibugs>	 sre-alert-triage, Data-Platform-SRE: Alert in need of triage: SmartNotHealthy (instance an-worker1086:9100) - https://phabricator.wikimedia.org/T352168 (Jclark-ctr)
[19:57:07] <wikibugs>	 SRE, ops-eqiad, DC-Ops: hw troubleshooting: Replace 4TB SATA disk in an-worker1086 - https://phabricator.wikimedia.org/T352529 (Jclark-ctr) Open→Resolved @BTullis  Swapped hdd
[20:04:12] <wikibugs>	 (PS4) Bernard Wang: Deploy VectorClientPreferences to beta and pl,fr,ca wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/980028
[20:04:33] <wikibugs>	 SRE, ops-eqiad, Cassandra, DC-Ops: hw troubleshooting: SSD failure (/dev/sde) for aqs1013.eqiad.wmnet - https://phabricator.wikimedia.org/T352344 (Jclark-ctr) Open→Resolved server is out of warranty. Replaced failed drive  with one from recently decommissioned servers
[20:04:47] <wikibugs>	 (PS5) Bernard Wang: Deploy VectorClientPreferences to beta and pl,fr,ca wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/980028
[20:05:24] <jinxer-wm>	 (MDRAIDFailedDisk) resolved: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[20:14:17] <wikibugs>	 (PS1) Kamila Součková: mw-api-int: increase replicas by 30% [deployment-charts] - https://gerrit.wikimedia.org/r/980032
[20:19:00] <icinga-wm>	 RECOVERY - Dell PowerEdge RAID Controller on db1199 is OK: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring
[20:19:22] <jinxer-wm>	 (MDRAIDNotEnoughDisks) firing: (2) MD RAID - insufficient active disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDNotEnoughDisks
[20:23:52] <wikibugs>	 SRE, ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
[20:27:22] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54131 and previous config saved to /var/cache/conftool/dbconfig/20231204-202722-arnaudb.json
[20:27:29] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[20:36:41] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1077.eqiad.wmnet with OS bullseye
[20:36:47] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host ms-be1077.eqiad.wmnet with OS bullseye
[20:39:22] <jinxer-wm>	 (MDRAIDNotEnoughDisks) resolved: (2) MD RAID - insufficient active disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDNotEnoughDisks
[20:42:29] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P54132 and previous config saved to /var/cache/conftool/dbconfig/20231204-204228-arnaudb.json
[20:49:25] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[20:50:25] <wikibugs>	 (03PS1) 10Kamila Součková: Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/980039 (https://phabricator.wikimedia.org/T351074)
[20:50:28] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1077.eqiad.wmnet with reason: host reimage
[20:53:38] <wikibugs>	 (03PS1) 10Kamila Součková: Move mw api servers to kubernetes workers [homer/public] - 10https://gerrit.wikimedia.org/r/980040 (https://phabricator.wikimedia.org/T351074)
[20:53:45] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1077.eqiad.wmnet with reason: host reimage
[20:57:36] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P54133 and previous config saved to /var/cache/conftool/dbconfig/20231204-205735-arnaudb.json
[21:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T2100).
[21:00:04] <jouncebot>	 bvibber and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[21:00:14] <bvibber>	 \o/ whee
[21:00:55] <ebernhardson>	 \o
[21:06:17] <ryankemper>	 !log T351503 Setting partition count to 5: `ryankemper@kafka-main1001:~$ kafka topics --alter --topic eqiad.mediawiki.cirrussearch.page_rerender.v1 --partitions 5`
[21:06:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:32] <stashbot>	 T351503: Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503
[21:09:01] <ryankemper>	 !log T351503 Setting partition count to 5: `ryankemper@kafka-main1001:~$ kafka topics --alter --topic codfw.mediawiki.cirrussearch.page_rerender.v1 --partitions 5`
[21:09:05] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:09:16] <TheresNoTime>	 I can't deploy this evening, sorry! Hopefully someone else will be along shortly
[21:10:46] <bvibber>	 no worries
[21:12:42] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54134 and previous config saved to /var/cache/conftool/dbconfig/20231204-211241-arnaudb.json
[21:12:44] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance
[21:12:48] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[21:12:59] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance
[21:13:06] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T348183)', diff saved to https://phabricator.wikimedia.org/P54135 and previous config saved to /var/cache/conftool/dbconfig/20231204-211305-arnaudb.json
[21:13:35] <wikibugs>	 (03PS2) 10Kamila Součková: mobileapps: 45% to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/976221 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto)
[21:14:05] <logmsgbot>	 !log pt1979@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin1001"
[21:18:03] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T348183)', diff saved to https://phabricator.wikimedia.org/P54136 and previous config saved to /var/cache/conftool/dbconfig/20231204-211803-arnaudb.json
[21:18:10] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[21:18:33] <wikibugs>	 (03PS1) 10Ebernhardson: cirrus updater: Remove kafka start offset [deployment-charts] - 10https://gerrit.wikimedia.org/r/980043
[21:19:07] <jinxer-wm>	 (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[21:19:13] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin1001"
[21:19:19] <logmsgbot>	 !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1077.eqiad.wmnet with OS bullseye
[21:19:25] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host ms-be1077.eqiad.wmnet with OS bullseye completed: - ms-be...
[21:22:43] <bvibber>	 so no deployer this window? :(
[21:23:11] <ebernhardson>	 hmm, i can probably do it i suppose
[21:23:26] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Remove kafka start offset [deployment-charts] - 10https://gerrit.wikimedia.org/r/980043 (owner: 10Ebernhardson)
[21:23:29] <bvibber>	 yay
[21:23:45] <bvibber>	 thanks :D
[21:24:13] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus updater: Remove kafka start offset [deployment-charts] - 10https://gerrit.wikimedia.org/r/980043 (owner: 10Ebernhardson)
[21:24:27] <ebernhardson>	 bvibber: can ship your two patches together?
[21:25:32] <bvibber>	 they can deploy together yeah
[21:25:39] <bvibber>	 one will only affect backend job queue scripts though :D
[21:25:59] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] Always load transcode state from db when opting in to primary db [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979693 (https://phabricator.wikimedia.org/T200939) (owner: 10Brion VIBBER)
[21:26:07] <bvibber>	 woohoo
[21:26:07] <wikibugs>	 (03CR) 10Ebernhardson: [C: 03+2] Encoding cleanup with remuxing support [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979694 (https://phabricator.wikimedia.org/T68722) (owner: 10Brion VIBBER)
[21:27:26] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply
[21:27:44] <logmsgbot>	 !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply
[21:28:12] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] restbase: set production role and add config for restbase2028 [puppet] - 10https://gerrit.wikimedia.org/r/979161 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans)
[21:32:39] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+1] "Happy to have a go at this" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/978030 (https://phabricator.wikimedia.org/T347717) (owner: 10Jgiannelos)
[21:32:59] <wikibugs>	 (03PS1) 10Jforrester: Drop Listings extension from Wikivoyages where unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980047 (https://phabricator.wikimedia.org/T352719)
[21:33:10] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P54137 and previous config saved to /var/cache/conftool/dbconfig/20231204-213309-arnaudb.json
[21:38:53] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979155 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson)
[21:39:36] <wikibugs>	 (03Merged) 10jenkins-bot: cirrus: Enable event bus bridge on more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979155 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson)
[21:39:51] <logmsgbot>	 !log ebernhardson@deploy2002 Started scap: Backport for [[gerrit:979155|cirrus: Enable event bus bridge on more wikis (T352335)]]
[21:39:55] <stashbot>	 T352335: Deploy the new Cirrus Updater to update select wikis in cloudelastic - https://phabricator.wikimedia.org/T352335
[21:41:07] <logmsgbot>	 !log ebernhardson@deploy2002 ebernhardson: Backport for [[gerrit:979155|cirrus: Enable event bus bridge on more wikis (T352335)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:42:39] <logmsgbot>	 !log ebernhardson@deploy2002 ebernhardson: Continuing with sync
[21:44:06] <wikibugs>	 (03Merged) 10jenkins-bot: Always load transcode state from db when opting in to primary db [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979693 (https://phabricator.wikimedia.org/T200939) (owner: 10Brion VIBBER)
[21:44:24] <wikibugs>	 (03Merged) 10jenkins-bot: Encoding cleanup with remuxing support [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979694 (https://phabricator.wikimedia.org/T68722) (owner: 10Brion VIBBER)
[21:45:26] <bvibber>	 yay
[21:46:18] <wikibugs>	 (03PS1) 10Herron: grafana: add dashboard graphite usage exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591)
[21:47:28] <ryankemper>	 !log T351503 Setting partition count to 5: `ryankemper@kafka-main2001:~$ kafka topics --alter --topic eqiad.mediawiki.cirrussearch.page_rerender.v1 --partitions 5`
[21:47:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:47:33] <stashbot>	 T351503: Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503
[21:47:37] <ryankemper>	 !log T351503 Setting partition count to 5: `ryankemper@kafka-main2001:~$ kafka topics --alter --topic codfw.mediawiki.cirrussearch.page_rerender.v1 --partitions 5`
[21:47:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:17] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P54138 and previous config saved to /var/cache/conftool/dbconfig/20231204-214816-arnaudb.json
[21:49:15] <logmsgbot>	 !log ebernhardson@deploy2002 Finished scap: Backport for [[gerrit:979155|cirrus: Enable event bus bridge on more wikis (T352335)]] (duration: 09m 23s)
[21:49:23] <stashbot>	 T352335: Deploy the new Cirrus Updater to update select wikis in cloudelastic - https://phabricator.wikimedia.org/T352335
[21:50:08] <logmsgbot>	 !log ebernhardson@deploy2002 Started scap: Backport for [[gerrit:979693|Always load transcode state from db when opting in to primary db]]
[21:51:09] <icinga-wm>	 PROBLEM - Check systemd state on mw2261 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:51:32] <logmsgbot>	 !log ebernhardson@deploy2002 ebernhardson and brion: Backport for [[gerrit:979693|Always load transcode state from db when opting in to primary db]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[21:52:01] <ebernhardson>	 bvibber: it's loaded onto test servers
[21:52:16] <bvibber>	 testing...
[21:52:47] <bvibber>	 perfect
[21:52:52] <ebernhardson>	 alright, continuing
[21:52:52] <bvibber>	 looks good ebernhardson :D
[21:52:53] <logmsgbot>	 !log ebernhardson@deploy2002 ebernhardson and brion: Continuing with sync
[21:58:46] <logmsgbot>	 !log ebernhardson@deploy2002 Finished scap: Backport for [[gerrit:979693|Always load transcode state from db when opting in to primary db]] (duration: 08m 37s)
[22:00:05] <jouncebot>	 Reedy, sbassett, Maryum, and manfredi: OwO what's this, a deployment window?? Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T2200). nyaa~
[22:00:16] <ebernhardson>	 decent timing, backport window is now complete
[22:00:36] <bvibber>	 thanks very much ebernhardson ! :D
[22:01:09] <wikibugs>	 (03PS6) 10Bernard Wang: Deploy VectorClientPreferences to beta and pl,fr,ca,fa wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028
[22:01:59] <ebernhardson>	 np
[22:03:23] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T348183)', diff saved to https://phabricator.wikimedia.org/P54140 and previous config saved to /var/cache/conftool/dbconfig/20231204-220322-arnaudb.json
[22:03:25] <logmsgbot>	 !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: Maintenance
[22:03:29] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[22:03:39] <logmsgbot>	 !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: Maintenance
[22:03:46] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2189 (T348183)', diff saved to https://phabricator.wikimedia.org/P54141 and previous config saved to /var/cache/conftool/dbconfig/20231204-220345-arnaudb.json
[22:04:46] <icinga-wm>	 PROBLEM - cassandra-a CQL 10.192.16.237:9042 on restbase2028 is CRITICAL: connect to address 10.192.16.237 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[22:07:06] <icinga-wm>	 PROBLEM - cassandra-a SSL 10.192.16.237:7000 on restbase2028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[22:08:17] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T348183)', diff saved to https://phabricator.wikimedia.org/P54142 and previous config saved to /var/cache/conftool/dbconfig/20231204-220817-arnaudb.json
[22:11:56] <icinga-wm>	 PROBLEM - cassandra-b CQL 10.192.16.238:9042 on restbase2028 is CRITICAL: connect to address 10.192.16.238 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[22:12:03] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10Papaul)
[22:14:24] <icinga-wm>	 PROBLEM - cassandra-b SSL 10.192.16.238:7000 on restbase2028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[22:15:01] <wikibugs>	 (03PS1) 10Eevans: restbase: migrate restbase2028 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980049 (https://phabricator.wikimedia.org/T352468)
[22:19:16] <icinga-wm>	 PROBLEM - cassandra-c CQL 10.192.16.239:9042 on restbase2028 is CRITICAL: connect to address 10.192.16.239 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886
[22:21:40] <icinga-wm>	 PROBLEM - cassandra-c SSL 10.192.16.239:7000 on restbase2028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates
[22:23:24] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P54144 and previous config saved to /var/cache/conftool/dbconfig/20231204-222323-arnaudb.json
[22:33:48] <icinga-wm>	 PROBLEM - Restbase root url on restbase2028 is CRITICAL: connect to address 10.192.16.64 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase
[22:38:30] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P54145 and previous config saved to /var/cache/conftool/dbconfig/20231204-223830-arnaudb.json
[22:52:32] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] Drop Listings extension from Wikivoyages where unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980047 (https://phabricator.wikimedia.org/T352719) (owner: 10Jforrester)
[22:52:40] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+1] nlwikivoyage: Drop Listings extension [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980009 (https://phabricator.wikimedia.org/T352696) (owner: 10Jforrester)
[22:53:37] <logmsgbot>	 !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T348183)', diff saved to https://phabricator.wikimedia.org/P54146 and previous config saved to /var/cache/conftool/dbconfig/20231204-225336-arnaudb.json
[22:53:40] <stashbot>	 T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[22:59:04] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:59:27] <wikibugs>	 (03PS2) 10EoghanGaffney: [admin] Add user account for xiaoxiao to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/979389 (https://phabricator.wikimedia.org/T352098)
[22:59:29] <wikibugs>	 (03PS1) 10EoghanGaffney: [admin] Add ecarg shell account [puppet] - 10https://gerrit.wikimedia.org/r/980060 (https://phabricator.wikimedia.org/T350918)
[23:00:19] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:00:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [admin] Add ecarg shell account [puppet] - 10https://gerrit.wikimedia.org/r/980060 (https://phabricator.wikimedia.org/T350918) (owner: 10EoghanGaffney)
[23:03:19] <wikibugs>	 (03PS2) 10EoghanGaffney: [admin] Add ecarg shell account [puppet] - 10https://gerrit.wikimedia.org/r/980060 (https://phabricator.wikimedia.org/T350918)
[23:03:59] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] restbase: migrate restbase2028 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980049 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans)
[23:06:46] <wikibugs>	 (03PS3) 10EoghanGaffney: [admin] Add ecarg shell account [puppet] - 10https://gerrit.wikimedia.org/r/980060 (https://phabricator.wikimedia.org/T350918)
[23:07:59] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on restbase2028:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[23:11:22] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:11:32] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:12:20] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:15:38] <icinga-wm>	 PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[23:20:16] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:20:26] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:21:14] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:32:18] <wikibugs>	 (03PS1) 10Kimberly Sarabia: Remove readability survey tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337)
[23:33:23] <wikibugs>	 (03PS20) 10Bking: wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[23:33:32] <wikibugs>	 (03CR) 10Bking: wdqs: Monitor LDF endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking)
[23:33:37] <wikibugs>	 (03PS21) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[23:37:38] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:37:45] <wikibugs>	 (03PS22) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690)
[23:38:10] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:38:20] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:38:49] <wikibugs>	 (03CR) 10Effie Mouzeli: mcrouter: add chart (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli)
[23:38:59] <wikibugs>	 (03CR) 10Jdlrobson: [C: 04-1] "I just wanted to check the approach you are taking is consistent with my understanding:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) (owner: 10Kimberly Sarabia)
[23:39:58] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:40:38] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:41:10] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:41:29] <wikibugs>	 (03CR) 10Jdlrobson: [C: 04-1] Deploy VectorClientPreferences to beta and pl,fr,ca,fa wikis (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (owner: 10Bernard Wang)
[23:42:15] <wikibugs>	 (03CR) 10Bking: [C: 03+1] restbase: migrate restbase2028 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980049 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans)
[23:43:25] <wikibugs>	 (03PS21) 10Bking: wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[23:43:43] <wikibugs>	 (03PS7) 10Jdlrobson: Deploy VectorClientPreferences to beta and pl,fr,ca,fa wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) (owner: 10Bernard Wang)
[23:44:07] <wikibugs>	 (03PS22) 10Bking: wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[23:51:31] <wikibugs>	 (03CR) 10Eevans: [C: 03+2] restbase: migrate restbase2028 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980049 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans)
[23:51:46] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:52:51] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.puppet.migrate-host for host restbase2028.codfw.wmnet
[23:53:06] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.310 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[23:53:27] <logmsgbot>	 !log eevans@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host restbase2028.codfw.wmnet
[23:54:02] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[23:55:25] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Patch-For-Review: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10KFrancis) Done, thanks!