[00:09:25] !log andrew@cumin1001 START - Cookbook sre.hosts.reboot-single for host cloudcontrol1006.eqiad.wmnet
[00:16:11] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:16:28] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudcontrol1006.eqiad.wmnet
[00:20:35] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_eventlogging_analytics.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:38:20] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/979450
[00:38:26] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/979450 (owner: TrainBranchBot)
[00:40:00] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[00:49:24] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[00:58:49] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/979450 (owner: TrainBranchBot)
[01:23:52] SRE, ops-eqiad: Inbound interface errors -
https://phabricator.wikimedia.org/T342502 (phaultfinder)
[02:01:05] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:39:04] (JobUnavailable) firing: (10) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:49:49] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[02:55:45] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[03:00:18] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[03:00:44] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[03:07:37] PROBLEM - Work requests waiting in Zuul Gearman server on contint2002 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [400.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[03:09:04] (JobUnavailable) firing: (10) Reduced availability for job liberica in ops@eqiad -
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:28:53] (CR) KartikMistry: [C: +2] Update MinT to 2023-11-21-115852-production [deployment-charts] - https://gerrit.wikimedia.org/r/978727 (owner: KartikMistry)
[03:30:00] (Merged) jenkins-bot: Update MinT to 2023-11-21-115852-production [deployment-charts] - https://gerrit.wikimedia.org/r/978727 (owner: KartikMistry)
[03:30:01] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-int_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:34:11] RECOVERY - Work requests waiting in Zuul Gearman server on contint2002 is OK: OK: Less than 100.00% above the threshold [200.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/d/000000322/zuul-gearman?orgId=1&viewPanel=10
[03:36:49] PROBLEM - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:16:07] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:26:25] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:29:15] RECOVERY - Check unit status of httpbb_kubernetes_mw-api-int_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-int_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[04:40:01] (BlazegraphFreeAllocatorsDecreasingRapidly) firing: Blazegraph instance
wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[04:43:37] !log [WDQS] Clearing `BlazegraphFreeAllocatorsDecreasingRapidly` -> `ryankemper@wdqs1007:~$ sudo systemctl restart wdqs-blazegraph`
[04:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:49:24] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency
[04:50:01] (BlazegraphFreeAllocatorsDecreasingRapidly) resolved: Blazegraph instance wdqs1007:9193 is burning free allocators at a very high rate - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Runbook#Free_allocators_decrease_rapidly - https://grafana.wikimedia.org/d/000000489/wikidata-query-service - https://alerts.wikimedia.org/?q=alertname%3DBlazegraphFreeAllocatorsDecreasingRapidly
[05:37:15] PROBLEM - BGP status on cr1-esams is CRITICAL: BGP CRITICAL - No response from remote host 185.15.59.128 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[05:54:33] (PS1) KartikMistry: Update cxserver to 2023-12-04-055024-production [deployment-charts] - https://gerrit.wikimedia.org/r/979487 (https://phabricator.wikimedia.org/T270060)
[05:55:56] ^ Deploying cxserver..
[05:56:43] (CR) KartikMistry: [C: +2] Update cxserver to 2023-12-04-055024-production [deployment-charts] - https://gerrit.wikimedia.org/r/979487 (https://phabricator.wikimedia.org/T270060) (owner: KartikMistry)
[05:57:46] (Merged) jenkins-bot: Update cxserver to 2023-12-04-055024-production [deployment-charts] - https://gerrit.wikimedia.org/r/979487 (https://phabricator.wikimedia.org/T270060) (owner: KartikMistry)
[05:58:46] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/cxserver: apply
[05:59:21] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/cxserver: apply
[06:02:51] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/cxserver: apply
[06:03:30] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/cxserver: apply
[06:05:43] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/cxserver: apply
[06:06:14] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/cxserver: apply
[06:08:03] !log Updated cxserver to 2023-12-04-055024-production (T270060, T350773, T352620)
[06:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:08:10] T270060: Package apertium-fra-frp (French-Arpitan) - https://phabricator.wikimedia.org/T270060
[06:08:10] T350773: Remove preq and use node fetch - https://phabricator.wikimedia.org/T350773
[06:08:10] T352620: Failure to start new translations - https://phabricator.wikimedia.org/T352620
[06:11:58] !log kartik@deploy2002 helmfile [staging] START helmfile.d/services/machinetranslation: apply
[06:12:10] Minor deployment for MinT too ^^
[06:14:09] (PS1) Zoranzoki21: Revert "throttle.php: Cleanup old rules, add new one" [mediawiki-config] - https://gerrit.wikimedia.org/r/979224 (https://phabricator.wikimedia.org/T352569)
[06:14:53] !log kartik@deploy2002 helmfile [staging] DONE helmfile.d/services/machinetranslation: apply
[06:15:37] (PS1) Giuseppe Lavagetto: Add throttle rule for editathon
[mediawiki-config] - https://gerrit.wikimedia.org/r/979488 (https://phabricator.wikimedia.org/T352569)
[06:15:45] (PS2) Zoranzoki21: Revert "throttle.php: Cleanup old rules, add new one" [mediawiki-config] - https://gerrit.wikimedia.org/r/979224 (https://phabricator.wikimedia.org/T352569)
[06:16:07] (Abandoned) Zoranzoki21: Revert "throttle.php: Cleanup old rules, add new one" [mediawiki-config] - https://gerrit.wikimedia.org/r/979224 (https://phabricator.wikimedia.org/T352569) (owner: Zoranzoki21)
[06:17:53] (CR) Zoranzoki21: [C: -1] Add throttle rule for editathon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/979488 (https://phabricator.wikimedia.org/T352569) (owner: Giuseppe Lavagetto)
[06:22:59] (PuppetZeroResources) firing: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:28:08] !log kartik@deploy2002 helmfile [eqiad] START helmfile.d/services/machinetranslation: apply
[06:31:04] (CR) Anzx: [C: -1] Add throttle rule for editathon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/979488 (https://phabricator.wikimedia.org/T352569) (owner: Giuseppe Lavagetto)
[06:33:49] !log kartik@deploy2002 helmfile [eqiad] DONE helmfile.d/services/machinetranslation: apply
[06:35:51] (CR) Giuseppe Lavagetto: Add throttle rule for editathon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/979488 (https://phabricator.wikimedia.org/T352569) (owner: Giuseppe Lavagetto)
[06:37:59] (PuppetZeroResources) resolved: Puppet has failed generate resources on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetZeroResources
[06:38:20] (PS2) Giuseppe Lavagetto: Add throttle rule for
editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/979488 (https://phabricator.wikimedia.org/T352569)
[06:42:00] (CR) Zoranzoki21: [C: +1] Add throttle rule for editathon (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/979488 (https://phabricator.wikimedia.org/T352569) (owner: Giuseppe Lavagetto)
[06:44:51] !log kartik@deploy2002 helmfile [codfw] START helmfile.d/services/machinetranslation: apply
[06:46:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on db[2135,2160].codfw.wmnet,db[1119,1176,1217].eqiad.wmnet with reason: m5 master switch T352505
[06:47:00] T352505: Switchover m5 master db1176 -> db1119 - https://phabricator.wikimedia.org/T352505
[06:47:15] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on db[2135,2160].codfw.wmnet,db[1119,1176,1217].eqiad.wmnet with reason: m5 master switch T352505
[06:49:00] (PS1) Marostegui: mariadb: Promote db1119 to m5 master [puppet] - https://gerrit.wikimedia.org/r/979489 (https://phabricator.wikimedia.org/T352505)
[06:49:49] !log kartik@deploy2002 helmfile [codfw] DONE helmfile.d/services/machinetranslation: apply
[06:50:14] (PS4) Marostegui: parsercachepurging.pp: Increase retention back to 30 days [puppet] - https://gerrit.wikimedia.org/r/877205 (https://phabricator.wikimedia.org/T280604)
[06:51:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:52:42] (CR) Krinkle: "should pc4 use the same expiry?"
[puppet] - https://gerrit.wikimedia.org/r/877205 (https://phabricator.wikimedia.org/T280604) (owner: Marostegui)
[06:53:02] (CR) Marostegui: [C: +2] mariadb: Promote db1119 to m5 master [puppet] - https://gerrit.wikimedia.org/r/979489 (https://phabricator.wikimedia.org/T352505) (owner: Marostegui)
[06:53:11] multiple people on the Help desk reporting that they're getting Rdbms errors when editing https://en.wikipedia.org/wiki/Wikipedia:Help_desk
[06:55:51] (CR) Marostegui: "Good point Timo! It will!" [puppet] - https://gerrit.wikimedia.org/r/877205 (https://phabricator.wikimedia.org/T280604) (owner: Marostegui)
[06:56:07] something about the write duration exceeding a 3 second limit, someone opened a task at https://phabricator.wikimedia.org/T352628
[06:56:16] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[06:57:07] chlod: I guess a write taking more than 3 seconds
[06:57:20] !log Failover m5 from db1176 to db1119 - T332155
[06:57:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:57:24] T332155: Switchover m5 master (db1106 -> db1176) - https://phabricator.wikimedia.org/T332155
[07:00:02] (PS5) Marostegui: parsercachepurging.pp: Increase retention back to 30 days [puppet] - https://gerrit.wikimedia.org/r/877205 (https://phabricator.wikimedia.org/T280604)
[07:00:18] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[07:00:44] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 -
https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[07:01:14] (PS1) Marostegui: db1176: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/979490 (https://phabricator.wikimedia.org/T352361)
[07:02:01] (CR) Marostegui: [C: +2] db1176: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/979490 (https://phabricator.wikimedia.org/T352361) (owner: Marostegui)
[07:03:27] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host db1176.eqiad.wmnet with OS bookworm
[07:07:33] !log Updated MinT to 2023-11-21-115852-production
[07:07:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:40] Forgot to log earlier ^^
[07:10:05] (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[07:15:20] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on db1176.eqiad.wmnet with reason: host reimage
[07:16:21] (PS1) Marostegui: Revert "db1176: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/979225
[07:16:28] (CR) Marostegui: [C: -2] "Not yet" [puppet] - https://gerrit.wikimedia.org/r/979225 (owner: Marostegui)
[07:18:23] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on db1176.eqiad.wmnet with reason: host reimage
[07:19:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops -
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:24:16] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[07:31:32] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host db1176.eqiad.wmnet with OS bookworm
[07:32:14] (CR) Marostegui: Revert "db1176: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/979225 (owner: Marostegui)
[07:32:17] (CR) Marostegui: [C: +2] Revert "db1176: Disable notifications" [puppet] - https://gerrit.wikimedia.org/r/979225 (owner: Marostegui)
[07:33:29] (CR) Marostegui: [C: +2] parsercachepurging.pp: Increase retention back to 30 days [puppet] - https://gerrit.wikimedia.org/r/877205 (https://phabricator.wikimedia.org/T280604) (owner: Marostegui)
[07:39:26] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[07:39:51] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1129.eqiad.wmnet with reason: Maintenance
[07:39:58] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1129 (T348183)', diff saved to https://phabricator.wikimedia.org/P54062 and previous config saved to /var/cache/conftool/dbconfig/20231204-073957-arnaudb.json
[07:40:02] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[07:42:36] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[07:42:39] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129
(T348183)', diff saved to https://phabricator.wikimedia.org/P54063 and previous config saved to /var/cache/conftool/dbconfig/20231204-074238-arnaudb.json
[07:53:34] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:07] (PS1) Marostegui: dbproxy1022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/979676 (https://phabricator.wikimedia.org/T351864)
[07:54:37] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bookworm
[07:54:40] (CR) Marostegui: [C: +2] dbproxy1022: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/979676 (https://phabricator.wikimedia.org/T351864) (owner: Marostegui)
[07:55:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:57:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P54064 and previous config saved to /var/cache/conftool/dbconfig/20231204-075745-arnaudb.json
[08:00:04] Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T0800).
[08:00:05] _joe_ and aanzx: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[08:00:12] (PS4) Anzx: hewikivoyage: add tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/979686 (https://phabricator.wikimedia.org/T351981)
[08:00:15] (PS2) Anzx: azwiki: Enable $wgMinervaEnableSiteNotice [mediawiki-config] - https://gerrit.wikimedia.org/r/979223 (https://phabricator.wikimedia.org/T352621)
[08:00:17] (PS3) Anzx: trwikivoyage: update wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/978522 (https://phabricator.wikimedia.org/T352329)
[08:02:54] <_joe_> o/
[08:03:15] <_joe_> I'm happy to merge my own patch, I can't be the general deployer though
[08:04:10] <_joe_> urbanecm: around?
[08:04:18] yes
[08:04:19] <_joe_> or Amir1
[08:04:21] <_joe_> ack
[08:04:23] 'morning everyone
[08:04:30] <_joe_> good morning :)
[08:04:43] <_joe_> I'll go and merge this change for the naughty editathoners
[08:04:51] o/
[08:05:01] <_joe_> who opened the throttle request on late friday evening for monday :P
[08:05:26] heh
[08:05:43] (CR) TrainBranchBot: [C: +2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/979488 (https://phabricator.wikimedia.org/T352569) (owner: Giuseppe Lavagetto)
[08:05:46] (PS1) Muehlenhoff: ganeti: Switch eqiad to PKI [puppet] - https://gerrit.wikimedia.org/r/979838 (https://phabricator.wikimedia.org/T350686)
[08:06:04] <_joe_> urbanecm: I'm trying to be lenient but yeah...
[08:06:30] (Merged) jenkins-bot: Add throttle rule for editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/979488 (https://phabricator.wikimedia.org/T352569) (owner: Giuseppe Lavagetto)
[08:07:27] !log oblivian@deploy2002 Started scap: Backport for [[gerrit:979488|Add throttle rule for editathon (T352569)]]
[08:07:31] T352569: Lift IP cap on 2023-12-04 for Editathon for commonswiki and eswiki - https://phabricator.wikimedia.org/T352569
[08:08:26] _joe_: fwiw, the official guidelines (https://meta.wikimedia.org/wiki/Mass_account_creation#Requesting_temporary_lift_of_IP_cap) say "two weeks in advance" :-/
[08:08:49] <_joe_> urbanecm: https://phabricator.wikimedia.org/T352569#9377671
[08:08:51] <_joe_> :P
[08:09:14] <_joe_> urbanecm: I even have to run a script, ofc
[08:09:24] <_joe_> because we can't ever have nice things
[08:09:39] PROBLEM - SSH on wdqs1022 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[08:10:07] yeah... and hope the IP info is okay.
[08:10:44] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:10:52] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1022.eqiad.wmnet with OS bookworm
[08:11:26] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bookworm
[08:12:52] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P54065 and previous config saved to /var/cache/conftool/dbconfig/20231204-081251-arnaudb.json
[08:17:27] !log oblivian@deploy2002 oblivian: Backport for [[gerrit:979488|Add throttle rule for editathon (T352569)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:17:30] T352569: Lift IP cap on 2023-12-04 for Editathon for commonswiki and eswiki - https://phabricator.wikimedia.org/T352569
[08:17:31] (PS7) Elukey: changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950)
[08:18:33] !log oblivian@deploy2002 oblivian: Continuing with sync
[08:19:27] PROBLEM - Check systemd state on wdqs1023 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:19:52] (CR) Muehlenhoff: [C: +2] ganeti: Switch eqiad to PKI [puppet] - https://gerrit.wikimedia.org/r/979838 (https://phabricator.wikimedia.org/T350686) (owner: Muehlenhoff)
[08:21:41] <_joe_> urbanecm: I'm almost done; building the image took a long time but almost everything is unusually slow
[08:23:01] Ack
[08:23:12] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1023:9100
- https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:23:15] <_joe_> !log clearing throttle cache for T352569
[08:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:23:19] T352569: Lift IP cap on 2023-12-04 for Editathon for commonswiki and eswiki - https://phabricator.wikimedia.org/T352569
[08:24:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM moscovium.eqiad.wmnet
[08:25:32] !log oblivian@deploy2002 Finished scap: Backport for [[gerrit:979488|Add throttle rule for editathon (T352569)]] (duration: 18m 04s)
[08:26:59] <_joe_> urbanecm: I'm done
[08:27:05] ack
[08:27:12] anzx: hi, still around? :)
[08:27:17] Ues
[08:27:20] Yes
[08:27:30] (PS5) Urbanecm: hewikivoyage: add tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/979686 (https://phabricator.wikimedia.org/T351981) (owner: Anzx)
[08:27:35] (PS3) Urbanecm: azwiki: Enable $wgMinervaEnableSiteNotice [mediawiki-config] - https://gerrit.wikimedia.org/r/979223 (https://phabricator.wikimedia.org/T352621) (owner: Anzx)
[08:27:39] (CR) Urbanecm: [C: +2] hewikivoyage: add tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/979686 (https://phabricator.wikimedia.org/T351981) (owner: Anzx)
[08:27:44] (CR) Urbanecm: [C: +2] azwiki: Enable $wgMinervaEnableSiteNotice [mediawiki-config] - https://gerrit.wikimedia.org/r/979223 (https://phabricator.wikimedia.org/T352621) (owner: Anzx)
[08:27:50] (PS4) Urbanecm: trwikivoyage: update wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/978522 (https://phabricator.wikimedia.org/T352329) (owner: Anzx)
[08:27:54] (CR) Urbanecm: [C: +2] trwikivoyage: update wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/978522 (https://phabricator.wikimedia.org/T352329) (owner: Anzx)
[08:27:59] !log arnaudb@cumin1001
dbctl commit (dc=all): 'Repooling after maintenance db1129 (T348183)', diff saved to https://phabricator.wikimedia.org/P54066 and previous config saved to /var/cache/conftool/dbconfig/20231204-082758-arnaudb.json
[08:28:01] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[08:28:03] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[08:28:15] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[08:28:34] (Merged) jenkins-bot: hewikivoyage: add tagline [mediawiki-config] - https://gerrit.wikimedia.org/r/979686 (https://phabricator.wikimedia.org/T351981) (owner: Anzx)
[08:28:38] (Merged) jenkins-bot: azwiki: Enable $wgMinervaEnableSiteNotice [mediawiki-config] - https://gerrit.wikimedia.org/r/979223 (https://phabricator.wikimedia.org/T352621) (owner: Anzx)
[08:28:46] (Merged) jenkins-bot: trwikivoyage: update wordmark [mediawiki-config] - https://gerrit.wikimedia.org/r/978522 (https://phabricator.wikimedia.org/T352329) (owner: Anzx)
[08:28:53] let's send it out
[08:29:32] (PS1) David Caro: codfw1dev: add smart_hosts [puppet] - https://gerrit.wikimedia.org/r/979888 (https://phabricator.wikimedia.org/T350008)
[08:29:44] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:979686|hewikivoyage: add tagline (T351981)]], [[gerrit:979223|azwiki: Enable $wgMinervaEnableSiteNotice (T352621)]], [[gerrit:978522|trwikivoyage: update wordmark (T352329)]]
[08:29:50] T351981: Change Hebrew Wikivoyage wordmark logo - https://phabricator.wikimedia.org/T351981
[08:29:51] T352621: Enable $wgMinervaEnableSiteNotice for azwiki - https://phabricator.wikimedia.org/T352621
[08:29:51] T352329: Remove logo from Turkish Wikivoyage wordmark - https://phabricator.wikimedia.org/T352329
[08:30:42] !log
arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[08:30:43] (PS2) David Caro: codfw1dev: add smart_hosts [puppet] - https://gerrit.wikimedia.org/r/979888 (https://phabricator.wikimedia.org/T350008)
[08:30:56] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1146.eqiad.wmnet with reason: Maintenance
[08:31:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1146:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54067 and previous config saved to /var/cache/conftool/dbconfig/20231204-083102-arnaudb.json
[08:31:05] !log urbanecm@deploy2002 urbanecm and anzx: Backport for [[gerrit:979686|hewikivoyage: add tagline (T351981)]], [[gerrit:979223|azwiki: Enable $wgMinervaEnableSiteNotice (T352621)]], [[gerrit:978522|trwikivoyage: update wordmark (T352329)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[08:31:12] Checking
[08:31:13] anzx: please test at the debug servers
[08:31:14] ty
[08:32:26] (PS8) Elukey: changeprop: refactor templating for Kafka producer/consumer settings [deployment-charts] - https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950)
[08:32:52] urbanecm: looks good
[08:33:23] !log urbanecm@deploy2002 urbanecm and anzx: Continuing with sync
[08:35:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54068 and previous config saved to /var/cache/conftool/dbconfig/20231204-083534-arnaudb.json
[08:35:39] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[08:38:12] (CR) Elukey: "Hugh/Joe: Tried to refactor another time the charts, lemme know if you like it or not :)" [deployment-charts] - https://gerrit.wikimedia.org/r/971113
(https://phabricator.wikimedia.org/T348950) (owner: Elukey)
[08:39:17] (PS3) David Caro: codfw1dev: add smart_hosts [puppet] - https://gerrit.wikimedia.org/r/979888 (https://phabricator.wikimedia.org/T350008)
[08:39:34] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:979686|hewikivoyage: add tagline (T351981)]], [[gerrit:979223|azwiki: Enable $wgMinervaEnableSiteNotice (T352621)]], [[gerrit:978522|trwikivoyage: update wordmark (T352329)]] (duration: 09m 49s)
[08:39:37] anzx: done
[08:39:39] T351981: Change Hebrew Wikivoyage wordmark logo - https://phabricator.wikimedia.org/T351981
[08:39:40] T352621: Enable $wgMinervaEnableSiteNotice for azwiki - https://phabricator.wikimedia.org/T352621
[08:39:40] T352329: Remove logo from Turkish Wikivoyage wordmark - https://phabricator.wikimedia.org/T352329
[08:41:11] (CR) David Caro: "Tested on codfw by cherry-picking on the local puppetmaster and running on a VM (etcd-discovery-2.cloudinfra-codfw1dev.codfw1dev.wikimedia" [puppet] - https://gerrit.wikimedia.org/r/979888 (https://phabricator.wikimedia.org/T350008) (owner: David Caro)
[08:43:00] !log upgrade istio (buster -> bullseye) on dse-k8s-eqiad - T351933
[08:43:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:43:10] T351933: Bump istio Docker images to Bookworm - https://phabricator.wikimedia.org/T351933
[08:43:12] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1023:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:43:43] urbanecm: thanks logo seems to appears
[08:44:15] correctly
[08:44:41] PROBLEM - Host moscovium is DOWN: PING CRITICAL - Packet loss = 100%
[08:44:59] yay
[08:45:49] !log marostegui@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host dbproxy1022.eqiad.wmnet with OS bookworm
[08:46:19] RECOVERY -
Host moscovium is UP: PING OK - Packet loss = 0%, RTA = 0.43 ms [08:48:09] !log upgrade istio (buster -> bullseye) on aux-k8s-eqiad - T351933 [08:48:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:14] T351933: Bump istio Docker images to Bookworm - https://phabricator.wikimedia.org/T351933 [08:49:04] (ProbeDown) resolved: Service moscovium:443 has failed probes (http_rt_wikimedia_org_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#moscovium:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [08:49:05] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:49:23] RECOVERY - Check systemd state on wdqs1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:49:24] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [08:50:09] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM moscovium.eqiad.wmnet [08:50:34] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1022.eqiad.wmnet with OS bookworm [08:50:41] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P54069 and previous config saved to /var/cache/conftool/dbconfig/20231204-085041-arnaudb.json [08:53:13] PROBLEM - SSH on wdqs1023 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [08:55:05] 10sre-alert-triage, 
10Infrastructure-Foundations: Alert triage: overdue alert [warning] Systemd units failing on debmonitor2003 - https://phabricator.wikimedia.org/T343897 (10LSobanski) Updating as this alert came up on the overdue list again. [08:58:40] !log upgrade istio (buster -> bullseye) on ml-serve-eqiad - T351933 [08:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:58:44] T351933: Bump istio Docker images to Bookworm - https://phabricator.wikimedia.org/T351933 [09:00:07] RECOVERY - SSH on wdqs1023 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:02:23] (03PS1) 10Muehlenhoff: ganeti: Configure eqiad/test for PKI [puppet] - 10https://gerrit.wikimedia.org/r/979890 (https://phabricator.wikimedia.org/T350686) [09:05:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312', diff saved to https://phabricator.wikimedia.org/P54070 and previous config saved to /var/cache/conftool/dbconfig/20231204-090547-arnaudb.json [09:07:31] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:09:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:14:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:17:47] PROBLEM - Check systemd state on wdqs1022 is CRITICAL: CRITICAL - degraded: The following units failed: systemd-timedated.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:19:42] (SystemdUnitFailed) firing: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:20:08] (03PS1) 10Brouberol: Define a DNS A record for the dse k8s ingress gateway [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) [09:20:10] (03PS1) 10Brouberol: Enable ingress for the spark-history server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/979892 (https://phabricator.wikimedia.org/T352639) [09:20:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1146:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54072 and previous config saved to /var/cache/conftool/dbconfig/20231204-092054-arnaudb.json [09:20:58] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [09:20:58] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [09:21:11] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance [09:21:12] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:21:30] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance [09:21:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T348183)', diff saved to https://phabricator.wikimedia.org/P54073 and previous config saved to 
/var/cache/conftool/dbconfig/20231204-092136-arnaudb.json [09:22:57] (03PS1) 10MVernon: Swift: Set new-style storage for ms-be1076-89,ms-be2080-9 [puppet] - 10https://gerrit.wikimedia.org/r/979893 (https://phabricator.wikimedia.org/T349840) [09:23:58] (03CR) 10Muehlenhoff: [C: 03+2] ganeti: Configure eqiad/test for PKI [puppet] - 10https://gerrit.wikimedia.org/r/979890 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [09:24:52] (03CR) 10Elukey: [C: 04-1] "Please check what was done for other k8s ingress services (for example k8s-ingress-aux). This needs to be a new LVS service:" [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [09:26:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T348183)', diff saved to https://phabricator.wikimedia.org/P54074 and previous config saved to /var/cache/conftool/dbconfig/20231204-092600-arnaudb.json [09:26:05] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [09:26:53] (03CR) 10Arnaudb: [V: 03+1 C: 03+1] Swift: Set new-style storage for ms-be1076-89,ms-be2080-9 [puppet] - 10https://gerrit.wikimedia.org/r/979893 (https://phabricator.wikimedia.org/T349840) (owner: 10MVernon) [09:28:43] (03CR) 10Brouberol: "Ah, I see. We define an LVS-ed service with a reserved IP, and that's the IP being resolved by the DNS record, not the k8s service DNS. 
Th" [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [09:29:57] (03CR) 10MVernon: [C: 03+2] Swift: Set new-style storage for ms-be1076-89,ms-be2080-9 [puppet] - 10https://gerrit.wikimedia.org/r/979893 (https://phabricator.wikimedia.org/T349840) (owner: 10MVernon) [09:34:42] (SystemdUnitFailed) resolved: systemd-timedated.service Failed on wdqs1022:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed [09:35:29] RECOVERY - Check systemd state on wdqs1022 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:35:36] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10MatthewVernon) @Jclark-ctr sorry, there are some puppet changes that have to be made before new ms-be* nodes will install cleanly, which is why those nodes failed on Friday.... 
[09:35:55] RECOVERY - SSH on wdqs1022 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:36:14] !log upgrade istio (buster -> bullseye) on ml-serve-codfw - T351933 [09:36:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:18] T351933: Bump istio Docker images to Bookworm - https://phabricator.wikimedia.org/T351933 [09:41:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P54075 and previous config saved to /var/cache/conftool/dbconfig/20231204-094107-arnaudb.json [09:44:42] (03PS1) 10Muehlenhoff: ganeti: Remove non-PKI code for RAPI access [puppet] - 10https://gerrit.wikimedia.org/r/979897 (https://phabricator.wikimedia.org/T350686) [09:47:09] (03CR) 10Filippo Giunchedi: [C: 03+2] k8s: allow setting prometheus retention in cluster definition [puppet] - 10https://gerrit.wikimedia.org/r/977687 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [09:49:20] !log volans@cumin1001 START - Cookbook sre.hosts.provision for host dbproxy1022.mgmt.eqiad.wmnet with reboot policy GRACEFUL [09:50:47] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979897 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [09:52:55] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: set 850GB retention for prometheus@k8s [puppet] - 10https://gerrit.wikimedia.org/r/977688 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [09:56:14] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P54076 and previous config saved to /var/cache/conftool/dbconfig/20231204-095614-arnaudb.json [09:57:39] !log roll-restart prometheus/k8s to apply size-based retention - T351179 [09:57:40] (03CR) 10Giuseppe Lavagetto: [C: 03+1] add wmf-debci image 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/979355 (https://phabricator.wikimedia.org/T352003) (owner: 10Jelto) [09:57:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:57:43] T351179: LVM vg0 close to getting full on prometheus eqiad - https://phabricator.wikimedia.org/T351179 [09:58:27] !log volans@cumin1001 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host dbproxy1022.mgmt.eqiad.wmnet with reboot policy GRACEFUL [09:59:05] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1022.eqiad.wmnet with reason: host reimage [09:59:54] 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10mfossati) >>! In T350020#9376060, @jcrespo wrote: >>>! In T350020#9375684, @mfossati wrote: >> @jcrespo , would it be possible... [10:00:29] 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the backup cluster - https://phabricator.wikimedia.org/T350020 (10mfossati) Also CC @fkaelin . 
[10:02:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1022.eqiad.wmnet with reason: host reimage [10:08:53] (03PS1) 10Filippo Giunchedi: hieradata: adjust prometheus k8s retention to current utilization [puppet] - 10https://gerrit.wikimedia.org/r/979898 (https://phabricator.wikimedia.org/T351179) [10:11:21] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T348183)', diff saved to https://phabricator.wikimedia.org/P54077 and previous config saved to /var/cache/conftool/dbconfig/20231204-101120-arnaudb.json [10:11:23] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:11:32] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [10:11:37] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance [10:11:44] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54078 and previous config saved to /var/cache/conftool/dbconfig/20231204-101143-arnaudb.json [10:16:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54079 and previous config saved to /var/cache/conftool/dbconfig/20231204-101615-arnaudb.json [10:17:01] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 138997 [10:17:37] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 138997 [10:17:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host dbproxy1022.eqiad.wmnet with OS bookworm [10:19:22] (03PS1) 10Marostegui: Revert "dbproxy1022: Disable notifications" [puppet] - 
10https://gerrit.wikimedia.org/r/979689 [10:20:35] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 35 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP [10:20:37] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1022: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/979689 (owner: 10Marostegui) [10:20:53] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 35 days, 0:00:00 on debmonitor2003.codfw.wmnet with reason: WIP [10:21:54] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/977734 (https://phabricator.wikimedia.org/T351936) (owner: 10Filippo Giunchedi) [10:23:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:26:09] 10SRE, 10LDAP-Access-Requests: Grant Access to archiva-deployers for pfischer - https://phabricator.wikimedia.org/T352475 (10Gehel) Approved! 
[10:26:26] (03CR) 10Filippo Giunchedi: [C: 03+2] hieradata: adjust prometheus k8s retention to current utilization [puppet] - 10https://gerrit.wikimedia.org/r/979898 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [10:28:16] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:28:22] !log upgrade istio (buster -> bullseye) on wikikube eqiad - T351933 [10:28:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:28:26] T351933: Bump istio Docker images to Bookworm - https://phabricator.wikimedia.org/T351933 [10:29:52] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 237 [10:29:56] (03PS9) 10Vgutierrez: lvs::realserver::ipip: Check that TCP MSS clamping is working [puppet] - 10https://gerrit.wikimedia.org/r/977696 (https://phabricator.wikimedia.org/T351069) [10:30:29] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 237 [10:30:32] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 19165 [10:31:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P54080 and previous config saved to /var/cache/conftool/dbconfig/20231204-103121-arnaudb.json [10:31:27] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 19165 [10:32:14] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 15305 [10:32:43] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 15305 [10:32:57] 
!log upgrade istio (buster -> bullseye) on wikikube codfw - T351933 [10:33:00] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 398446 [10:33:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:13] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 398446 [10:33:32] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 142505 [10:33:54] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 142505 [10:34:14] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 33604 [10:34:17] (03PS1) 10David Caro: openstack,trove: increase api response alert to 3s [alerts] - 10https://gerrit.wikimedia.org/r/979899 [10:35:11] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 33604 [10:35:14] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 4800 [10:35:53] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 4800 [10:36:05] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 44592 [10:36:34] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 44592 [10:36:38] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 58952 [10:37:26] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 58952 [10:37:30] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 31898 [10:37:54] (03CR) 10Giuseppe Lavagetto: "I think the code can be made slightly more readable, see my suggestion. 
Otherwise LGTM" [deployment-charts] - 10https://gerrit.wikimedia.org/r/971113 (https://phabricator.wikimedia.org/T348950) (owner: 10Elukey) [10:38:16] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 31898 [10:38:20] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 63927 [10:38:46] (03PS1) 10Urbanecm: User impact: sort datestring keys to ascending alphanumeric order [extensions/GrowthExperiments] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979690 (https://phabricator.wikimedia.org/T352349) [10:39:12] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 63927 [10:39:19] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'email' for AS: 23856 [10:39:44] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'email' for AS: 23856 [10:40:08] (03PS1) 10Muehlenhoff: Remove ganeti RAPI dummy certs [labs/private] - 10https://gerrit.wikimedia.org/r/979901 (https://phabricator.wikimedia.org/T350686) [10:42:37] (03PS1) 10Slyngshede: C:prometheus::node_exporter allow CPU flags collection [puppet] - 10https://gerrit.wikimedia.org/r/979902 (https://phabricator.wikimedia.org/T350694) [10:42:59] (03PS1) 10Vgutierrez: lvs::realserver::ipip: Clamp on lo too [puppet] - 10https://gerrit.wikimedia.org/r/979903 (https://phabricator.wikimedia.org/T351069) [10:43:25] 10SRE, 10Infrastructure-Foundations, 10netops: Add 4x10G breakout cable to cr2-esams - https://phabricator.wikimedia.org/T347323 (10ayounsi) 05Open→03Resolved Ports freed up in T347403 [10:44:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - 
https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:44:43] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/805/con" [puppet] - 10https://gerrit.wikimedia.org/r/979903 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:44:46] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/977733 (https://phabricator.wikimedia.org/T351936) (owner: 10Filippo Giunchedi) [10:45:18] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Remove ganeti RAPI dummy certs [labs/private] - 10https://gerrit.wikimedia.org/r/979901 (https://phabricator.wikimedia.org/T350686) (owner: 10Muehlenhoff) [10:46:21] (03PS1) 10Muehlenhoff: Remove obsolete dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/979905 [10:46:28] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312', diff saved to https://phabricator.wikimedia.org/P54081 and previous config saved to /var/cache/conftool/dbconfig/20231204-104628-arnaudb.json [10:46:34] (03PS1) 10JMeybohm: Drop remaining k8s master cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/979906 (https://phabricator.wikimedia.org/T329826) [10:47:01] (03PS2) 10Vgutierrez: lvs::realserver::ipip: Clamp on lo too [puppet] - 10https://gerrit.wikimedia.org/r/979903 (https://phabricator.wikimedia.org/T351069) [10:47:03] (03CR) 10CI reject: [V: 04-1] Drop remaining k8s master cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/979906 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [10:47:55] (03PS2) 10JMeybohm: Drop remaining k8s master cergen certs [puppet] - 10https://gerrit.wikimedia.org/r/979906 (https://phabricator.wikimedia.org/T329826) [10:48:11] !log btullis@cumin1001 START - Cookbook sre.dns.netbox [10:48:18] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): 
https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/806/con" [puppet] - 10https://gerrit.wikimedia.org/r/979903 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [10:48:38] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (NOOP 4): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/807/console" [puppet] - 10https://gerrit.wikimedia.org/r/979906 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [10:49:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:49:51] (03PS1) 10Elukey: ml-services: remove mlstaging ingress settings [deployment-charts] - 10https://gerrit.wikimedia.org/r/979907 [10:50:29] !log btullis@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add service records for the k8s-ingress-dse endpoints - btullis@cumin1001" [10:51:18] !log btullis@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add service records for the k8s-ingress-dse endpoints - btullis@cumin1001" [10:51:18] !log btullis@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [10:52:16] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:53:58] (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/808/con" [puppet] - 10https://gerrit.wikimedia.org/r/979902 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [10:54:39] !log 
jmm@cumin2002 START - Cookbook sre.puppet.migrate-role for role: eventschemas::service [10:56:50] (03CR) 10Elukey: [V: 03+2 C: 03+2] Remove obsolete dummy cert [labs/private] - 10https://gerrit.wikimedia.org/r/979905 (owner: 10Muehlenhoff) [10:57:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [10:57:34] (03CR) 10Elukey: [C: 03+1] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/979906 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [10:58:44] (03PS1) 10Muehlenhoff: Switch eventschems::service to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/979908 (https://phabricator.wikimedia.org/T349619) [11:00:04] Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T1100) [11:00:19] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [11:00:44] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [11:00:49] (03CR) 10Kamila Součková: [C: 03+2] Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/977659 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková) [11:00:59] (03CR) 10Kamila Součková: [C: 03+2] Move mw api servers to kubernetes workers [homer/public] - 10https://gerrit.wikimedia.org/r/977660 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková) [11:01:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1170:3312 (T348183)', diff 
saved to https://phabricator.wikimedia.org/P54082 and previous config saved to /var/cache/conftool/dbconfig/20231204-110134-arnaudb.json [11:01:37] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:01:45] (03Merged) 10jenkins-bot: Move mw api servers to kubernetes workers [homer/public] - 10https://gerrit.wikimedia.org/r/977660 (https://phabricator.wikimedia.org/T351074) (owner: 10Kamila Součková) [11:01:47] (03PS2) 10Kamila Součková: Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/977659 (https://phabricator.wikimedia.org/T351074) [11:01:50] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [11:01:50] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance [11:01:57] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T348183)', diff saved to https://phabricator.wikimedia.org/P54083 and previous config saved to /var/cache/conftool/dbconfig/20231204-110156-arnaudb.json [11:03:46] (03CR) 10Muehlenhoff: [C: 03+2] Switch eventschems::service to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/979908 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [11:04:41] (03CR) 10Filippo Giunchedi: [C: 04-1] "Python part LGTM, the Puppet part won't work as-is" [puppet] - 10https://gerrit.wikimedia.org/r/977696 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [11:05:37] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:05:54] (03PS1) 10Brouberol: Add an entry related to the dse k8s cluster ingress gateway to conftool [puppet] - 10https://gerrit.wikimedia.org/r/979910 (https://phabricator.wikimedia.org/T352639) [11:06:00] 
(03PS1) 10Brouberol: Add the k8s-ingress-dse LVS service to the service list [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) [11:06:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T348183)', diff saved to https://phabricator.wikimedia.org/P54084 and previous config saved to /var/cache/conftool/dbconfig/20231204-110635-arnaudb.json [11:06:51] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [11:07:07] (03PS10) 10Vgutierrez: lvs::realserver::ipip: Check that TCP MSS clamping is working [puppet] - 10https://gerrit.wikimedia.org/r/977696 (https://phabricator.wikimedia.org/T351069) [11:07:11] (03PS2) 10Brouberol: Add the k8s-ingress-dse LVS service to the service list [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) [11:07:28] (03CR) 10Vgutierrez: "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/977696 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [11:08:37] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-role (exit_code=0) for role: eventschemas::service [11:10:05] (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [11:10:19] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:11:03] (03PS2) 10Brouberol: Define a DNS A record for the dse k8s ingress gateway [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) [11:11:05] (03PS2) 10Brouberol: Enable ingress for the spark-history server services via the dse ingress gw [dns] - 
10https://gerrit.wikimedia.org/r/979892 (https://phabricator.wikimedia.org/T352639) [11:11:07] (03PS3) 10Elukey: cert-manager: bump appVersion [deployment-charts] - 10https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933) [11:11:32] 10SRE, 10SRE-tools, 10Infrastructure-Foundations, 10Puppet-Core, and 3 others: Migrate roles to puppet7 - https://phabricator.wikimedia.org/T349619 (10MoritzMuehlenhoff) [11:12:27] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10Xqt) a:05Xqt→03None @Dzahn: > @Xqt Would you like us to keep your real name out of public repos or you don't mind? I propose not to publish my real name if possible. >... [11:13:28] (03CR) 10Filippo Giunchedi: [C: 03+1] openstack,trove: increase api response alert to 3s [alerts] - 10https://gerrit.wikimedia.org/r/979899 (owner: 10David Caro) [11:14:30] (03CR) 10Filippo Giunchedi: [C: 03+1] lvs::realserver::ipip: Check that TCP MSS clamping is working [puppet] - 10https://gerrit.wikimedia.org/r/977696 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [11:15:37] !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw2422.codfw.wmnet with OS bullseye [11:16:24] (03CR) 10Fabfur: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/979903 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [11:17:12] !log kamila@cumin1001 START - Cookbook sre.hosts.reimage for host mw1462.eqiad.wmnet with OS bullseye [11:19:37] PROBLEM - Check systemd state on kubernetes1038 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:20:37] (03PS4) 10Elukey: cert-manager: bump version in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933) [11:21:16] (03PS5) 10Elukey: cert-manager: bump version in staging [deployment-charts] - 
10https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933) [11:21:28] (03CR) 10Elukey: cert-manager: bump version in staging (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey) [11:21:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P54085 and previous config saved to /var/cache/conftool/dbconfig/20231204-112141-arnaudb.json [11:22:33] RECOVERY - Check systemd state on kubernetes1038 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:24:44] (03PS1) 10Dreamy Jazz: Enable read new for event tables migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829) [11:26:28] (03PS2) 10EoghanGaffney: [apt-staging] Add script to pull artifacts from gitlab [puppet] - 10https://gerrit.wikimedia.org/r/979912 [11:28:44] (03PS1) 10Klausman: hiera: clean up more ORES leftovers [labs/private] - 10https://gerrit.wikimedia.org/r/979915 (https://phabricator.wikimedia.org/T347278) [11:29:04] (03CR) 10Btullis: [C: 03+1] "Thanks." [puppet] - 10https://gerrit.wikimedia.org/r/978466 (owner: 10Muehlenhoff) [11:29:06] (03PS6) 10Elukey: cert-manager: bump version in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933) [11:29:31] (03CR) 10Btullis: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/977088 (owner: 10Muehlenhoff) [11:29:57] (03CR) 10Btullis: [C: 03+1] "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/977181 (owner: 10Muehlenhoff) [11:30:18] !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw1462.eqiad.wmnet with reason: host reimage [11:30:40] (03CR) 10Btullis: [C: 03+1] "Nice, thanks." 
[puppet] - 10https://gerrit.wikimedia.org/r/978539 (https://phabricator.wikimedia.org/T346947) (owner: 10Majavah) [11:30:46] (03CR) 10Elukey: [C: 03+1] hiera: clean up more ORES leftovers [labs/private] - 10https://gerrit.wikimedia.org/r/979915 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [11:31:00] (03PS1) 10Klausman: profiles: Remove more ORES leftovers [puppet] - 10https://gerrit.wikimedia.org/r/979916 (https://phabricator.wikimedia.org/T347278) [11:31:30] (03CR) 10Btullis: [C: 03+1] Remove analytics_cluster::hadoop::client role [puppet] - 10https://gerrit.wikimedia.org/r/979338 (owner: 10Muehlenhoff) [11:32:19] (03CR) 10Elukey: profiles: Remove more ORES leftovers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979916 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [11:32:28] (03CR) 10Btullis: "Looks good, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/979333 (https://phabricator.wikimedia.org/T352193) (owner: 10Muehlenhoff) [11:32:33] !log kamila@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mw2422.codfw.wmnet with reason: host reimage [11:32:49] (03CR) 10Clément Goubert: [C: 03+1] deployment_server: add mcrouter service 1 [puppet] - 10https://gerrit.wikimedia.org/r/979339 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:32:59] (03CR) 10Clément Goubert: [C: 03+1] Add namespace for mcrouter service 2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/979340 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:33:34] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw1462.eqiad.wmnet with reason: host reimage [11:34:17] (03CR) 10JMeybohm: [C: 03+1] cert-manager: bump version in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey) [11:34:19] (03CR) 10JMeybohm: [C: 03+2] Drop remaining k8s master cergen certs [puppet] - 
10https://gerrit.wikimedia.org/r/979906 (https://phabricator.wikimedia.org/T329826) (owner: 10JMeybohm) [11:35:22] (03CR) 10Elukey: [C: 03+2] cert-manager: bump version in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/978640 (https://phabricator.wikimedia.org/T351933) (owner: 10Elukey) [11:36:19] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mw2422.codfw.wmnet with reason: host reimage [11:36:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P54086 and previous config saved to /var/cache/conftool/dbconfig/20231204-113648-arnaudb.json [11:37:41] (03CR) 10Btullis: [C: 03+2] Rewrite metrics sent by Airflow [puppet] - 10https://gerrit.wikimedia.org/r/979118 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [11:38:11] (03PS1) 10Marostegui: dbproxy1027: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/979917 (https://phabricator.wikimedia.org/T351864) [11:38:49] (03CR) 10Marostegui: [C: 03+2] dbproxy1027: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/979917 (https://phabricator.wikimedia.org/T351864) (owner: 10Marostegui) [11:39:25] !log marostegui@cumin1001 START - Cookbook sre.hosts.reimage for host dbproxy1027.eqiad.wmnet with OS bookworm [11:39:28] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [11:39:35] (03CR) 10Btullis: [C: 03+2] "I can deploy this whenever it's convenient for you. 
I was wondering whether you need to coordinate it with an airflow-dags deployment of t" [puppet] - 10https://gerrit.wikimedia.org/r/979118 (https://phabricator.wikimedia.org/T349532) (owner: 10Aqu) [11:39:55] (03CR) 10Vgutierrez: [V: 03+1 C: 03+2] lvs::realserver::ipip: Clamp on lo too [puppet] - 10https://gerrit.wikimedia.org/r/979903 (https://phabricator.wikimedia.org/T351069) (owner: 10Vgutierrez) [11:40:01] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [11:40:29] 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T352653 (10ArthurTaylor) As the #WMF-Legal project tag was added to this task, some general information to avoid wrong expectations: Please note that public tasks i... [11:40:49] 10SRE, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T352653 (10ArthurTaylor) [11:41:24] (03CR) 10Btullis: [C: 03+1] "This looks good to me now. It matches the address in https://netbox.wikimedia.org/ipam/ip-addresses/15582/" [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [11:42:02] (03CR) 10Btullis: [C: 03+1] Add an entry related to the dse k8s cluster ingress gateway to conftool [puppet] - 10https://gerrit.wikimedia.org/r/979910 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [11:42:08] !log elukey@deploy2002 helmfile [staging-codfw] START helmfile.d/admin 'sync'. [11:42:21] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 44592 [11:42:40] !log elukey@deploy2002 helmfile [staging-codfw] DONE helmfile.d/admin 'sync'. [11:42:58] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 44592 [11:43:18] !log elukey@deploy2002 helmfile [staging-eqiad] START helmfile.d/admin 'sync'. 
[11:43:58] !log elukey@deploy2002 helmfile [staging-eqiad] DONE helmfile.d/admin 'sync'. [11:44:09] (03PS4) 10KartikMistry: Update cxserver to 2023-12-04-083437-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/977983 (https://phabricator.wikimedia.org/T344982) [11:45:15] (03CR) 10Slyngshede: [V: 03+1 C: 03+2] P:url_downloader add blackbox exporter. [puppet] - 10https://gerrit.wikimedia.org/r/973780 (https://phabricator.wikimedia.org/T350694) (owner: 10Slyngshede) [11:47:08] (03CR) 10Btullis: [C: 03+1] "This also looks good to me, but I would also recommend a second opinion from someone else who knows the service catalog well." [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [11:47:42] (03PS1) 10Jcrespo: Implement batch deletion, restoration and query of files [software/mediabackups] - 10https://gerrit.wikimedia.org/r/979919 (https://phabricator.wikimedia.org/T352655) [11:48:16] (03CR) 10CI reject: [V: 04-1] Implement batch deletion, restoration and query of files [software/mediabackups] - 10https://gerrit.wikimedia.org/r/979919 (https://phabricator.wikimedia.org/T352655) (owner: 10Jcrespo) [11:51:08] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw1462.eqiad.wmnet with OS bullseye [11:51:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T348183)', diff saved to https://phabricator.wikimedia.org/P54087 and previous config saved to /var/cache/conftool/dbconfig/20231204-115154-arnaudb.json [11:51:57] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance [11:51:59] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [11:52:10] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: 
Maintenance [11:52:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T348183)', diff saved to https://phabricator.wikimedia.org/P54088 and previous config saved to /var/cache/conftool/dbconfig/20231204-115217-arnaudb.json [11:52:27] 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the production Swift cluster - https://phabricator.wikimedia.org/T350020 (10jcrespo) [11:52:42] 10SRE, 10SRE-Access-Requests, 10Structured-Data-Backlog, 10UploadWizard: Access request to deleted image files in the production Swift cluster - https://phabricator.wikimedia.org/T350020 (10jcrespo) Updating title to reflect current request. [11:53:15] (03CR) 10Brouberol: "Claime, as you added the k8s-ingress-aux service, could I ask for a review for something similar? Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [11:53:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on dbproxy1027.eqiad.wmnet with reason: host reimage [11:54:38] (03CR) 10Clément Goubert: [C: 03+1] mcrouter: add vanila chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/979107 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [11:54:46] !log kamila@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mw2422.codfw.wmnet with OS bullseye [11:54:56] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T348183)', diff saved to https://phabricator.wikimedia.org/P54089 and previous config saved to /var/cache/conftool/dbconfig/20231204-115455-arnaudb.json [11:56:50] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on dbproxy1027.eqiad.wmnet with reason: host reimage [12:00:55] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host druid1011.eqiad.wmnet [12:01:15] (03CR) 10Jelto: "It seems this change introduced 
some problems with Puppet runs on contint (role::ci) hosts. They fail with" [puppet] - 10https://gerrit.wikimedia.org/r/977687 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [12:01:40] (03PS1) 10Ladsgroup: Bump ParserCache TTL back to 30 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604) [12:01:59] (03PS1) 10Muehlenhoff: Switch druid1011 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/979921 (https://phabricator.wikimedia.org/T349619) [12:04:14] jouncebot: nowandnext [12:04:14] No deployments scheduled for the next 1 hour(s) and 55 minute(s) [12:04:14] In 1 hour(s) and 55 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T1400) [12:04:31] (03CR) 10Urbanecm: [C: 03+2] User impact: sort datestring keys to ascending alphanumeric order [extensions/GrowthExperiments] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979690 (https://phabricator.wikimedia.org/T352349) (owner: 10Urbanecm) [12:05:00] (03CR) 10Muehlenhoff: [C: 03+2] Switch druid1011 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/979921 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:07:40] (03PS2) 10Klausman: profiles: Remove more ORES leftovers [puppet] - 10https://gerrit.wikimedia.org/r/979916 (https://phabricator.wikimedia.org/T347278) [12:08:07] (03CR) 10Klausman: "I want to wait for tavvi's answer regarding the generated file before submitting this." 
[puppet] - 10https://gerrit.wikimedia.org/r/979916 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [12:08:35] (03CR) 10Tacsipacsi: Bump ParserCache TTL back to 30 days (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604) (owner: 10Ladsgroup) [12:09:58] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host druid1011.eqiad.wmnet [12:09:58] (03PS1) 10Marostegui: Revert "dbproxy1027: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/979691 [12:10:02] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P54090 and previous config saved to /var/cache/conftool/dbconfig/20231204-121002-arnaudb.json [12:11:16] (03CR) 10Clément Goubert: "The overall code and logic LGTM, but some of the changes should in my opinion be spun off into new module versions." [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [12:11:30] (03CR) 10Marostegui: [C: 03+2] Revert "dbproxy1027: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/979691 (owner: 10Marostegui) [12:12:46] (03CR) 10Btullis: [C: 03+2] Deploy kube-state-metrics to the dse-k8s cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/978504 (https://phabricator.wikimedia.org/T264625) (owner: 10Btullis) [12:13:02] (03CR) 10Kosta Harlan: [C: 03+1] Enable read new for event tables migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz) [12:15:16] (03Merged) 10jenkins-bot: Deploy kube-state-metrics to the dse-k8s cluster [deployment-charts] - 10https://gerrit.wikimedia.org/r/978504 (https://phabricator.wikimedia.org/T264625) (owner: 10Btullis) [12:15:56] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) 
for host dbproxy1027.eqiad.wmnet with OS bookworm [12:18:17] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [12:19:44] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host an-druid1005.eqiad.wmnet [12:19:47] !log btullis@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. [12:21:31] (03PS1) 10Muehlenhoff: Switch an-druid1005 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/979924 (https://phabricator.wikimedia.org/T349619) [12:22:40] (03CR) 10Muehlenhoff: [C: 03+2] Switch an-druid1005 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/979924 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff) [12:25:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P54091 and previous config saved to /var/cache/conftool/dbconfig/20231204-122508-arnaudb.json [12:25:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [extensions/GrowthExperiments] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979690 (https://phabricator.wikimedia.org/T352349) (owner: 10Urbanecm) [12:25:31] (03Merged) 10jenkins-bot: User impact: sort datestring keys to ascending alphanumeric order [extensions/GrowthExperiments] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979690 (https://phabricator.wikimedia.org/T352349) (owner: 10Urbanecm) [12:25:44] !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:979690|User impact: sort datestring keys to ascending alphanumeric order (T352349 T351898)]] [12:25:49] T352349: Impact Module: Views on articles you've edited graph - https://phabricator.wikimedia.org/T352349 [12:25:49] T351898: Reduce size of growthexperiments_user_impact.geui_data json blobs - https://phabricator.wikimedia.org/T351898 [12:25:52] (03CR) 10CI reject: [V: 04-1] Localisation updates from https://translatewiki.net. 
[phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/979926 (owner: 10L10n-bot) [12:27:47] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host an-druid1005.eqiad.wmnet [12:28:19] !log urbanecm@deploy2002 urbanecm: Backport for [[gerrit:979690|User impact: sort datestring keys to ascending alphanumeric order (T352349 T351898)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [12:28:23] (03PS1) 10Elukey: admin_ng: deploy kube-state-metrics on all ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/979930 (https://phabricator.wikimedia.org/T264625) [12:29:11] !log urbanecm@deploy2002 urbanecm: Continuing with sync [12:29:47] (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/979465 [12:33:39] (03CR) 10Volans: [C: 04-1] "Missing the reverse PTR for both eqiad and codfw (as a commented line reserved for)" [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [12:35:21] (03CR) 10Clément Goubert: [C: 04-1] Add the k8s-ingress-dse LVS service to the service list (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [12:35:28] !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:979690|User impact: sort datestring keys to ascending alphanumeric order (T352349 T351898)]] (duration: 09m 43s) [12:35:36] T352349: Impact Module: Views on articles you've edited graph - https://phabricator.wikimedia.org/T352349 [12:35:36] T351898: Reduce size of growthexperiments_user_impact.geui_data json blobs - https://phabricator.wikimedia.org/T351898 [12:40:15] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T348183)', diff saved to https://phabricator.wikimedia.org/P54092 and previous config saved to /var/cache/conftool/dbconfig/20231204-124015-arnaudb.json 
[12:40:17] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [12:40:20] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [12:40:31] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance [12:40:38] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T348183)', diff saved to https://phabricator.wikimedia.org/P54093 and previous config saved to /var/cache/conftool/dbconfig/20231204-124037-arnaudb.json [12:43:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T348183)', diff saved to https://phabricator.wikimedia.org/P54094 and previous config saved to /var/cache/conftool/dbconfig/20231204-124316-arnaudb.json [12:44:15] (03CR) 10Muehlenhoff: [C: 03+2] firewall: Remove special case handling for flerovium [puppet] - 10https://gerrit.wikimedia.org/r/979333 (https://phabricator.wikimedia.org/T352193) (owner: 10Muehlenhoff) [12:47:38] (03PS3) 10Brouberol: Define a DNS A record for the dse k8s ingress gateway [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) [12:47:40] (03PS3) 10Brouberol: Enable ingress for the spark-history server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/979892 (https://phabricator.wikimedia.org/T352639) [12:48:54] RECOVERY - Check systemd state on build2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:49:24] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [12:49:28] (03PS7) 10MdsShakil: Create new namespaces and namespace aliases for bd.wikimedia.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977196 (https://phabricator.wikimedia.org/T351903) [12:49:53] (03CR) 10Muehlenhoff: [C: 03+2] Remove analytics_cluster::hadoop::client role [puppet] - 10https://gerrit.wikimedia.org/r/979338 (owner: 10Muehlenhoff) [12:51:35] (03CR) 10Muehlenhoff: [C: 03+2] archiva: Update outdated comments [puppet] - 10https://gerrit.wikimedia.org/r/978466 (owner: 10Muehlenhoff) [12:52:14] 10SRE-tools, 10Dumps-Generation, 10Infrastructure-Foundations, 10serviceops, and 2 others: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (10akosiaris) @Volans All of these (which can be grouped in 2 just 2 categores, **mw** and **mc**, have be... [12:52:59] (03CR) 10Muehlenhoff: [C: 03+2] statistics::web: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/977088 (owner: 10Muehlenhoff) [12:53:35] (03CR) 10Clément Goubert: [C: 04-1] Add the k8s-ingress-dse LVS service to the service list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [12:54:15] (03CR) 10Brouberol: Add the k8s-ingress-dse LVS service to the service list (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [12:54:27] (03PS3) 10Brouberol: Add the k8s-ingress-dse LVS service to the service list [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) [12:55:01] (03CR) 10CI reject: [V: 04-1] Add the k8s-ingress-dse LVS service to the service list [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [12:55:44] (03PS4) 10Brouberol: Add the k8s-ingress-dse LVS service to the 
service list [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) [12:56:20] 10SRE-tools, 10Dumps-Generation, 10Infrastructure-Foundations, 10serviceops, and 2 others: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (10Volans) @akosiaris sure, and having a cluster deemed as *not* IPv6 ready is totally ok. The problem arise... [12:56:42] (03CR) 10Brouberol: Define a DNS A record for the dse k8s ingress gateway (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [12:57:56] (03PS4) 10Brouberol: Enable ingress for the spark-history server services via the dse ingress gw [dns] - 10https://gerrit.wikimedia.org/r/979892 (https://phabricator.wikimedia.org/T352639) [12:58:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P54095 and previous config saved to /var/cache/conftool/dbconfig/20231204-125823-arnaudb.json [12:59:58] (03PS1) 10Hnowlan: jobqueue: reduce ThumbnailRender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/979942 (https://phabricator.wikimedia.org/T337649) [13:00:29] (03CR) 10Muehlenhoff: [C: 03+2] analytics::postgresql: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/977181 (owner: 10Muehlenhoff) [13:03:12] (03CR) 10Alexandros Kosiaris: [C: 03+1] jobqueue: reduce ThumbnailRender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/979942 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [13:03:25] (03CR) 10Hnowlan: [C: 03+2] jobqueue: reduce ThumbnailRender concurrency [deployment-charts] - 10https://gerrit.wikimedia.org/r/979942 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [13:04:28] (03Merged) 10jenkins-bot: jobqueue: reduce ThumbnailRender concurrency [deployment-charts] - 
10https://gerrit.wikimedia.org/r/979942 (https://phabricator.wikimedia.org/T337649) (owner: 10Hnowlan) [13:04:40] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [13:04:51] (03CR) 10Clément Goubert: [C: 03+1] Add the k8s-ingress-dse LVS service to the service list [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [13:05:01] (03PS1) 10Filippo Giunchedi: hieradata: update kubernetes::clusters in CI [puppet] - 10https://gerrit.wikimedia.org/r/979943 (https://phabricator.wikimedia.org/T351179) [13:05:04] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [13:05:32] (03CR) 10CI reject: [V: 04-1] hieradata: update kubernetes::clusters in CI [puppet] - 10https://gerrit.wikimedia.org/r/979943 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [13:05:46] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply [13:05:48] (03CR) 10Filippo Giunchedi: [C: 03+2] k8s: allow setting prometheus retention in cluster definition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/977687 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [13:06:08] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply [13:06:21] hej, just bubbling T352628 and T352659 up — `Wikimedia\Rdbms\DBQueryError` but maybe an issue with the jobqueue? 
[13:06:22] T352628: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 [13:06:22] T352659: [13f3f15c-98c2-4126-8e87-6d6d81706e13] 2023-12-04 12:39:58: Fatal exception of type "Wikimedia\Rdbms\DBQueryError" - https://phabricator.wikimedia.org/T352659 [13:06:31] (03PS2) 10Filippo Giunchedi: hieradata: update kubernetes::clusters in CI [puppet] - 10https://gerrit.wikimedia.org/r/979943 (https://phabricator.wikimedia.org/T351179) [13:06:47] hnowlan: sorry for the ping, seeing that you're touching jobqueue at the moment? [13:07:55] (03CR) 10Filippo Giunchedi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979943 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [13:08:30] ^ T352663 [13:08:30] T352663: JobQueueError: Could not enqueue jobs - https://phabricator.wikimedia.org/T352663 [13:09:45] (03PS2) 10Jcrespo: Implement batch deletion, restoration and query of files [software/mediabackups] - 10https://gerrit.wikimedia.org/r/979919 (https://phabricator.wikimedia.org/T352655) [13:10:58] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me. 
I" [puppet] - 10https://gerrit.wikimedia.org/r/979912 (owner: 10EoghanGaffney) [13:10:59] TheresNoTime: good shout, it's not related to that change but it is most likely related to something I've been doing recently [13:11:15] (03CR) 10Filippo Giunchedi: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/811/console" [puppet] - 10https://gerrit.wikimedia.org/r/977733 (https://phabricator.wikimedia.org/T351936) (owner: 10Filippo Giunchedi) [13:12:07] (03CR) 10Filippo Giunchedi: "PCC failed though only in 'prod' which is expected" [puppet] - 10https://gerrit.wikimedia.org/r/979943 (https://phabricator.wikimedia.org/T351179) (owner: 10Filippo Giunchedi) [13:13:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P54096 and previous config saved to /var/cache/conftool/dbconfig/20231204-131329-arnaudb.json [13:14:34] 10SRE, 10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T352653 (10Aklapper) Hi and welcome! Unrelated: Could you please also [connect your WMDE SUL account on mediawiki.org](https://phabricator.wikimedia.org/settings/panel/external/) to your Phab account?... 
[13:15:01] (03PS10) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) [13:15:42] (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [13:16:22] PROBLEM - Check systemd state on ml-serve1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:16:36] PROBLEM - Check systemd state on kubernetes1027 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:17:03] (03PS11) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) [13:17:17] (03PS1) 10Arnaudb: mariadb: add db2194 to multiinstance pool [puppet] - 10https://gerrit.wikimedia.org/r/979946 (https://phabricator.wikimedia.org/T343674) [13:17:49] (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [13:19:07] (03PS3) 10Jcrespo: Implement batch deletion, restoration and query of files [software/mediabackups] - 10https://gerrit.wikimedia.org/r/979919 (https://phabricator.wikimedia.org/T352655) [13:20:00] (03PS12) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) [13:20:42] TheresNoTime: still looking but it seems unlikely to be related to my work - we're migrating jobs to the k8s jobrunners, but that job hasn't been touched yet 
[13:20:55] (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [13:22:13] (03CR) 10David Caro: [C: 03+2] openstack,trove: increase api response alert to 3s [alerts] - 10https://gerrit.wikimedia.org/r/979899 (owner: 10David Caro) [13:22:28] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply [13:22:43] (03PS2) 10Brouberol: Add an entry related to the dse k8s cluster ingress gateway to conftool [puppet] - 10https://gerrit.wikimedia.org/r/979910 (https://phabricator.wikimedia.org/T352639) [13:22:45] (03PS5) 10Brouberol: Add the k8s-ingress-dse LVS service to the service list [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) [13:22:51] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply [13:22:58] !log installing libde265 security updates [13:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:23:00] hnowlan: hm, ack — thanks for looking.. the `JobQueueError`s do seem to have died down a little (started at around 12:45 UTC and finished(?) at 13:16 UTC, does that match anything changing that you know of?) 
[13:23:17] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [13:23:30] (03CR) 10Brouberol: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/979910 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [13:24:54] PROBLEM - Check whether ferm is active by checking the default input chain on kubernetes1027 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:25:14] (03CR) 10Filippo Giunchedi: [V: 03+1 C: 03+2] prometheus: re-introduce distro-specific node-exporter arguments [puppet] - 10https://gerrit.wikimedia.org/r/977733 (https://phabricator.wikimedia.org/T351936) (owner: 10Filippo Giunchedi) [13:25:44] RECOVERY - Check systemd state on kubernetes1027 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:25:59] TheresNoTime: no, the only jobs that would be changing were related to thumbor and my deploy started/finished within that window but not in any way that aligned :( [13:26:36] that error looks pretty clearly pointing to the queries being run which the jobrunner migration wouldn't affect at all [13:26:59] (03PS6) 10Brouberol: Add the k8s-ingress-dse LVS service to the service list [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) [13:27:32] PROBLEM - Check whether ferm is active by checking the default input chain on ml-serve1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:27:53] (03PS13) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) [13:27:55] 10SRE, 
10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T352653 (10ArthurTaylor) Done! [13:28:32] (03CR) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (034 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [13:28:34] (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [13:28:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T348183)', diff saved to https://phabricator.wikimedia.org/P54097 and previous config saved to /var/cache/conftool/dbconfig/20231204-132836-arnaudb.json [13:28:38] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance [13:28:41] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [13:28:48] (03CR) 10Brouberol: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/812/console" [puppet] - 10https://gerrit.wikimedia.org/r/979911 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [13:28:53] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance [13:28:59] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db1222 (T348183)', diff saved to https://phabricator.wikimedia.org/P54098 and previous config saved to /var/cache/conftool/dbconfig/20231204-132859-arnaudb.json [13:30:07] (03CR) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 
(https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [13:30:51] (03CR) 10JMeybohm: mcrouter: add chart (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [13:30:53] !log installing dbus security updates on buster [13:30:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:33:23] (03CR) 10JMeybohm: "Do we plan to just run one mcrouter deployment per cluster? If not, mcrouter is a too generic name IMHO. Does it maybe make sense to run a" [deployment-charts] - 10https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [13:33:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T348183)', diff saved to https://phabricator.wikimedia.org/P54099 and previous config saved to /var/cache/conftool/dbconfig/20231204-133328-arnaudb.json [13:34:16] (03CR) 10Marostegui: "Just a brief comment here" [puppet] - 10https://gerrit.wikimedia.org/r/979390 (https://phabricator.wikimedia.org/T207253) (owner: 10Ladsgroup) [13:35:39] (03PS14) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) [13:36:26] (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [13:36:56] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: exclude timer units from systemd collector [puppet] - 10https://gerrit.wikimedia.org/r/977734 (https://phabricator.wikimedia.org/T351936) (owner: 10Filippo Giunchedi) [13:37:03] (03PS3) 10Filippo Giunchedi: prometheus: exclude timer units from systemd collector [puppet] - 10https://gerrit.wikimedia.org/r/977734 (https://phabricator.wikimedia.org/T351936) [13:37:54] 10SRE, 
10LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T352653 (10WMDE-leszek) I support this request from WMDE's side. [13:38:06] (03CR) 10Marostegui: mariadb: add db2194 to multiinstance pool (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979946 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [13:39:13] (03PS2) 10Arnaudb: mariadb: add db2194 to multiinstance pool [puppet] - 10https://gerrit.wikimedia.org/r/979946 (https://phabricator.wikimedia.org/T343674) [13:39:31] (03CR) 10Arnaudb: "this has been fixed!" [puppet] - 10https://gerrit.wikimedia.org/r/979946 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [13:39:33] (03CR) 10Marostegui: [C: 03+1] mariadb: add db2194 to multiinstance pool [puppet] - 10https://gerrit.wikimedia.org/r/979946 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [13:39:35] (03CR) 10Brouberol: [C: 03+2] Add an entry related to the dse k8s cluster ingress gateway to conftool [puppet] - 10https://gerrit.wikimedia.org/r/979910 (https://phabricator.wikimedia.org/T352639) (owner: 10Brouberol) [13:39:53] (03CR) 10Arnaudb: [C: 03+2] mariadb: add db2194 to multiinstance pool [puppet] - 10https://gerrit.wikimedia.org/r/979946 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [13:40:34] (03CR) 10Jcrespo: [C: 03+1] mariadb: add db2194 to multiinstance pool [puppet] - 10https://gerrit.wikimedia.org/r/979946 (https://phabricator.wikimedia.org/T343674) (owner: 10Arnaudb) [13:42:19] (03PS15) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) [13:42:59] (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [13:43:00] !log btullis@cumin1001 START - 
Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. [13:43:45] (03PS16) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) [13:44:25] (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [13:45:36] (03CR) 10Atieno: [C: 03+1] Bump ParserCache TTL back to 30 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604) (owner: 10Ladsgroup) [13:45:39] (03PS2) 10Ladsgroup: Bump ParserCache TTL back to 30 days [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604) [13:46:21] (03CR) 10Elukey: [C: 03+2] admin_ng: deploy kube-state-metrics on all ml clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/979930 (https://phabricator.wikimedia.org/T264625) (owner: 10Elukey) [13:46:41] 10SRE, 10HyperSwitch, 10Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (10TheresNoTime) [13:46:46] (03CR) 10Ladsgroup: Bump ParserCache TTL back to 30 days (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604) (owner: 10Ladsgroup) [13:46:53] jouncebot: nowandnext [13:46:54] No deployments scheduled for the next 0 hour(s) and 13 minute(s) [13:46:54] In 0 hour(s) and 13 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T1400) [13:48:35] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P54100 and previous config saved to 
/var/cache/conftool/dbconfig/20231204-134835-arnaudb.json [13:52:02] !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'. [13:52:02] (03CR) 10Btullis: [V: 03+1] Bring an-coord1003 into service as a hadoop coordinator (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis) [13:52:09] (03CR) 10Btullis: [V: 03+1 C: 03+2] Bring an-coord1003 into service as a hadoop coordinator [puppet] - 10https://gerrit.wikimedia.org/r/979086 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis) [13:52:25] (03PS17) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) [13:52:28] !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'. [13:53:17] (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [13:55:22] RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1027 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [13:55:24] (03PS18) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) [13:56:04] (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [13:56:32] !log installing postgresql-13 security updates [13:56:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:43] (03PS1) 10Kosta Harlan: MediaModeration: Set MediaModerationDeveloperMode to false 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/979969 [13:57:24] (03CR) 10CI reject: [V: 04-1] MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979969 (owner: 10Kosta Harlan) [13:57:34] !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'. [13:57:49] (03CR) 10Lucas Werkmeister (WMDE): Enable read new for event tables migration on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz) [13:58:19] !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'. [13:59:21] !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'. [13:59:57] !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'. [14:00:05] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC afternoon backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T1400). [14:00:05] James_F, Dreamy_Jazz, and MdsShakil: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:17] o/ [14:00:18] \o [14:00:24] \o/ [14:00:30] Now we've got a complete set. [14:00:32] Hello [14:00:55] Lucas_WMDE: FYI T352628, don't think it's a deploy stopper [14:00:55] T352628: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 [14:01:33] *nods* [14:01:46] James_F: I assume you’ll self-service? 
[14:01:54] * Lucas_WMDE has no idea what to do about that transaction size error unfortunately [14:02:13] (03PS19) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) [14:02:52] (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [14:03:10] (03PS2) 10Dreamy Jazz: Enable read new for event tables migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829) [14:03:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P54101 and previous config saved to /var/cache/conftool/dbconfig/20231204-140341-arnaudb.json [14:03:48] (03CR) 10Dreamy Jazz: Enable read new for event tables migration on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz) [14:04:02] (03PS1) 10Peter Fischer: enable page_rerender for commonswiki, frwiki, itwiki, and wikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979970 [14:04:04] (03PS6) 10Hnowlan: rest-gateway: add device-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/970823 [14:04:25] Lucas_WMDE: Oh, sure. 
[14:04:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979362 (https://phabricator.wikimedia.org/T352532) (owner: 10Jforrester) [14:04:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495) (owner: 10Terasail) [14:05:32] (03Merged) 10jenkins-bot: wikifunctionswiki: Disable thumbnail in Vector search [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979362 (https://phabricator.wikimedia.org/T352532) (owner: 10Jforrester) [14:05:36] (03CR) 10CI reject: [V: 04-1] wikifunctionswiki: Add ability for sysops to manage Functioneer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495) (owner: 10Terasail) [14:06:03] (03CR) 10Dreamy Jazz: Enable read new for event tables migration on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz) [14:06:51] Meh. [14:06:55] (03PS1) 10Arnaudb: homedir: add tmux.conf [puppet] - 10https://gerrit.wikimedia.org/r/979947 (https://phabricator.wikimedia.org/T348183) [14:07:05] (03PS5) 10Jforrester: wikifunctionswiki: Add ability for sysops to manage Functioneer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495) (owner: 10Terasail) [14:07:09] (03CR) 10TrainBranchBot: "Approved by jforrester@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495) (owner: 10Terasail) [14:07:27] Dear CI, please don't flake when I'm deploying, kthxbai. 
[14:07:51] (03Merged) 10jenkins-bot: wikifunctionswiki: Add ability for sysops to manage Functioneer [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979180 (https://phabricator.wikimedia.org/T352495) (owner: 10Terasail) [14:08:07] To be able to test my config change I would need to be given the checkuser group on testwiki. [14:08:08] !log jforrester@deploy2002 Started scap: Backport for [[gerrit:979362|wikifunctionswiki: Disable thumbnail in Vector search (T352532)]], [[gerrit:979180|wikifunctionswiki: Add ability for sysops to manage Functioneer (T352495)]] [14:08:13] T352532: Disable Vector 2022 search thumbnails on Wikifunctions - https://phabricator.wikimedia.org/T352532 [14:08:14] T352495: Add ability for administrators to add and remove functioneer - https://phabricator.wikimedia.org/T352495 [14:08:53] hm, not sure if it’s okay to hand out that group tbh :/ [14:08:56] even temporarily and on testwiki [14:09:00] it’s still real IP addresses… [14:09:07] It's been done before. [14:09:09] but I don’t know the usual process to gain that right [14:09:09] ok [14:09:25] !log jforrester@deploy2002 jforrester and terasail: Backport for [[gerrit:979362|wikifunctionswiki: Disable thumbnail in Vector search (T352532)]], [[gerrit:979180|wikifunctionswiki: Add ability for sysops to manage Functioneer (T352495)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:09:32] do you have a link to some chat archive or log entry where it happened? [14:09:38] Lucas_WMDE: Because CU isn't available on beta cluster it gets more use on testwiki than it should. 
[14:10:03] !log jforrester@deploy2002 jforrester and terasail: Continuing with sync [14:10:43] (03PS20) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) [14:10:50] T337126 confirms NDA, at least [14:10:50] T337126: Log stash access for Dreamy Jazz - https://phabricator.wikimedia.org/T337126 [14:10:52] 10SRE, 10HyperSwitch, 10Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (10Ladsgroup) I don't know how restbase or hyperswitch ended up in critical path of saving edits, that is a rather important issue we need to check.... [14:11:06] See https://test.wikipedia.org/wiki/Special:UserRights/Dreamy_Jazz [14:11:23] (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [14:11:35] ack [14:12:04] 10SRE, 10HyperSwitch, 10RESTBase, 10Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (10TheresNoTime) [14:12:28] would probably be good to have a steward around to give you the right [14:12:32] IIRC createAndPromote.php isn’t logged as well [14:12:33] (03CR) 10Marostegui: [C: 03+1] homedir: add tmux.conf [puppet] - 10https://gerrit.wikimedia.org/r/979947 (https://phabricator.wikimedia.org/T348183) (owner: 10Arnaudb) [14:12:35] (03PS21) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) [14:12:39] (though it would be an option) [14:12:58] (03CR) 10Arnaudb: [C: 03+2] homedir: add tmux.conf [puppet] - 10https://gerrit.wikimedia.org/r/979947 (https://phabricator.wikimedia.org/T348183) (owner: 
10Arnaudb) [14:13:03] Perhaps Urbanecm could? [14:13:30] (listed as being on this window) [14:13:41] (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [14:14:04] Otherwise I'm happy to delay and coordinate with them to make the change. [14:14:14] in a later window. [14:15:45] OK, PHP-restarts are finally finishing, over to Lucas_WMDE, sorry for the slowness of scap. [14:15:50] !log jforrester@deploy2002 Finished scap: Backport for [[gerrit:979362|wikifunctionswiki: Disable thumbnail in Vector search (T352532)]], [[gerrit:979180|wikifunctionswiki: Add ability for sysops to manage Functioneer (T352495)]] (duration: 07m 41s) [14:15:57] T352532: Disable Vector 2022 search thumbnails on Wikifunctions - https://phabricator.wikimedia.org/T352532 [14:15:58] T352495: Add ability for administrators to add and remove functioneer - https://phabricator.wikimedia.org/T352495 [14:15:58] alright [14:16:31] (03PS22) 10D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) [14:16:40] (03CR) 10Hnowlan: [C: 03+2] rest-gateway: add device-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/970823 (owner: 10Hnowlan) [14:16:53] * Lucas_WMDE digs up yubikey [14:17:12] (03CR) 10CI reject: [V: 04-1] ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - 10https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366) (owner: 10D3r1ck01) [14:17:20] Dreamy_Jazz: hey, i saw your slack ping [14:17:26] Thanks. 
[14:17:30] (03Merged) 10jenkins-bot: rest-gateway: add device-analytics [deployment-charts] - 10https://gerrit.wikimedia.org/r/970823 (owner: 10Hnowlan) [14:17:36] The change is https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/979914/ [14:17:44] Dreamy_Jazz: you just need the testwiki cu flag, right? or am i supposed to deploy sth as well? [14:17:47] (03CR) 10Lucas Werkmeister (WMDE): Enable read new for event tables migration on testwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz) [14:17:59] urbanecm: I can deploy, unless you want to :) [14:18:13] but I can’t give out the right [14:18:16] i'd prefer someone else to deploy if possible [14:18:30] happy to do it then [14:18:36] 10SRE, 10HyperSwitch, 10RESTBase, 10Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (10Joe) @Ladsgroup I think the log linked by @TheresNoTime is a typical example of a distributed transaction going wrong: * We start... [14:18:48] Dreamy_Jazz: volunteer / staff acc? [14:18:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T348183)', diff saved to https://phabricator.wikimedia.org/P54102 and previous config saved to /var/cache/conftool/dbconfig/20231204-141848-arnaudb.json [14:18:50] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [14:18:53] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [14:18:58] (03CR) 10AOkoth: [C: 03+1] vrts: Avoid Ferm-specific syntax [puppet] - 10https://gerrit.wikimedia.org/r/979364 (owner: 10Muehlenhoff) [14:19:03] or either is fine? [14:19:04] Volunteer probably best just as I'll have recent actions for that account. 
[14:19:05] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance [14:19:08] Either is fine though. [14:19:31] (03PS3) 10Lucas Werkmeister (WMDE): Enable read new for event tables migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz) [14:19:38] (03CR) 10Tacsipacsi: Bump ParserCache TTL back to 30 days (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604) (owner: 10Ladsgroup) [14:19:42] 10SRE-tools, 10Dumps-Generation, 10Infrastructure-Foundations, 10serviceops, and 2 others: Some Service Operations clusters apparently do not support IPv6 - https://phabricator.wikimedia.org/T271142 (10akosiaris) >>! In T271142#9378778, @Volans wrote: > @akosiaris sure, and having a cluster deemed as *not*... [14:19:46] Dreamy_Jazz: granted for an hour [14:19:49] Thanks! 
[14:19:57] * Lucas_WMDE looks at diffConfig [14:20:43] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz) [14:21:09] (03PS1) 10Btullis: Prevent removal of python2 on hadoop coordinators [puppet] - 10https://gerrit.wikimedia.org/r/979973 (https://phabricator.wikimedia.org/T336045) [14:21:23] (03CR) 10Filippo Giunchedi: [C: 03+2] centralserver: reintroduce tls-remedy for centralserver [puppet] - 10https://gerrit.wikimedia.org/r/979108 (https://phabricator.wikimedia.org/T351710) (owner: 10Filippo Giunchedi) [14:21:25] (03Merged) 10jenkins-bot: Enable read new for event tables migration on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979914 (https://phabricator.wikimedia.org/T341829) (owner: 10Dreamy Jazz) [14:21:35] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:21:41] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:979914|Enable read new for event tables migration on testwiki (T341829)]] [14:21:47] T341829: Enable read new for the event table migration - https://phabricator.wikimedia.org/T341829 [14:21:49] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance [14:21:49] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10MatthewVernon) [14:22:22] (03PS2) 10Btullis: Prevent removal of python2 on hadoop coordinators [puppet] - 10https://gerrit.wikimedia.org/r/979973 (https://phabricator.wikimedia.org/T336045) [14:22:58] !log lucaswerkmeister-wmde@deploy2002 dreamyjazz and lucaswerkmeister-wmde: Backport for [[gerrit:979914|Enable read new for event tables migration on 
testwiki (T341829)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [14:23:09] 10SRE, 10Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (10Ladsgroup) Yup, after looking at logs properly, it's clear. [14:23:18] (03CR) 10Nikerabbit: [V: 03+2] Localisation updates from https://translatewiki.net. [phabricator/translations] (wmf/stable) - 10https://gerrit.wikimedia.org/r/979926 (owner: 10L10n-bot) [14:23:20] Testing now. [14:23:44] ok [14:23:56] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/813/console" [puppet] - 10https://gerrit.wikimedia.org/r/979973 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis) [14:24:15] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [14:24:29] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2097.codfw.wmnet with reason: Maintenance [14:24:30] (03CR) 10Btullis: [V: 03+1 C: 03+2] Prevent removal of python2 on hadoop coordinators [puppet] - 10https://gerrit.wikimedia.org/r/979973 (https://phabricator.wikimedia.org/T336045) (owner: 10Btullis) [14:24:50] Test complete and successful. [14:25:13] Ran a few checks on my own account. [14:25:24] alright [14:25:25] thanks! 
[14:25:26] !log lucaswerkmeister-wmde@deploy2002 dreamyjazz and lucaswerkmeister-wmde: Continuing with sync [14:26:25] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Create new namespaces and namespace aliases for bd.wikimedia.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977196 (https://phabricator.wikimedia.org/T351903) (owner: 10MdsShakil) [14:26:40] MdsShakil: ^ left a suggestion on your change [14:26:51] but otherwise it should be okay to deploy once this backport is done [14:26:59] *config change [14:27:33] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [14:27:42] I think it's not necessary, since already mentioned on current task [14:27:47] Lucas_WMDE [14:27:48] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2104.codfw.wmnet with reason: Maintenance [14:27:55] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2104 (T348183)', diff saved to https://phabricator.wikimedia.org/P54103 and previous config saved to /var/cache/conftool/dbconfig/20231204-142754-arnaudb.json [14:27:58] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [14:27:59] sure, not necessary [14:28:02] but still nice imho :) [14:28:24] if I want to know when the Photowalk namespace was established, it would be nice to have the older task ID there directly [14:28:51] but if you don’t want to add it I can live with that ^^ [14:29:25] Lucas_WMDE you can do it :) [14:29:34] hm, ok ^^ [14:29:37] * Lucas_WMDE downloads the change [14:29:52] (03CR) 10Ladsgroup: Bump ParserCache TTL back to 30 days (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979920 (https://phabricator.wikimedia.org/T280604) (owner: 10Ladsgroup) [14:30:30] (03PS8) 10Lucas Werkmeister (WMDE): Create new namespaces and namespace aliases for bd.wikimedia.org 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/977196 (https://phabricator.wikimedia.org/T351903) (owner: 10MdsShakil) [14:30:39] (03CR) 10Lucas Werkmeister (WMDE): Create new namespaces and namespace aliases for bd.wikimedia.org (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977196 (https://phabricator.wikimedia.org/T351903) (owner: 10MdsShakil) [14:31:26] 10SRE, 10Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (10Ladsgroup) I have to go to a meeting, if someone is willing to reproduce the issue in mwdebug while verbose log (there is an option for it in x-debug) is enabled... [14:32:03] (03CR) 10Klausman: [V: 03+2 C: 03+2] hiera: clean up more ORES leftovers [labs/private] - 10https://gerrit.wikimedia.org/r/979915 (https://phabricator.wikimedia.org/T347278) (owner: 10Klausman) [14:32:08] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. 
[14:32:23] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:979914|Enable read new for event tables migration on testwiki (T341829)]] (duration: 10m 42s)
[14:32:27] T341829: Enable read new for the event table migration - https://phabricator.wikimedia.org/T341829
[14:33:07] (CR) Ssingh: [C: +2] wikimedia.org: add 1Password site verification [dns] - https://gerrit.wikimedia.org/r/979421 (https://phabricator.wikimedia.org/T352579) (owner: Ssingh)
[14:33:28] (CR) TrainBranchBot: [C: +2] "Approved by lucaswerkmeister-wmde@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/977196 (https://phabricator.wikimedia.org/T351903) (owner: MdsShakil)
[14:33:37] !log running authdns-update for T352579
[14:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:33:42] T352579: Update DNS records for 1Password - https://phabricator.wikimedia.org/T352579
[14:34:11] (Merged) jenkins-bot: Create new namespaces and namespace aliases for bd.wikimedia.org [mediawiki-config] - https://gerrit.wikimedia.org/r/977196 (https://phabricator.wikimedia.org/T351903) (owner: MdsShakil)
[14:34:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:34:25] !log lucaswerkmeister-wmde@deploy2002 Started scap: Backport for [[gerrit:977196|Create new namespaces and namespace aliases for bd.wikimedia.org (T351903)]]
[14:34:29] T351903: Create new namespaces and namespace aliases for bd.wikimedia.org - https://phabricator.wikimedia.org/T351903
[14:36:10] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and mdsshakil: Backport for [[gerrit:977196|Create new namespaces and namespace aliases for bd.wikimedia.org (T351903)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug)
[14:36:31] MdsShakil: the change should be live on one of the mwdebug servers, can you test it there?
[14:36:50] Lucas_WMDE yah, testing
[14:36:52] SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Wbm1058) I've gotten this error twice, when trying to make the same simple edit to a page A database query error has occurred. This may indicate a bug in the sof...
[14:37:06] (CR) Jelto: [C: +1] "lgtm, at least that should be what puppet agent is missing on contint hosts" [puppet] - https://gerrit.wikimedia.org/r/979943 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[14:37:35] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4038.ulsfo.wmnet
[14:38:39] (PS1) Muehlenhoff: Switch cp4038 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/979975 (https://phabricator.wikimedia.org/T349619)
[14:39:04] (JobUnavailable) firing: (10) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:39:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:39:18] (CR) Muehlenhoff: [C: +2] Switch cp4038 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/979975 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[14:40:07] Lucas_WMDE looks good to me
[14:40:14] cool, thanks!
[14:40:16] !log lucaswerkmeister-wmde@deploy2002 lucaswerkmeister-wmde and mdsshakil: Continuing with sync
[14:41:47] (PS1) Ssingh: wikimedia.org: remove already verified jamf TXT record [dns] - https://gerrit.wikimedia.org/r/979976 (https://phabricator.wikimedia.org/T349665)
[14:43:07] (CR) Ssingh: [C: +2] wikimedia.org: remove already verified jamf TXT record [dns] - https://gerrit.wikimedia.org/r/979976 (https://phabricator.wikimedia.org/T349665) (owner: Ssingh)
[14:43:30] !log running authdns-update for CR 979976 [revert of T349665]
[14:43:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:43:34] T349665: Update DNS for Jamf account SSO - https://phabricator.wikimedia.org/T349665
[14:43:37] (CR) Klausman: [C: +2] profiles: Remove more ORES leftovers [puppet] - https://gerrit.wikimedia.org/r/979916 (https://phabricator.wikimedia.org/T347278) (owner: Klausman)
[14:44:06] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4038.ulsfo.wmnet
[14:46:14] !log lucaswerkmeister-wmde@deploy2002 Finished scap: Backport for [[gerrit:977196|Create new namespaces and namespace aliases for bd.wikimedia.org (T351903)]] (duration: 11m 48s)
[14:46:18] T351903: Create new namespaces and namespace aliases for bd.wikimedia.org - https://phabricator.wikimedia.org/T351903
[14:46:43] !log UTC afternoon backport+config window done
[14:46:44] !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host cp4046.ulsfo.wmnet
[14:46:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:47:10] Lucas_WMDE namespaceDupes?
[14:47:32] ah
[14:47:33] good point
[14:47:42] (PS2) Hnowlan: jobqueue: switch a medium weight job [deployment-charts] - https://gerrit.wikimedia.org/r/979395 (https://phabricator.wikimedia.org/T349796)
[14:47:44] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1076.eqiad.wmnet with OS bullseye
[14:47:50] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1076.eqiad.wmnet with OS bullseye
[14:47:59] (PS1) Muehlenhoff: Switch cp4046 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/979978 (https://phabricator.wikimedia.org/T349619)
[14:48:05] ah. “Unsafe to run at this time. See: T350443”
[14:48:05] T350443: namespaceDupes.php doesn't have limit on write queries - https://phabricator.wikimedia.org/T350443
[14:48:56] (CR) Muehlenhoff: [C: +2] Switch cp4046 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/979978 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[14:48:58] Task seems resolved
[14:49:20] yeah, which is unfortunate
[14:49:32] given that the revert reenabling the script won’t be deployed for another week
[14:49:34] (no train this week)
[14:50:23] !log jclark@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host ms-be1077
[14:50:24] !log jclark@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host ms-be1077
[14:50:32] would be nice if I could at least dry-run the script
[14:50:37] but it was disabled too forcefully for that
[14:51:03] !log upload tcp-mss-clamper 0.4 to apt.wm.o (bookworm)
[14:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:51:06] PROBLEM - Check systemd state on build2001 is CRITICAL: CRITICAL - degraded: The following units failed: docker-reporter-k8s-images.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:52:08] Lucas_WMDE so we need to wait until it's fully resolved
[14:52:17] (CR) Filippo Giunchedi: [C: +2] hieradata: update kubernetes::clusters in CI [puppet] - https://gerrit.wikimedia.org/r/979943 (https://phabricator.wikimedia.org/T351179) (owner: Filippo Giunchedi)
[14:53:03] I’m trying to see if there’s any way to run the SELECT queries without the script, at least
[14:53:19] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host cp4046.ulsfo.wmnet
[14:54:04] (JobUnavailable) firing: (10) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:54:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:54:24] (PS4) Brouberol: Define a DNS A record for the dse k8s ingress gateway [dns] - https://gerrit.wikimedia.org/r/979891 (https://phabricator.wikimedia.org/T352639)
[14:54:55] hmph, 62 rows
[14:55:37] jelto: we're back re: contint, puppet runs
[14:59:16] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[14:59:29] MdsShakil: I dumped the titles on the task, not much more that can be done at the moment I think
[14:59:38] unless you want to revert the config change
[15:00:19] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:00:45] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk
[15:01:14] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1076']
[15:02:01] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1077']
[15:02:15] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1078']
[15:02:25] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1077']
[15:02:29] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1077']
[15:03:29] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1079']
[15:04:02] Lucas_WMDE I think we can keep the patch and fix the dupes issue later
[15:04:11] alright
[15:06:29] SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (TheresNoTime) Got a verbose log for `[e7bc3819-b052-43a3-a9e2-438ae9d4b38f] 2023-12-04 15:01:09: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"`, on artic...
[15:08:09] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be1078']
[15:08:16] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be1077']
[15:09:24] godog: yes puppet is happy again, thanks!
[15:09:35] sure np
[15:11:23] (CR) David Caro: [C: +2] quarry: use github remote [puppet] - https://gerrit.wikimedia.org/r/965514 (https://phabricator.wikimedia.org/T348748) (owner: Vivian Rook)
[15:12:30] (PS2) Jelto: add wmf-debci image [docker-images/production-images] - https://gerrit.wikimedia.org/r/979355 (https://phabricator.wikimedia.org/T352003)
[15:12:53] (CR) Herron: [C: +2] thanos-query: enable auto-downsampling [puppet] - https://gerrit.wikimedia.org/r/979163 (owner: Herron)
[15:12:53] SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (A455bcd9) I got 3 different error messages multiple times today while editing: - "Server returned error: HTTP 500." - "[XXXX-XXX-XXX-XXX-XXX] Caught excepti...
[15:13:16] (CR) Clément Goubert: [C: +1] jobqueue: switch a medium weight job [deployment-charts] - https://gerrit.wikimedia.org/r/979395 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[15:13:20] (PS5) Brouberol: Enable ingress for the spark-history server services via the dse ingress gw [dns] - https://gerrit.wikimedia.org/r/979892 (https://phabricator.wikimedia.org/T352639)
[15:16:52] (CR) Vgutierrez: [C: +2] lvs::realserver::ipip: Check that TCP MSS clamping is working [puppet] - https://gerrit.wikimedia.org/r/977696 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[15:18:27] (CR) Jelto: [V: +2 C: +2] add wmf-debci image [docker-images/production-images] - https://gerrit.wikimedia.org/r/979355 (https://phabricator.wikimedia.org/T352003) (owner: Jelto)
[15:20:28] (PS1) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[15:20:58] (CR) CI reject: [V: -1] wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[15:20:58] SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (TheresNoTime)
[15:21:08] SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (TheresNoTime)
[15:21:17] (CR) Hnowlan: [C: +2] jobqueue: switch a medium weight job [deployment-charts] - https://gerrit.wikimedia.org/r/979395 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[15:22:06] (Merged) jenkins-bot: jobqueue: switch a medium weight job [deployment-charts] - https://gerrit.wikimedia.org/r/979395 (https://phabricator.wikimedia.org/T349796) (owner: Hnowlan)
[15:22:47] (PS2) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[15:26:02] SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Yann) It happened again `[80340636-5581-4e19-a4ce-a0a6b2a7215e] 2023-12-04 15:23:08: Fatal exception of type "Wikimedia\Rdbms\DBQueryError"`
[15:28:27] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T348183)', diff saved to https://phabricator.wikimedia.org/P54104 and previous config saved to /var/cache/conftool/dbconfig/20231204-152826-arnaudb.json
[15:28:31] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[15:29:13] (PS3) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[15:29:22] (PS4) Jcrespo: Implement batch deletion, restoration and query of files [software/mediabackups] - https://gerrit.wikimedia.org/r/979919 (https://phabricator.wikimedia.org/T352655)
[15:29:25] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[15:30:37] SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Ladsgroup) At least for TVB, I can't reproduce it anymore: https://en.wikipedia.org/w/index.php?title=TVB_(disambiguation)&action=history Can someone give me a re...
[15:32:03] SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Ladsgroup) nvm got it.
[15:32:25] (PS5) Jcrespo: Implement batch deletion, restoration and query of files [software/mediabackups] - https://gerrit.wikimedia.org/r/979919 (https://phabricator.wikimedia.org/T352655)
[15:34:25] SRE, Cloud-VPS, observability, Patch-For-Review, and 2 others: ossl rsyslog errors post-migration - https://phabricator.wikimedia.org/T351710 (fgiunchedi) Current situation: * We have a separate `rsyslog-receiver` unit/instance with only the receiver bits on centrallog hosts * The fleet is runni...
[15:35:18] (PS1) Vgutierrez: hiera: Disable rp_filter on ncredir@eqsin [puppet] - https://gerrit.wikimedia.org/r/979984 (https://phabricator.wikimedia.org/T351069)
[15:35:20] (PS1) Vgutierrez: hiera: Enable IPIP encapsulation on ncredir@eqsin [puppet] - https://gerrit.wikimedia.org/r/979985 (https://phabricator.wikimedia.org/T351069)
[15:36:56] (PS1) Dreamy Jazz: Enable read new on group0 wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/979986 (https://phabricator.wikimedia.org/T341829)
[15:38:25] (PS23) D3r1ck01: ClusterConfig: Rename `isTest()` to `isDebug()` for consistency [mediawiki-config] - https://gerrit.wikimedia.org/r/976252 (https://phabricator.wikimedia.org/T347366)
[15:38:51] (PS6) Jcrespo: Implement batch deletion, restoration and query of files [software/mediabackups] - https://gerrit.wikimedia.org/r/979919 (https://phabricator.wikimedia.org/T352655)
[15:39:14] SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Ladsgroup) ` Expectation (writeQueryTime <= 1) by MediaWiki::main not met (actual: 7.6661319732666) in trx #1701ce9c66: role-primary: SELECT page_latest FROM `pa...
[15:40:18] SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (NightWolf1223) This is also happening on https://en.wikipedia.org/wiki/CDDA with the following error: ` [0458d586-c21c-4c1b-bc95-35edbaabe49d] 2023-12-04 15:33:32...
[15:41:24] (PS4) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[15:43:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P54105 and previous config saved to /var/cache/conftool/dbconfig/20231204-154333-arnaudb.json
[15:45:15] SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Ladsgroup) Hi, We get a log error for each one of these, we see them and I'm investigating. No need to paste them here anymore. Thanks!
[15:45:47] !log hnowlan@deploy2002 helmfile [staging] START helmfile.d/services/changeprop-jobqueue: apply
[15:46:05] !log hnowlan@deploy2002 helmfile [staging] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:46:33] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[15:47:14] (PS5) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[15:47:18] !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/changeprop-jobqueue: apply
[15:47:35] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[15:47:44] !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:48:15] !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/services/changeprop-jobqueue: apply
[15:48:40] !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/services/changeprop-jobqueue: apply
[15:48:52] SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (matmarex) The errors increased sharply around 6:30 UTC today: (searching for `exception.class` `Wikimedia\Rdbms\DBTransactionSizeError`) https://logstash.wikimed...
[15:49:09] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/814/con" [puppet] - https://gerrit.wikimedia.org/r/979984 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[15:49:27] (CR) Vgutierrez: hiera: Disable rp_filter on ncredir@eqsin [puppet] - https://gerrit.wikimedia.org/r/979984 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[15:50:46] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/815/con" [puppet] - https://gerrit.wikimedia.org/r/979985 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[15:50:52] (CR) Vgutierrez: [C: +2] prometheus::sysctl: Support configurable sysctls [puppet] - https://gerrit.wikimedia.org/r/979297 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[15:51:36] SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Ladsgroup) The underlying issue is that locking any row in page table is extremely slow now, this one took 7 seconds: https://logstash.wikimedia.org/app/discover#...
[15:52:54] (PS6) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[15:52:59] (PuppetFailure) firing: Puppet has failed on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[15:53:40] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[15:54:59] (PS7) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[15:55:28] (CR) CI reject: [V: -1] wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[15:55:31] (PS1) Awight: [beta] Enable FileImporter Codex mode on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/979988
[15:56:14] (PS8) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[15:56:56] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1076.eqiad.wmnet with OS bullseye
[15:57:34] SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Ladsgroup) innodb_lock_row_wait on master of s1 has skyrocketed but unlike spacex rockets is not going down: https://grafana.wikimedia.org/d/000000273/mysql?orgId...
[15:57:35] SRE, LDAP-Access-Requests: Grant Access to ldap/wmf for mcastro-wmf - https://phabricator.wikimedia.org/T352684 (Mcastro)
[15:58:40] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104', diff saved to https://phabricator.wikimedia.org/P54107 and previous config saved to /var/cache/conftool/dbconfig/20231204-155840-arnaudb.json
[15:58:46] (CR) Ssingh: [C: +1] hiera: Enable IPIP encapsulation on ncredir@eqsin [puppet] - https://gerrit.wikimedia.org/r/979985 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[16:02:06] SRE, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Ladsgroup) a:Ladsgroup We made a lot of progress.
[16:02:48] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[16:02:59] (PuppetFailure) resolved: Puppet has failed on elastic1107:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[16:03:39] (CR) Fabfur: [C: +1] "Looks coherent with I24cf4fce8ba2f6517dfe343ea2c127cd26195712" [puppet] - https://gerrit.wikimedia.org/r/979985 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[16:03:59] (CR) Fabfur: [C: +1] "Looks coherent with I6720e89360c9026ea26a77601d5f490d347a6cba" [puppet] - https://gerrit.wikimedia.org/r/979984 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[16:04:44] SRE, Infrastructure-Foundations: Hide the client IP address in the SMTP Received header for authenticated relay clients - https://phabricator.wikimedia.org/T317574 (jhathaway) Open→Resolved
[16:04:48] SRE, Infrastructure-Foundations: Update DNS record to allow us to send emails from @wikimedia.org on Qualtrics - https://phabricator.wikimedia.org/T314815 (jhathaway)
[16:05:09] (CR) Vgutierrez: [C: +2] hiera: Disable rp_filter on ncredir@eqsin [puppet] - https://gerrit.wikimedia.org/r/979984 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[16:05:48] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/979990 (https://phabricator.wikimedia.org/T128546)
[16:07:41] (CR) Phuedx: Define the corresponding stream for scroll (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: Kimberly Sarabia)
[16:07:53] (CR) Phuedx: [C: +1] Define the corresponding stream for scroll [mediawiki-config] - https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: Kimberly Sarabia)
[16:11:36] (CR) Svantje Lilienthal: [C: +1] [beta] Enable FileImporter Codex mode on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/979988 (owner: Awight)
[16:12:22] (PS2) Svantje Lilienthal: [beta] Enable FileImporter Codex mode on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/979988 (https://phabricator.wikimedia.org/T348759) (owner: Awight)
[16:13:47] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2104 (T348183)', diff saved to https://phabricator.wikimedia.org/P54108 and previous config saved to /var/cache/conftool/dbconfig/20231204-161346-arnaudb.json
[16:13:49] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance
[16:13:53] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[16:14:02] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2125.codfw.wmnet with reason: Maintenance
[16:14:09] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2125 (T348183)', diff saved to https://phabricator.wikimedia.org/P54109 and previous config saved to /var/cache/conftool/dbconfig/20231204-161408-arnaudb.json
[16:14:46] (Abandoned) Peter Fischer: enable page_rerender for commonswiki, frwiki, itwiki, and wikidatawiki [mediawiki-config] - https://gerrit.wikimedia.org/r/979970 (owner: Peter Fischer)
[16:15:05] SRE-tools, Ganeti, Infrastructure-Foundations: Make Spicerack cookbook to resize ganeti VM - https://phabricator.wikimedia.org/T219454 (MoritzMuehlenhoff) Open→Declined This is a rare operation and basically only requires to run a straight-forward CLI command (followed by running sre.ganeti.r...
[16:15:07] SRE-tools, Ganeti, Infrastructure-Foundations: Cookbooks for Ganeti maintenance tasks - https://phabricator.wikimedia.org/T283319 (MoritzMuehlenhoff)
[16:16:55] (CR) Vgutierrez: [V: +1 C: +2] hiera: Enable IPIP encapsulation on ncredir@eqsin [puppet] - https://gerrit.wikimedia.org/r/979985 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[16:17:02] (PS2) Vgutierrez: hiera: Enable IPIP encapsulation on ncredir@eqsin [puppet] - https://gerrit.wikimedia.org/r/979985 (https://phabricator.wikimedia.org/T351069)
[16:17:27] SRE-tools, Infrastructure-Foundations, homer: Add Homer support to Cookbooks - https://phabricator.wikimedia.org/T265342 (ayounsi) Open→Invalid Hello past me, not needed anymore.
[16:19:45] SRE-tools, DC-Ops, Infrastructure-Foundations: Tracking task for DCOps privileged commands - https://phabricator.wikimedia.org/T233685 (MoritzMuehlenhoff) Open→Resolved a:MoritzMuehlenhoff This was handled in various other tasks.
[16:19:53] (PS1) Vgutierrez: hiera: Enable IPIP on eqsin text|secondary LVS [puppet] - https://gerrit.wikimedia.org/r/979994 (https://phabricator.wikimedia.org/T351069)
[16:20:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T348183)', diff saved to https://phabricator.wikimedia.org/P54110 and previous config saved to /var/cache/conftool/dbconfig/20231204-162005-arnaudb.json
[16:20:12] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183
[16:20:29] SRE-tools, Infrastructure-Foundations, Python3-Porting: Puppet: forbid new Python2 code - https://phabricator.wikimedia.org/T197804 (joanna_borun) Open→Invalid
[16:20:54] SRE, Traffic, Patch-For-Review: Enable IPIP encapsulation for ncredir - https://phabricator.wikimedia.org/T351069 (Vgutierrez)
[16:21:28] (CR) Vgutierrez: [V: +1] "PCC SUCCESS (CORE_DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/" [puppet] - https://gerrit.wikimedia.org/r/979994 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[16:22:08] (CR) Peter Fischer: [C: +1] "The changes to the kafka topic won't be applied, see https://phabricator.wikimedia.org/T351503" [mediawiki-config] - https://gerrit.wikimedia.org/r/979155 (https://phabricator.wikimedia.org/T352335) (owner: Ebernhardson)
[16:22:11] (PS2) Vgutierrez: hiera: Enable IPIP on eqsin text|secondary LVS [puppet] - https://gerrit.wikimedia.org/r/979994 (https://phabricator.wikimedia.org/T351069)
[16:22:28] SRE-tools, Infrastructure-Foundations, Python3-Porting: Python2: track Py2 softwares - https://phabricator.wikimedia.org/T197803 (MoritzMuehlenhoff) Open→Declined Bookworm no longer includes Python 2 at all and in Bullseye Python gets uninstalled unless one sets an explicit Hiera flag to keep...
[16:22:53] SRE-tools, Infrastructure-Foundations, Python3-Porting: Puppet: forbid new Python2 code - https://phabricator.wikimedia.org/T197804 (MoritzMuehlenhoff) Bookworm no longer includes Python 2 at all and in Bullseye Python gets uninstalled unless one sets an explicit Hiera flag to keep it (pybal e.g.), w...
[16:24:44] (PS3) Svantje Lilienthal: [beta] Enable FileImporter Codex mode on the beta cluster [mediawiki-config] - https://gerrit.wikimedia.org/r/979988 (https://phabricator.wikimedia.org/T347453) (owner: Awight)
[16:25:10] (CR) Ssingh: [C: +1] hiera: Enable IPIP on eqsin text|secondary LVS [puppet] - https://gerrit.wikimedia.org/r/979994 (https://phabricator.wikimedia.org/T351069) (owner: Vgutierrez)
[16:29:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:29:58] (PS9) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355)
[16:30:04] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T1630).
[16:30:27] (CR) CI reject: [V: -1] wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[16:34:07] SRE-tools, Infrastructure-Foundations, Orchestrator: Add database host removal from Orchestrator to sre.hosts.decommission cookbook - https://phabricator.wikimedia.org/T287954 (Volans) p:Triage→Low @Marostegui is this request still valid/needed? If we are going to add these steps I would need...
[16:34:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:35:07] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979990 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:35:12] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P54111 and previous config saved to /var/cache/conftool/dbconfig/20231204-163511-arnaudb.json [16:35:20] 10SRE-tools, 10Infrastructure-Foundations: Long timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (10Volans) It seems that the current defaults are generally working fine. @fgiunchedi have you encounter any specific issue in the last ~2y that still requ... [16:35:30] 10SRE-tools, 10Infrastructure-Foundations: Long timeout on debmonitor client with server unreachable/unpingable - https://phabricator.wikimedia.org/T302205 (10Volans) p:05Triage→03Low [16:35:51] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979990 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [16:39:02] (03CR) 10Clare Ming: [C: 03+1] Define the corresponding stream for scroll (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/977785 (https://phabricator.wikimedia.org/T350883) (owner: 10Kimberly Sarabia) [16:39:44] (03PS10) 10Bking: wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) [16:40:13] (03CR) 10CI reject: [V: 04-1] wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [16:41:50] PROBLEM - mailman archives on lists1001 is CRITICAL: 
CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:42:30] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:42:38] (PS2) Dreamy Jazz: MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - https://gerrit.wikimedia.org/r/979969 (owner: Kosta Harlan) [16:43:25] (CR) CI reject: [V: -1] MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - https://gerrit.wikimedia.org/r/979969 (owner: Kosta Harlan) [16:43:56] (CR) Dreamy Jazz: MediaModeration: Set MediaModerationDeveloperMode to false (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/979969 (owner: Kosta Harlan) [16:44:52] !log jdrewniak@deploy2002 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:979990| Bumping portals to master (T128546)]] (duration: 06m 40s) [16:44:55] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:46:37] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 33604 [16:47:50] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 33604 [16:48:20] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [16:49:10] (PS1) Elukey: slo_template: update SLO sliding window [grafana-grizzly] - https://gerrit.wikimedia.org/r/980000 [16:49:25] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory -
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [16:50:19] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P54112 and previous config saved to /var/cache/conftool/dbconfig/20231204-165018-arnaudb.json [16:52:38] !log jdrewniak@deploy2002 Synchronized portals: Wikimedia Portals Update: [[gerrit:979990| Bumping portals to master (T128546)]] (duration: 07m 45s) [16:52:41] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [16:53:52] (PS3) Kosta Harlan: MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - https://gerrit.wikimedia.org/r/979969 [16:54:35] (CR) CI reject: [V: -1] MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - https://gerrit.wikimedia.org/r/979969 (owner: Kosta Harlan) [16:54:50] (PS1) Ilias Sarantopoulos: ml-services: fix rest gateway endpoint creation in article descriptions [deployment-charts] - https://gerrit.wikimedia.org/r/980002 (https://phabricator.wikimedia.org/T351940) [16:54:53] (CR) Kosta Harlan: MediaModeration: Set MediaModerationDeveloperMode to false (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/979969 (owner: Kosta Harlan) [16:55:20] (CR) Herron: [C: +1] "thanks!"
[grafana-grizzly] - https://gerrit.wikimedia.org/r/980000 (owner: Elukey) [16:55:40] (PS11) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) [16:55:55] (PS4) Kosta Harlan: MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - https://gerrit.wikimedia.org/r/979969 [16:56:00] (PS5) Kosta Harlan: MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - https://gerrit.wikimedia.org/r/979969 [16:56:16] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [16:56:17] (PS1) Elukey: slo_definitions: restrict Lift Wing metrics with one extr label [grafana-grizzly] - https://gerrit.wikimedia.org/r/980004 (https://phabricator.wikimedia.org/T351390) [16:56:50] (CR) Elukey: [V: +2 C: +2] slo_template: update SLO sliding window [grafana-grizzly] - https://gerrit.wikimedia.org/r/980000 (owner: Elukey) [16:57:21] (CR) Elukey: [C: +1] ml-services: fix rest gateway endpoint creation in article descriptions [deployment-charts] - https://gerrit.wikimedia.org/r/980002 (https://phabricator.wikimedia.org/T351940) (owner: Ilias Sarantopoulos) [16:58:37] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking) [16:59:27] (PS12) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) [17:01:55] (CR) CI reject: [V: -1] wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking) [17:05:25] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T348183)', diff saved to
https://phabricator.wikimedia.org/P54113 and previous config saved to /var/cache/conftool/dbconfig/20231204-170525-arnaudb.json [17:05:28] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [17:05:30] (CR) Kevin Bazira: [C: +1] ml-services: fix rest gateway endpoint creation in article descriptions (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/980002 (https://phabricator.wikimedia.org/T351940) (owner: Ilias Sarantopoulos) [17:05:34] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [17:05:42] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance [17:05:43] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [17:05:58] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance [17:06:05] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T348183)', diff saved to https://phabricator.wikimedia.org/P54114 and previous config saved to /var/cache/conftool/dbconfig/20231204-170604-arnaudb.json [17:07:24] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Thu 15 Feb 2024 02:11:55 AM GMT +0000.
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:07:32] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.297 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:08:18] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51007 bytes in 0.265 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:08:29] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1077.eqiad.wmnet with OS bullseye [17:08:35] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1077.eqiad.wmnet with OS bullseye [17:09:02] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1078.eqiad.wmnet with OS bullseye [17:09:07] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T348183)', diff saved to https://phabricator.wikimedia.org/P54115 and previous config saved to /var/cache/conftool/dbconfig/20231204-170906-arnaudb.json [17:09:08] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1078.eqiad.wmnet with OS bullseye [17:09:11] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1076'] [17:09:30] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1076'] [17:09:40] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1076'] [17:11:11] (Abandoned) Elukey: slo_definitions: restrict Lift Wing metrics with one extr
label [grafana-grizzly] - https://gerrit.wikimedia.org/r/980004 (https://phabricator.wikimedia.org/T351390) (owner: Elukey) [17:11:43] (PS1) Jforrester: nlwikivoyage: Drop Listings extension [mediawiki-config] - https://gerrit.wikimedia.org/r/980009 (https://phabricator.wikimedia.org/T352696) [17:12:13] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1076'] [17:12:48] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be1076'] [17:13:48] (CR) Dreamy Jazz: [C: +1] MediaModeration: Set MediaModerationDeveloperMode to false [mediawiki-config] - https://gerrit.wikimedia.org/r/979969 (owner: Kosta Harlan) [17:14:01] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be1079'] [17:14:17] (PS1) Ladsgroup: Category: Stop locking thousands of rows [core] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979692 (https://phabricator.wikimedia.org/T352628) [17:15:04] (CR) Ladsgroup: [C: +2] Category: Stop locking thousands of rows [core] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979692 (https://phabricator.wikimedia.org/T352628) (owner: Ladsgroup) [17:15:21] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1079'] [17:15:23] SRE, LDAP-Access-Requests, WMF-NDA-Requests: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (Dzahn) Thanks for responding @Xqt. Yes, it's possible to not publish the real name. We will just use "known to legal" in the realname field in the repo. Thanks for confirmin...
[17:15:39] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1079'] [17:15:44] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1079'] [17:15:52] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1079'] [17:16:05] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1079'] [17:16:08] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['ms-be1079'] [17:18:06] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1079'] [17:18:30] (CR) Ilias Sarantopoulos: ml-services: fix rest gateway endpoint creation in article descriptions (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/980002 (https://phabricator.wikimedia.org/T351940) (owner: Ilias Sarantopoulos) [17:18:33] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be1079'] [17:18:53] !log jclark@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['ms-be1076'] [17:19:11] (CR) TrainBranchBot: [C: +2] "Approved by ladsgroup@deploy2002 using scap backport" [core] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979692 (https://phabricator.wikimedia.org/T352628) (owner: Ladsgroup) [17:19:15] !log jclark@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['ms-be1076'] [17:20:50] jouncebot: nowandnext [17:20:50] No deployments scheduled for the next 0 hour(s) and 39 minute(s) [17:20:50] In 0 hour(s) and 39 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T1800) [17:20:50] In 0 hour(s) and 39 minute(s):
Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T1800) [17:21:05] (PS1) Hnowlan: mw-jobrunner: increase replicas [deployment-charts] - https://gerrit.wikimedia.org/r/980011 (https://phabricator.wikimedia.org/T349796) [17:24:13] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P54116 and previous config saved to /var/cache/conftool/dbconfig/20231204-172413-arnaudb.json [17:25:30] SRE, Patch-For-Review, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (aaron) One thing to also fix here is that things like SELECT FOR UPDATE, SELECT GET_LOCK()...any SELECT really...should be exempted from the... [17:26:09] SRE, Patch-For-Review, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (aaron) One thing to fix here is that SELECT FOR UPDATE should be except from the transaction size check in approvePrimaryChanges(). There is...
[17:26:35] (PS1) Dzahn: admin: add user xqt to ldap_only admins, volunteer NDA [puppet] - https://gerrit.wikimedia.org/r/980013 (https://phabricator.wikimedia.org/T348520) [17:27:11] (CR) CI reject: [V: -1] admin: add user xqt to ldap_only admins, volunteer NDA [puppet] - https://gerrit.wikimedia.org/r/980013 (https://phabricator.wikimedia.org/T348520) (owner: Dzahn) [17:28:06] (PS2) Dzahn: admin: add user xqt to ldap_only admins, volunteer NDA [puppet] - https://gerrit.wikimedia.org/r/980013 (https://phabricator.wikimedia.org/T348520) [17:33:07] (Merged) jenkins-bot: Category: Stop locking thousands of rows [core] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979692 (https://phabricator.wikimedia.org/T352628) (owner: Ladsgroup) [17:33:18] !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:979692|Category: Stop locking thousands of rows (T352628)]] [17:33:21] T352628: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 [17:34:48] !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:979692|Category: Stop locking thousands of rows (T352628)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [17:34:58] (CR) Kevin Bazira: [C: +1] ml-services: fix rest gateway endpoint creation in article descriptions (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/980002 (https://phabricator.wikimedia.org/T351940) (owner: Ilias Sarantopoulos) [17:35:06] SRE, LDAP-Access-Requests, WMF-NDA-Requests, Patch-For-Review: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (Dzahn) a:Dzahn [17:35:33] !log ladsgroup@deploy2002 ladsgroup: Continuing with sync [17:39:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P54117 and previous config saved to /var/cache/conftool/dbconfig/20231204-173919-arnaudb.json
[17:39:36] (CR) Ssingh: [C: +1] "Verified the task linked and with dzahn." [puppet] - https://gerrit.wikimedia.org/r/980013 (https://phabricator.wikimedia.org/T348520) (owner: Dzahn) [17:41:16] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:41:25] !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:979692|Category: Stop locking thousands of rows (T352628)]] (duration: 08m 07s) [17:41:28] T352628: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 [17:46:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [17:46:42] SRE, LDAP-Access-Requests, WMF-NDA-Requests, Patch-For-Review: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (Dzahn) >>! In T348520#9335183, @KFrancis wrote: > Hi all, I was finally granted access to see the signature confirmation page. I can confirm https://p... [17:46:48] SRE, Patch-For-Review, Wikimedia-production-error: Fatal exception of type "Wikimedia\Rdbms\DBTransactionSizeError" - https://phabricator.wikimedia.org/T352628 (Ladsgroup) Right after the patch deployment, contention went to basically zero {F41560311} https://grafana.wikimedia.org/d/000000273/mysql?...
[17:54:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T348183)', diff saved to https://phabricator.wikimedia.org/P54118 and previous config saved to /var/cache/conftool/dbconfig/20231204-175426-arnaudb.json [17:54:28] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [17:54:31] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [17:54:42] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2138.codfw.wmnet with reason: Maintenance [17:54:44] SRE, LDAP-Access-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T352653 (Dzahn) Hi @ArthurTaylor could you please send an email to @KFrancis https://meta.wikimedia.org/wiki/User:KFrancis_(WMF) ofthe Legal department to proceed with the NDA signing? Just so she go...
[17:54:49] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2138:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54119 and previous config saved to /var/cache/conftool/dbconfig/20231204-175448-arnaudb.json [17:55:15] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1078.eqiad.wmnet with OS bullseye [17:55:24] SRE, LDAP-Access-Requests: Request to be added to the ldap/wmde group for ArthurTaylor - https://phabricator.wikimedia.org/T352653 (Dzahn) [17:58:01] (PS1) Brion VIBBER: Always load transcode state from db when opting in to primary db [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979693 (https://phabricator.wikimedia.org/T200939) [17:59:15] (PS2) Brion VIBBER: Always load transcode state from db when opting in to primary db [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979693 (https://phabricator.wikimedia.org/T200939) [17:59:56] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1076.eqiad.wmnet with OS bullseye [18:00:03] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1076.eqiad.wmnet with OS bullseye [18:00:04] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T1800) [18:00:05] ryankemper: It is that lovely time of the day again! You are hereby commanded to deploy Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T1800).
[18:00:48] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54120 and previous config saved to /var/cache/conftool/dbconfig/20231204-180047-arnaudb.json [18:00:57] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [18:01:56] (PS1) Brion VIBBER: Encoding cleanup with remuxing support [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.7) - https://gerrit.wikimedia.org/r/979694 (https://phabricator.wikimedia.org/T68722) [18:02:08] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1077.eqiad.wmnet with OS bullseye [18:04:33] (CR) Effie Mouzeli: mcrouter: add helmfile (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/979363 (https://phabricator.wikimedia.org/T346690) (owner: Effie Mouzeli) [18:06:25] (CR) Volans: [C: -1] "I think I found some smaller issues, please see inline questions/comments" [puppet] - https://gerrit.wikimedia.org/r/972929 (https://phabricator.wikimedia.org/T333615) (owner: Andrea Denisse) [18:09:21] ops-codfw, DC-Ops: Q2:rack/setup/install test R760xd host - https://phabricator.wikimedia.org/T352703 (RobH) [18:15:54] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P54121 and previous config saved to /var/cache/conftool/dbconfig/20231204-181554-arnaudb.json [18:18:30] (CR) Dzahn: [V: -1] "waiting for addition to google doc by legal" [puppet] - https://gerrit.wikimedia.org/r/980013 (https://phabricator.wikimedia.org/T348520) (owner: Dzahn) [18:24:32] (PS1) Dzahn: etherpad: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980018 [18:25:00] (CR) CI reject: [V: -1] etherpad: replace ferm::service with firewall::service [puppet] -
https://gerrit.wikimedia.org/r/980018 (owner: Dzahn) [18:25:58] (PS3) Jforrester: wikifunctions: Drop beta monitoring [puppet] - https://gerrit.wikimedia.org/r/952488 (https://phabricator.wikimedia.org/T321099) [18:27:21] (Abandoned) Jforrester: wikifunctions: Add production alerting alongside beta [puppet] - https://gerrit.wikimedia.org/r/952486 (owner: Jforrester) [18:27:55] (PS13) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) [18:28:11] (PS1) Dzahn: peopleweb: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980020 [18:28:24] (CR) CI reject: [V: -1] wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking) [18:28:53] (PS2) Dzahn: etherpad: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980018 [18:29:26] (PS14) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) [18:29:58] (CR) CI reject: [V: -1] wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking) [18:31:01] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312', diff saved to https://phabricator.wikimedia.org/P54122 and previous config saved to /var/cache/conftool/dbconfig/20231204-183100-arnaudb.json [18:31:06] (PS15) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) [18:31:35] (CR) CI reject: [V: -1] wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking) [18:33:00] (PS16) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983
(https://phabricator.wikimedia.org/T347355) [18:33:30] (CR) CI reject: [V: -1] wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking) [18:37:19] (PS17) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) [18:38:48] (CR) Muehlenhoff: etherpad: replace ferm::service with firewall::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/980018 (owner: Dzahn) [18:39:49] (PS1) Andrew Bogott: Horizon: allow image uploading via horizon for users with glance admin [puppet] - https://gerrit.wikimedia.org/r/980021 (https://phabricator.wikimedia.org/T326818) [18:40:07] (CR) Muehlenhoff: peopleweb: replace ferm::service with firewall::service (3 comments) [puppet] - https://gerrit.wikimedia.org/r/980020 (owner: Dzahn) [18:41:06] (PS1) Dzahn: firewall::service: spelling fixes, add missing parameter comments [puppet] - https://gerrit.wikimedia.org/r/980022 [18:41:54] (PS3) Dzahn: etherpad: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980018 [18:41:56] (CR) Dzahn: etherpad: replace ferm::service with firewall::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/980018 (owner: Dzahn) [18:43:13] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking) [18:43:29] (PS2) Dzahn: peopleweb: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980020 [18:43:43] (CR) Dzahn: peopleweb: replace ferm::service with firewall::service (3 comments) [puppet] - https://gerrit.wikimedia.org/r/980020 (owner: Dzahn) [18:44:00] (CR) CI reject: [V: -1] peopleweb: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980020 (owner: Dzahn) [18:44:49]
(CR) Muehlenhoff: [C: +1] "LGTM (sans aligning issue making CI fail)" [puppet] - https://gerrit.wikimedia.org/r/980020 (owner: Dzahn) [18:45:08] (PS3) Dzahn: peopleweb: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980020 [18:45:25] (CR) Muehlenhoff: [C: +1] etherpad: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980018 (owner: Dzahn) [18:46:08] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2138:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54123 and previous config saved to /var/cache/conftool/dbconfig/20231204-184607-arnaudb.json [18:46:09] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [18:46:11] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [18:46:24] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance [18:46:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T348183)', diff saved to https://phabricator.wikimedia.org/P54124 and previous config saved to /var/cache/conftool/dbconfig/20231204-184630-arnaudb.json [18:46:55] (CR) Dzahn: [C: +2] vrts: Avoid Ferm-specific syntax [puppet] - https://gerrit.wikimedia.org/r/979364 (owner: Muehlenhoff) [18:47:31] (CR) Muehlenhoff: firewall::service: spelling fixes, add missing parameter comments (1 comment) [puppet] - https://gerrit.wikimedia.org/r/980022 (owner: Dzahn) [18:47:40] (PS18) Bking: wdqs: Monitor LDF endpoint [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) [18:50:39] (CR) Bking: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: Bking)
[18:50:53] SRE, ops-eqiad, DBA: Degraded RAID on db1199 - https://phabricator.wikimedia.org/T352238 (VRiley-WMF) This drive have been replaced. Shipping out faulty drive back as per requested. Completed [18:51:14] SRE, ops-eqiad, DBA: Degraded RAID on db1199 - https://phabricator.wikimedia.org/T352238 (VRiley-WMF) Open→Resolved [18:51:52] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1079.eqiad.wmnet with OS bullseye [18:51:54] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1078.eqiad.wmnet with OS bullseye [18:51:56] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1077.eqiad.wmnet with OS bullseye [18:51:58] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1079.eqiad.wmnet with OS bullseye [18:52:00] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1078.eqiad.wmnet with OS bullseye [18:52:02] SRE, SRE-swift-storage, ops-eqiad, DC-Ops, Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1077.eqiad.wmnet with OS bullseye [18:52:40] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1076.eqiad.wmnet with OS bullseye [18:52:54] (CR) Dzahn: [C: +2] "confirmed noop in prod" [puppet] - https://gerrit.wikimedia.org/r/979364 (owner: Muehlenhoff) [18:54:03] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/980020 (owner: Dzahn)
[18:55:07] (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:55:20] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T348183)', diff saved to https://phabricator.wikimedia.org/P54125 and previous config saved to /var/cache/conftool/dbconfig/20231204-185519-arnaudb.json [18:55:24] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [18:58:51] SRE, ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder) [19:00:19] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [19:00:45] (MDRAIDFailedDisk) firing: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [19:04:30] (PS2) Dzahn: firewall::service: spelling fixes, add missing parameter comments [puppet] - https://gerrit.wikimedia.org/r/980022 [19:05:24] (CR) Dzahn: firewall::service: spelling fixes, add missing parameter comments (1 comment) [puppet] - https://gerrit.wikimedia.org/r/980022 (owner: Dzahn) [19:06:14] (CR) Dzahn: [C: +2] etherpad: replace ferm::service with firewall::service [puppet] - https://gerrit.wikimedia.org/r/980018 (owner: Dzahn) [19:08:58] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1077.eqiad.wmnet with OS bullseye [19:09:05] (CR) Dzahn: [C: +2] "noop confirmed" [puppet] -
10https://gerrit.wikimedia.org/r/980018 (owner: 10Dzahn) [19:09:06] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1078.eqiad.wmnet with OS bullseye [19:09:38] (03CR) 10Dzahn: peopleweb: replace ferm::service with firewall::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980020 (owner: 10Dzahn) [19:10:03] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1079.eqiad.wmnet with OS bullseye [19:10:06] (03CR) 10Muehlenhoff: firewall::service: spelling fixes, add missing parameter comments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980022 (owner: 10Dzahn) [19:10:26] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P54126 and previous config saved to /var/cache/conftool/dbconfig/20231204-191026-arnaudb.json [19:18:56] (03PS1) 10Ebernhardson: cirrus updater: Update deployed image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/980024 [19:20:56] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1077.eqiad.wmnet with OS bullseye [19:21:04] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1077.eqiad.wmnet with OS bullseye [19:21:06] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1079.eqiad.wmnet with OS bullseye [19:21:12] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1079.eqiad.wmnet with OS bullseye [19:21:15] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1078.eqiad.wmnet with OS 
bullseye [19:21:21] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1078.eqiad.wmnet with OS bullseye [19:21:23] !log jclark@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1076.eqiad.wmnet with OS bullseye [19:21:29] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jclark@cumin1001 for host ms-be1076.eqiad.wmnet with OS bullseye [19:21:56] (03CR) 10Dzahn: [C: 03+2] peopleweb: replace ferm::service with firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/980020 (owner: 10Dzahn) [19:22:31] (03PS19) 10Ryan Kemper: wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:23:37] (03CR) 10Dzahn: [C: 03+2] "typo "firwall" and didn't replace ferm::service in second example. 
'doh :)" [puppet] - 10https://gerrit.wikimedia.org/r/980020 (owner: 10Dzahn) [19:25:15] (03CR) 10Gehel: "Minor comments inline, otherwise LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:25:33] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P54128 and previous config saved to /var/cache/conftool/dbconfig/20231204-192532-arnaudb.json [19:25:46] (03PS1) 10Dzahn: peoplweb: fix typo after ferm->firewall change [puppet] - 10https://gerrit.wikimedia.org/r/980025 [19:26:13] (03CR) 10Dzahn: [C: 03+2] peoplweb: fix typo after ferm->firewall change [puppet] - 10https://gerrit.wikimedia.org/r/980025 (owner: 10Dzahn) [19:26:25] (03PS2) 10Dzahn: peopleweb: fix typo after ferm->firewall change [puppet] - 10https://gerrit.wikimedia.org/r/980025 [19:31:55] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Update deployed image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/980024 (owner: 10Ebernhardson) [19:32:45] (03Merged) 10jenkins-bot: cirrus updater: Update deployed image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/980024 (owner: 10Ebernhardson) [19:32:47] (03CR) 10Dzahn: [C: 03+2] "ok after follow-up https://gerrit.wikimedia.org/r/c/operations/puppet/+/980025" [puppet] - 10https://gerrit.wikimedia.org/r/980020 (owner: 10Dzahn) [19:35:32] (03PS3) 10Dzahn: firewall::service: spelling fixes, add missing parameter comments [puppet] - 10https://gerrit.wikimedia.org/r/980022 [19:35:35] (03CR) 10Dzahn: firewall::service: spelling fixes, add missing parameter comments (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/980022 (owner: 10Dzahn) [19:37:14] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [19:37:25] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [19:40:39] !log 
arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T348183)', diff saved to https://phabricator.wikimedia.org/P54129 and previous config saved to /var/cache/conftool/dbconfig/20231204-194039-arnaudb.json [19:40:42] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [19:40:44] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [19:40:57] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance [19:41:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54130 and previous config saved to /var/cache/conftool/dbconfig/20231204-194103-arnaudb.json [19:42:37] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1077.eqiad.wmnet with OS bullseye [19:42:44] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1078.eqiad.wmnet with OS bullseye [19:42:45] (03CR) 10Gehel: wdqs: Monitor LDF endpoint (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [19:42:51] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1076.eqiad.wmnet with OS bullseye [19:43:03] !log jclark@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=93) for host ms-be1079.eqiad.wmnet with OS bullseye [19:55:29] (03PS1) 10Bernard Wang: Deploy VectorClientPreferences to beta and pl,fr,ca wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 [19:56:19] (03PS2) 10Bernard Wang: Deploy VectorClientPreferences to pl,fr,ca wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 [19:56:46] (03PS3) 10Bernard Wang: Deploy VectorClientPreferences to 
pl,fr,ca wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) [19:56:50] 10sre-alert-triage, 10Data-Platform-SRE: Alert in need of triage: SmartNotHealthy (instance an-worker1086:9100) - https://phabricator.wikimedia.org/T352168 (10Jclark-ctr) [19:57:07] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace 4TB SATA disk in an-worker1086 - https://phabricator.wikimedia.org/T352529 (10Jclark-ctr) 05Open→03Resolved @BTullis Swapped hdd [20:04:12] (03PS4) 10Bernard Wang: Deploy VectorClientPreferences to beta and pl,fr,ca wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 [20:04:33] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops: hw troubleshooting: SSD failure (/dev/sde) for aqs1013.eqiad.wmnet - https://phabricator.wikimedia.org/T352344 (10Jclark-ctr) 05Open→03Resolved server is out of warranty. Replaced failed drive with one from recently decommissioned servers [20:04:47] (03PS5) 10Bernard Wang: Deploy VectorClientPreferences to beta and pl,fr,ca wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 [20:05:24] (MDRAIDFailedDisk) resolved: MD RAID - Failed disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDFailedDisk [20:14:17] (03PS1) 10Kamila Součková: mw-api-int: increase replicas by 30% [deployment-charts] - 10https://gerrit.wikimedia.org/r/980032 [20:19:00] RECOVERY - Dell PowerEdge RAID Controller on db1199 is OK: communication: 0 OK https://wikitech.wikimedia.org/wiki/PERCCli%23Monitoring [20:19:22] (MDRAIDNotEnoughDisks) firing: (2) MD RAID - insufficient active disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDNotEnoughDisks [20:23:52] 10SRE, 10ops-eqiad: Inbound 
interface errors - https://phabricator.wikimedia.org/T342502 (10phaultfinder) [20:27:22] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54131 and previous config saved to /var/cache/conftool/dbconfig/20231204-202722-arnaudb.json [20:27:29] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [20:36:41] !log pt1979@cumin1001 START - Cookbook sre.hosts.reimage for host ms-be1077.eqiad.wmnet with OS bullseye [20:36:47] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin1001 for host ms-be1077.eqiad.wmnet with OS bullseye [20:39:22] (MDRAIDNotEnoughDisks) resolved: (2) MD RAID - insufficient active disk(s) on aqs1013:9100 - https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook#Hardware_Raid_Information_Gathering - TODO - https://alerts.wikimedia.org/?q=alertname%3DMDRAIDNotEnoughDisks [20:42:29] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P54132 and previous config saved to /var/cache/conftool/dbconfig/20231204-204228-arnaudb.json [20:49:25] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [20:50:25] (03PS1) 10Kamila Součková: Move mw api servers to kubernetes workers [puppet] - 10https://gerrit.wikimedia.org/r/980039 (https://phabricator.wikimedia.org/T351074) [20:50:28] !log 
pt1979@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on ms-be1077.eqiad.wmnet with reason: host reimage [20:53:38] (03PS1) 10Kamila Součková: Move mw api servers to kubernetes workers [homer/public] - 10https://gerrit.wikimedia.org/r/980040 (https://phabricator.wikimedia.org/T351074) [20:53:45] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on ms-be1077.eqiad.wmnet with reason: host reimage [20:57:36] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P54133 and previous config saved to /var/cache/conftool/dbconfig/20231204-205735-arnaudb.json [21:00:04] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T2100). [21:00:04] bvibber and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:14] \o/ whee [21:00:55] \o [21:06:17] !log T351503 Setting partition count to 5: `ryankemper@kafka-main1001:~$ kafka topics --alter --topic eqiad.mediawiki.cirrussearch.page_rerender.v1 --partitions 5` [21:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:32] T351503: Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 [21:09:01] !log T351503 Setting partition count to 5: `ryankemper@kafka-main1001:~$ kafka topics --alter --topic codfw.mediawiki.cirrussearch.page_rerender.v1 --partitions 5` [21:09:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:09:16] I can't deploy this evening, sorry! 
Hopefully someone else will be along shortly [21:10:46] no worries [21:12:42] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T348183)', diff saved to https://phabricator.wikimedia.org/P54134 and previous config saved to /var/cache/conftool/dbconfig/20231204-211241-arnaudb.json [21:12:44] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [21:12:48] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [21:12:59] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance [21:13:06] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T348183)', diff saved to https://phabricator.wikimedia.org/P54135 and previous config saved to /var/cache/conftool/dbconfig/20231204-211305-arnaudb.json [21:13:35] (03PS2) 10Kamila Součková: mobileapps: 45% to mw on k8s [puppet] - 10https://gerrit.wikimedia.org/r/976221 (https://phabricator.wikimedia.org/T350846) (owner: 10Giuseppe Lavagetto) [21:14:05] !log pt1979@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin1001" [21:18:03] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T348183)', diff saved to https://phabricator.wikimedia.org/P54136 and previous config saved to /var/cache/conftool/dbconfig/20231204-211803-arnaudb.json [21:18:10] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [21:18:33] (03PS1) 10Ebernhardson: cirrus updater: Remove kafka start offset [deployment-charts] - 10https://gerrit.wikimedia.org/r/980043 [21:19:07] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance 
cloudelastic1005-cloudelastic-omega-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [21:19:13] !log pt1979@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin1001" [21:19:19] !log pt1979@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host ms-be1077.eqiad.wmnet with OS bullseye [21:19:25] 10SRE, 10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin1001 for host ms-be1077.eqiad.wmnet with OS bullseye completed: - ms-be... [21:22:43] so no deployer this window? :( [21:23:11] hmm, i can probably do it i suppose [21:23:26] (03CR) 10Ebernhardson: [C: 03+2] cirrus updater: Remove kafka start offset [deployment-charts] - 10https://gerrit.wikimedia.org/r/980043 (owner: 10Ebernhardson) [21:23:29] yay [21:23:45] thanks :D [21:24:13] (03Merged) 10jenkins-bot: cirrus updater: Remove kafka start offset [deployment-charts] - 10https://gerrit.wikimedia.org/r/980043 (owner: 10Ebernhardson) [21:24:27] bvibber: can ship your two patches together? 
[21:25:32] they can deploy together yeah [21:25:39] one will only affect backend job queue scripts though :D [21:25:59] (03CR) 10Ebernhardson: [C: 03+2] Always load transcode state from db when opting in to primary db [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979693 (https://phabricator.wikimedia.org/T200939) (owner: 10Brion VIBBER) [21:26:07] woohoo [21:26:07] (03CR) 10Ebernhardson: [C: 03+2] Encoding cleanup with remuxing support [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979694 (https://phabricator.wikimedia.org/T68722) (owner: 10Brion VIBBER) [21:27:26] !log ebernhardson@deploy2002 helmfile [staging] START helmfile.d/services/cirrus-streaming-updater: apply [21:27:44] !log ebernhardson@deploy2002 helmfile [staging] DONE helmfile.d/services/cirrus-streaming-updater: apply [21:28:12] (03CR) 10Eevans: [C: 03+2] restbase: set production role and add config for restbase2028 [puppet] - 10https://gerrit.wikimedia.org/r/979161 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans) [21:32:39] (03CR) 10Effie Mouzeli: [C: 03+1] "Happy to have a go at this" [software/tegola] (wmf/v0.19.x) - 10https://gerrit.wikimedia.org/r/978030 (https://phabricator.wikimedia.org/T347717) (owner: 10Jgiannelos) [21:32:59] (03PS1) 10Jforrester: Drop Listings extension from Wikivoyages where unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980047 (https://phabricator.wikimedia.org/T352719) [21:33:10] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P54137 and previous config saved to /var/cache/conftool/dbconfig/20231204-213309-arnaudb.json [21:38:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ebernhardson@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979155 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [21:39:36] (03Merged) 10jenkins-bot: 
cirrus: Enable event bus bridge on more wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/979155 (https://phabricator.wikimedia.org/T352335) (owner: 10Ebernhardson) [21:39:51] !log ebernhardson@deploy2002 Started scap: Backport for [[gerrit:979155|cirrus: Enable event bus bridge on more wikis (T352335)]] [21:39:55] T352335: Deploy the new Cirrus Updater to update select wikis in cloudelastic - https://phabricator.wikimedia.org/T352335 [21:41:07] !log ebernhardson@deploy2002 ebernhardson: Backport for [[gerrit:979155|cirrus: Enable event bus bridge on more wikis (T352335)]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:42:39] !log ebernhardson@deploy2002 ebernhardson: Continuing with sync [21:44:06] (03Merged) 10jenkins-bot: Always load transcode state from db when opting in to primary db [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979693 (https://phabricator.wikimedia.org/T200939) (owner: 10Brion VIBBER) [21:44:24] (03Merged) 10jenkins-bot: Encoding cleanup with remuxing support [extensions/TimedMediaHandler] (wmf/1.42.0-wmf.7) - 10https://gerrit.wikimedia.org/r/979694 (https://phabricator.wikimedia.org/T68722) (owner: 10Brion VIBBER) [21:45:26] yay [21:46:18] (03PS1) 10Herron: grafana: add dashboard graphite usage exporter [puppet] - 10https://gerrit.wikimedia.org/r/980048 (https://phabricator.wikimedia.org/T350591) [21:47:28] !log T351503 Setting partition count to 5: `ryankemper@kafka-main2001:~$ kafka topics --alter --topic eqiad.mediawiki.cirrussearch.page_rerender.v1 --partitions 5` [21:47:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:33] T351503: Enable mediawiki.cirrussearch.page_rerender.v1 on all public wikis - https://phabricator.wikimedia.org/T351503 [21:47:37] !log T351503 Setting partition count to 5: `ryankemper@kafka-main2001:~$ kafka topics --alter --topic codfw.mediawiki.cirrussearch.page_rerender.v1 --partitions 5` [21:47:40] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P54138 and previous config saved to /var/cache/conftool/dbconfig/20231204-214816-arnaudb.json [21:49:15] !log ebernhardson@deploy2002 Finished scap: Backport for [[gerrit:979155|cirrus: Enable event bus bridge on more wikis (T352335)]] (duration: 09m 23s) [21:49:23] T352335: Deploy the new Cirrus Updater to update select wikis in cloudelastic - https://phabricator.wikimedia.org/T352335 [21:50:08] !log ebernhardson@deploy2002 Started scap: Backport for [[gerrit:979693|Always load transcode state from db when opting in to primary db]] [21:51:09] PROBLEM - Check systemd state on mw2261 is CRITICAL: CRITICAL - degraded: The following units failed: php7.4-fpm_check_restart.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:51:32] !log ebernhardson@deploy2002 ebernhardson and brion: Backport for [[gerrit:979693|Always load transcode state from db when opting in to primary db]] synced to the testservers (https://wikitech.wikimedia.org/wiki/Mwdebug) [21:52:01] bvibber: its loaded onto test servers [21:52:16] testing... [21:52:47] perfect [21:52:52] alright, continuing [21:52:52] looks good ebernhardson :D [21:52:53] !log ebernhardson@deploy2002 ebernhardson and brion: Continuing with sync [21:58:46] !log ebernhardson@deploy2002 Finished scap: Backport for [[gerrit:979693|Always load transcode state from db when opting in to primary db]] (duration: 08m 37s) [22:00:05] Reedy, sbassett, Maryum, and manfredi: OwO what's this, a deployment window?? Weekly Security deployment window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20231204T2200). nyaa~ [22:00:16] decent timing, backport window is now complete [22:00:36] thanks very much ebernhardson ! 
:D [22:01:09] (03PS6) 10Bernard Wang: Deploy VectorClientPreferences to beta and pl,fr,ca,fa wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 [22:01:59] np [22:03:23] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T348183)', diff saved to https://phabricator.wikimedia.org/P54140 and previous config saved to /var/cache/conftool/dbconfig/20231204-220322-arnaudb.json [22:03:25] !log arnaudb@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: Maintenance [22:03:29] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [22:03:39] !log arnaudb@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2189.codfw.wmnet with reason: Maintenance [22:03:46] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Depooling db2189 (T348183)', diff saved to https://phabricator.wikimedia.org/P54141 and previous config saved to /var/cache/conftool/dbconfig/20231204-220345-arnaudb.json [22:04:46] PROBLEM - cassandra-a CQL 10.192.16.237:9042 on restbase2028 is CRITICAL: connect to address 10.192.16.237 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [22:07:06] PROBLEM - cassandra-a SSL 10.192.16.237:7000 on restbase2028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:08:17] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T348183)', diff saved to https://phabricator.wikimedia.org/P54142 and previous config saved to /var/cache/conftool/dbconfig/20231204-220817-arnaudb.json [22:11:56] PROBLEM - cassandra-b CQL 10.192.16.238:9042 on restbase2028 is CRITICAL: connect to address 10.192.16.238 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [22:12:03] 10SRE, 
10SRE-swift-storage, 10ops-eqiad, 10DC-Ops, 10Data-Persistence: Q1:rack/setup/install ms-be refresh - https://phabricator.wikimedia.org/T349840 (10Papaul) [22:14:24] PROBLEM - cassandra-b SSL 10.192.16.238:7000 on restbase2028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:15:01] (03PS1) 10Eevans: restbase: migrate restbase2028 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980049 (https://phabricator.wikimedia.org/T352468) [22:19:16] PROBLEM - cassandra-c CQL 10.192.16.239:9042 on restbase2028 is CRITICAL: connect to address 10.192.16.239 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [22:21:40] PROBLEM - cassandra-c SSL 10.192.16.239:7000 on restbase2028 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Cassandra%23Installing_and_generating_certificates [22:23:24] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P54144 and previous config saved to /var/cache/conftool/dbconfig/20231204-222323-arnaudb.json [22:33:48] PROBLEM - Restbase root url on restbase2028 is CRITICAL: connect to address 10.192.16.64 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [22:38:30] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2189', diff saved to https://phabricator.wikimedia.org/P54145 and previous config saved to /var/cache/conftool/dbconfig/20231204-223830-arnaudb.json [22:52:32] (03CR) 10Ladsgroup: [C: 03+1] Drop Listings extension from Wikivoyages where unused [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980047 (https://phabricator.wikimedia.org/T352719) (owner: 10Jforrester) [22:52:40] (03CR) 10Ladsgroup: [C: 03+1] nlwikivoyage: Drop Listings extension [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/980009 (https://phabricator.wikimedia.org/T352696) (owner: 10Jforrester) [22:53:37] !log arnaudb@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2189 (T348183)', diff saved to https://phabricator.wikimedia.org/P54146 and previous config saved to /var/cache/conftool/dbconfig/20231204-225336-arnaudb.json [22:53:40] T348183: Apply schema change for changing img_size, oi_size, us_size, and fa_size to BIGINT - https://phabricator.wikimedia.org/T348183 [22:59:04] (JobUnavailable) firing: (9) Reduced availability for job liberica in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:59:27] (03PS2) 10EoghanGaffney: [admin] Add user account for xiaoxiao to data.yaml [puppet] - 10https://gerrit.wikimedia.org/r/979389 (https://phabricator.wikimedia.org/T352098) [22:59:29] (03PS1) 10EoghanGaffney: [admin] Add ecarg shell account [puppet] - 10https://gerrit.wikimedia.org/r/980060 (https://phabricator.wikimedia.org/T350918) [23:00:19] (PuppetFailure) firing: Puppet has failed on lists1003:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:00:58] (03CR) 10CI reject: [V: 04-1] [admin] Add ecarg shell account [puppet] - 10https://gerrit.wikimedia.org/r/980060 (https://phabricator.wikimedia.org/T350918) (owner: 10EoghanGaffney) [23:03:19] (03PS2) 10EoghanGaffney: [admin] Add ecarg shell account [puppet] - 10https://gerrit.wikimedia.org/r/980060 (https://phabricator.wikimedia.org/T350918) [23:03:59] (03CR) 10Dzahn: [C: 03+1] restbase: migrate restbase2028 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980049 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans) [23:06:46] (03PS3) 10EoghanGaffney: [admin] Add ecarg shell account [puppet] 
- 10https://gerrit.wikimedia.org/r/980060 (https://phabricator.wikimedia.org/T350918) [23:07:59] (PuppetFailure) firing: Puppet has failed on restbase2028:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure [23:11:22] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:11:32] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:12:20] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:15:38] PROBLEM - BGP status on cr3-ulsfo is CRITICAL: BGP CRITICAL - AS64605/IPv6: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [23:20:16] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:20:26] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:21:14] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:32:18] (03PS1) 10Kimberly Sarabia: Remove readability survey tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) [23:33:23] (03PS20) 10Bking: wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) [23:33:32] (03CR) 10Bking: wdqs: Monitor LDF endpoint (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) (owner: 10Bking) [23:33:37] (03PS21) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [23:37:38] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:37:45] (03PS22) 10Effie Mouzeli: mcrouter: add chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) [23:38:10] PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:38:20] PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:38:49] (03CR) 10Effie Mouzeli: mcrouter: add chart (038 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/961743 (https://phabricator.wikimedia.org/T346690) (owner: 10Effie Mouzeli) [23:38:59] (03CR) 10Jdlrobson: [C: 04-1] "I just wanted to check the approach you are taking is consistent with my understanding:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980063 (https://phabricator.wikimedia.org/T349337) (owner: 10Kimberly Sarabia) [23:39:58] RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [23:40:38] RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:41:10] RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [23:41:29] (03CR) 10Jdlrobson: [C: 04-1] 
Deploy VectorClientPreferences to beta and pl,fr,ca,fa wikis (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (owner: 10Bernard Wang) [23:42:15] (03CR) 10Bking: [C: 03+1] restbase: migrate restbase2028 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980049 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans) [23:43:25] (03PS21) 10Bking: wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) [23:43:43] (03PS7) 10Jdlrobson: Deploy VectorClientPreferences to beta and pl,fr,ca,fa wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/980028 (https://phabricator.wikimedia.org/T351339) (owner: 10Bernard Wang) [23:44:07] (03PS22) 10Bking: wdqs: Monitor LDF endpoint [puppet] - 10https://gerrit.wikimedia.org/r/979983 (https://phabricator.wikimedia.org/T347355) [23:51:31] (03CR) 10Eevans: [C: 03+2] restbase: migrate restbase2028 to puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/980049 (https://phabricator.wikimedia.org/T352468) (owner: 10Eevans) [23:51:46] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:52:51] !log eevans@cumin1001 START - Cookbook sre.puppet.migrate-host for host restbase2028.codfw.wmnet [23:53:06] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.310 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [23:53:27] !log eevans@cumin1001 END (FAIL) - Cookbook sre.puppet.migrate-host (exit_code=99) for host restbase2028.codfw.wmnet [23:54:02] PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [23:55:25] 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Patch-For-Review: Grant access to nda LDAP group to xqt - https://phabricator.wikimedia.org/T348520 (10KFrancis) Done, thanks!