[00:00:23] <wikibugs>	 (PS2) C. Scott Ananian: Turn on DT visual enhancements on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/991039 (https://phabricator.wikimedia.org/T355374)
[00:00:25] <wikibugs>	 (CR) Dzahn: [V: +1 C: +1] "https://puppet-compiler.wmflabs.org/output/991680/1164/miscweb1003.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/991680 (https://phabricator.wikimedia.org/T354658) (owner: Ryan Kemper)
[00:00:49] <wikibugs>	 (CR) Dzahn: [V: +1 C: +1] "this should remove the monitoring alerts over the weekend, and until we add to the cert, thanks" [puppet] - https://gerrit.wikimedia.org/r/991680 (https://phabricator.wikimedia.org/T354658) (owner: Ryan Kemper)
[00:01:15] <wikibugs>	 (CR) Ryan Kemper: [V: +2 C: +2] wdqs graph-split: disable microsite [puppet] - https://gerrit.wikimedia.org/r/991680 (https://phabricator.wikimedia.org/T354658) (owner: Ryan Kemper)
[00:02:23] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2100.codfw.wmnet with OS bullseye
[00:04:09] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1020 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.291 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[00:04:49] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1020 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[00:05:14] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2101.codfw.wmnet with OS bullseye
[00:05:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2098.codfw.wmnet with reason: host reimage
[00:08:55] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2098.codfw.wmnet with reason: host reimage
[00:09:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1020:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[00:11:45] <wikibugs>	 (PS3) C. Scott Ananian: Turn on DT visual enhancements on wikitech [mediawiki-config] - https://gerrit.wikimedia.org/r/991039 (https://phabricator.wikimedia.org/T355374)
[00:12:19] <inflatador>	 !log bking@wdqs1020 depool host to catch up on lag
[00:12:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:12:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1238', diff saved to https://phabricator.wikimedia.org/P54971 and previous config saved to /var/cache/conftool/dbconfig/20240119-001226-ladsgroup.json
[00:13:31] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1020.eqiad.wmnet with reason: needs to catch up from its lag
[00:13:47] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1020.eqiad.wmnet with reason: needs to catch up from its lag
[00:14:06] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2099.codfw.wmnet with reason: host reimage
[00:17:18] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2099.codfw.wmnet with reason: host reimage
[00:17:20] <icinga-wm>	 RECOVERY - WDQS SPARQL on wdqs1019 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 0.186 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[00:17:28] <icinga-wm>	 RECOVERY - Query Service HTTP Port on wdqs1019 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.043 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[00:18:58] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2100.codfw.wmnet with reason: host reimage
[00:21:27] <wikibugs>	 (PS1) Dwisehaupt: Fix deployment diff issue and clean up presentation [puppet] - https://gerrit.wikimedia.org/r/991681 (https://phabricator.wikimedia.org/T343486)
[00:21:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2101.codfw.wmnet with reason: host reimage
[00:22:06] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2100.codfw.wmnet with reason: host reimage
[00:24:58] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1019:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[00:25:02] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2101.codfw.wmnet with reason: host reimage
[00:26:32] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2098.codfw.wmnet with OS bullseye
[00:26:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2096.codfw.wmnet with OS bullseye
[00:27:33] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1238 (T352010)', diff saved to https://phabricator.wikimedia.org/P54972 and previous config saved to /var/cache/conftool/dbconfig/20240119-002733-ladsgroup.json
[00:27:35] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance
[00:27:38] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[00:27:49] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1241.eqiad.wmnet with reason: Maintenance
[00:27:56] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1241 (T352010)', diff saved to https://phabricator.wikimedia.org/P54973 and previous config saved to /var/cache/conftool/dbconfig/20240119-002755-ladsgroup.json
[00:30:57] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2097.codfw.wmnet with OS bullseye
[00:31:07] <wikibugs>	 (Abandoned) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/991449 (owner: TrainBranchBot)
[00:33:45] <jinxer-wm>	 (ProbeDown) resolved: (6) Service miscweb1003:443 has failed probes (http_query_full_experimental_wikidata_org_collab_ip4)  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[00:34:26] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2099.codfw.wmnet with OS bullseye
[00:39:13] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/991461
[00:39:19] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/991461 (owner: TrainBranchBot)
[00:40:34] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2100.codfw.wmnet with OS bullseye
[00:42:13] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2101.codfw.wmnet with OS bullseye
[00:43:08] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2096.codfw.wmnet with reason: host reimage
[00:46:17] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2096.codfw.wmnet with reason: host reimage
[00:47:23] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic2097.codfw.wmnet with reason: host reimage
[00:49:03] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2094.codfw.wmnet with OS bullseye
[00:50:30] <tzatziki>	 !log removing 1 file for legal compliance
[00:50:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:39] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic2097.codfw.wmnet with reason: host reimage
[00:57:32] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2088.codfw.wmnet with OS bullseye
[01:01:43] <wikibugs>	 (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/991461 (owner: TrainBranchBot)
[01:03:49] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2096.codfw.wmnet with OS bullseye
[01:08:02] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic2097.codfw.wmnet with OS bullseye
[01:28:25] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1103.eqiad.wmnet with OS bullseye
[01:42:24] <tzatziki>	 !log removing 3 files for legal compliance
[01:42:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:01:00] <tzatziki>	 !log removing 4 files for legal compliance
[02:01:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:06:03] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1104.eqiad.wmnet with OS bullseye
[02:09:20] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2094.codfw.wmnet with OS bullseye
[02:09:25] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1105.eqiad.wmnet with OS bullseye
[02:12:52] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1106.eqiad.wmnet with OS bullseye
[02:17:40] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2088.codfw.wmnet with OS bullseye
[02:18:24] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2094.codfw.wmnet with OS bullseye
[02:21:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1104.eqiad.wmnet with reason: host reimage
[02:24:29] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1104.eqiad.wmnet with reason: host reimage
[02:24:51] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1105.eqiad.wmnet with reason: host reimage
[02:26:22] <wikibugs>	 (CR) Andrea Denisse: "Thanks for the patch." [puppet] - https://gerrit.wikimedia.org/r/991542 (https://phabricator.wikimedia.org/T352665) (owner: Filippo Giunchedi)
[02:28:06] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1105.eqiad.wmnet with reason: host reimage
[02:28:50] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1106.eqiad.wmnet with reason: host reimage
[02:31:55] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1106.eqiad.wmnet with reason: host reimage
[02:35:19] <wikibugs>	 (PS2) Varnent: Added Diff to approved list of RSS feeds for Foundation Governance Wiki and removed inoperative feed. [mediawiki-config] - https://gerrit.wikimedia.org/r/991100 (https://phabricator.wikimedia.org/T354790)
[02:39:19] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:41:34] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1104.eqiad.wmnet with OS bullseye
[02:45:11] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1105.eqiad.wmnet with OS bullseye
[02:48:42] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1106.eqiad.wmnet with OS bullseye
[02:49:39] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic1103.eqiad.wmnet with OS bullseye
[03:09:19] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[03:38:42] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2094.codfw.wmnet with OS bullseye
[04:24:59] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1019:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[04:32:25] <wikibugs>	 SRE, SRE-swift-storage, Thumbor, Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334 (Midleading) Open→Stalled Thumbor is currently heavily overloaded (T337649). As a result, traffic to thumbor should be reduced as much as possible un...
[04:40:38] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1083 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:40:52] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1083 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:46:02] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1107 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:46:08] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1107 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:49:40] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1083 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:49:52] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1083 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:50:02] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1146 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:50:18] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1146 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[04:50:45] <wikibugs>	 (CR) Andrea Denisse: grafana: Create Grafana sysuser and home directory (1 comment) [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[05:02:12] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1146 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:02:28] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1146 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:10:12] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1107 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[05:10:20] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1107 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:32:45] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T352010)', diff saved to https://phabricator.wikimedia.org/P54974 and previous config saved to /var/cache/conftool/dbconfig/20240119-053244-ladsgroup.json
[05:32:50] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[05:47:52] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P54975 and previous config saved to /var/cache/conftool/dbconfig/20240119-054751-ladsgroup.json
[05:53:12] <icinga-wm>	 PROBLEM - Check systemd state on an-worker1153 is CRITICAL: CRITICAL - degraded: The following units failed: hadoop-yarn-nodemanager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:54:12] <icinga-wm>	 PROBLEM - Hadoop NodeManager on an-worker1153 is CRITICAL: PROCS CRITICAL: 0 processes with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:02:59] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1241', diff saved to https://phabricator.wikimedia.org/P54976 and previous config saved to /var/cache/conftool/dbconfig/20240119-060258-ladsgroup.json
[06:18:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1241 (T352010)', diff saved to https://phabricator.wikimedia.org/P54977 and previous config saved to /var/cache/conftool/dbconfig/20240119-061805-ladsgroup.json
[06:18:07] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: Maintenance
[06:18:11] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[06:18:21] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1242.eqiad.wmnet with reason: Maintenance
[06:18:27] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1242 (T352010)', diff saved to https://phabricator.wikimedia.org/P54978 and previous config saved to /var/cache/conftool/dbconfig/20240119-061827-ladsgroup.json
[06:19:44] <icinga-wm>	 RECOVERY - Hadoop NodeManager on an-worker1153 is OK: PROCS OK: 1 process with command name java, args org.apache.hadoop.yarn.server.nodemanager.NodeManager https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Alerts%23Yarn_Nodemanager_process
[06:20:14] <icinga-wm>	 RECOVERY - Check systemd state on an-worker1153 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:28:41] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[06:28:47] <logmsgbot>	 !log marostegui@cumin1002 END (ERROR) - Cookbook sre.hosts.downtime (exit_code=97) for 8:00:00 on db1158.eqiad.wmnet with reason: Maintenance
[06:30:00] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[06:30:14] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[06:30:18] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:30:20] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1227 (T354336)', diff saved to https://phabricator.wikimedia.org/P54979 and previous config saved to /var/cache/conftool/dbconfig/20240119-063020-marostegui.json
[06:30:24] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[06:31:24] <icinga-wm>	 PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 121, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:37:28] <icinga-wm>	 RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:37:54] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 46, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:38:09] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[06:38:12] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[06:39:06] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[06:39:08] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1227.eqiad.wmnet with reason: Maintenance
[06:57:08] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[06:57:21] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1181.eqiad.wmnet with reason: Maintenance
[06:58:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 10%: T354336', diff saved to https://phabricator.wikimedia.org/P54981 and previous config saved to /var/cache/conftool/dbconfig/20240119-065808-root.json
[06:58:13] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[06:58:47] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance
[06:59:01] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2098.codfw.wmnet with reason: Maintenance
[06:59:18] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance
[06:59:33] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2100.codfw.wmnet with reason: Maintenance
[06:59:50] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2108.codfw.wmnet with reason: Maintenance
[07:00:04] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2108.codfw.wmnet with reason: Maintenance
[07:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240119T0700)
[07:00:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2108 (T354336)', diff saved to https://phabricator.wikimedia.org/P54982 and previous config saved to /var/cache/conftool/dbconfig/20240119-070009-marostegui.json
[07:02:33] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T354336)', diff saved to https://phabricator.wikimedia.org/P54983 and previous config saved to /var/cache/conftool/dbconfig/20240119-070233-marostegui.json
[07:13:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 25%: T354336', diff saved to https://phabricator.wikimedia.org/P54984 and previous config saved to /var/cache/conftool/dbconfig/20240119-071313-root.json
[07:13:18] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[07:17:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P54985 and previous config saved to /var/cache/conftool/dbconfig/20240119-071739-marostegui.json
[07:28:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 50%: T354336', diff saved to https://phabricator.wikimedia.org/P54986 and previous config saved to /var/cache/conftool/dbconfig/20240119-072818-root.json
[07:28:23] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[07:32:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108', diff saved to https://phabricator.wikimedia.org/P54987 and previous config saved to /var/cache/conftool/dbconfig/20240119-073245-marostegui.json
[07:43:03] <wikibugs>	 (CR) Ayounsi: [C: +1] Skip switch interface if no untagged_vlan when finding bgp peers [software/homer/deploy] - https://gerrit.wikimedia.org/r/991619 (https://phabricator.wikimedia.org/T355225) (owner: Cathal Mooney)
[07:43:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 75%: T354336', diff saved to https://phabricator.wikimedia.org/P54988 and previous config saved to /var/cache/conftool/dbconfig/20240119-074323-root.json
[07:43:28] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[07:44:34] <wikibugs>	 SRE, Infrastructure-Foundations, Prod-Kubernetes, netops, and 2 others: Update puppet's topology.kubernetes.io/zone logic to take into account the new setup - https://phabricator.wikimedia.org/T352893 (ayounsi) Nice !!  The v6 one is probably just a fluke, we should investigate it only if it happ...
[07:47:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2108 (T354336)', diff saved to https://phabricator.wikimedia.org/P54989 and previous config saved to /var/cache/conftool/dbconfig/20240119-074752-marostegui.json
[07:47:54] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2120.codfw.wmnet with reason: Maintenance
[07:48:19] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2120.codfw.wmnet with reason: Maintenance
[07:48:25] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2120 (T354336)', diff saved to https://phabricator.wikimedia.org/P54990 and previous config saved to /var/cache/conftool/dbconfig/20240119-074825-marostegui.json
[07:48:30] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[07:51:50] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T354336)', diff saved to https://phabricator.wikimedia.org/P54991 and previous config saved to /var/cache/conftool/dbconfig/20240119-075149-marostegui.json
[07:58:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'db1227 (re)pooling @ 100%: T354336', diff saved to https://phabricator.wikimedia.org/P54992 and previous config saved to /var/cache/conftool/dbconfig/20240119-075828-root.json
[07:58:33] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[08:00:05] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20240119T0800)
[08:05:11] <wikibugs>	 (PS7) Ayounsi: [WIP] Puppet: Routed Ganeti support [puppet] - https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152)
[08:06:56] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P54993 and previous config saved to /var/cache/conftool/dbconfig/20240119-080655-marostegui.json
[08:08:34] <wikibugs>	 (PS1) Giuseppe Lavagetto: admin: temporarily revoke legoktm's ssh key [puppet] - https://gerrit.wikimedia.org/r/991698
[08:09:10] <wikibugs>	 (PS8) Ayounsi: [WIP] Puppet: Routed Ganeti support [puppet] - https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152)
[08:09:12] <wikibugs>	 (PS1) Ayounsi: Bird: move firewall and default neighbor to module [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152)
[08:10:05] <wikibugs>	 (PS1) Marostegui: db1224: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/991700 (https://phabricator.wikimedia.org/T354506)
[08:10:29] <wikibugs>	 (CR) CI reject: [V: -1] Bird: move firewall and default neighbor to module [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[08:11:14] <wikibugs>	 (CR) Marostegui: [C: +2] db1224: Remove package declaration [puppet] - https://gerrit.wikimedia.org/r/991700 (https://phabricator.wikimedia.org/T354506) (owner: Marostegui)
[08:11:26] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host sessionstore1004.eqiad.wmnet
[08:12:45] <wikibugs>	 (PS1) Muehlenhoff: Switch sessionstore1004 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/991702 (https://phabricator.wikimedia.org/T349619)
[08:13:12] <wikibugs>	 (CR) CI reject: [V: -1] [WIP] Puppet: Routed Ganeti support [puppet] - https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[08:14:29] <wikibugs>	 (CR) Ayounsi: "Thanks!" [puppet] - https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[08:16:55] <wikibugs>	 (CR) Muehlenhoff: [C: +2] Switch sessionstore1004 to Puppet 7 [puppet] - https://gerrit.wikimedia.org/r/991702 (https://phabricator.wikimedia.org/T349619) (owner: Muehlenhoff)
[08:19:31] <wikibugs>	 (PS2) Ayounsi: Bird: move firewall and default neighbor to module [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152)
[08:19:32] <wikibugs>	 (PS9) Ayounsi: [WIP] Puppet: Routed Ganeti support [puppet] - https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152)
[08:20:55] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host sessionstore1004.eqiad.wmnet
[08:22:03] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120', diff saved to https://phabricator.wikimedia.org/P54994 and previous config saved to /var/cache/conftool/dbconfig/20240119-082202-marostegui.json
[08:22:33] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host sessionstore1005.eqiad.wmnet
[08:24:19] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[08:24:21] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch sessionstore1005 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991736 (https://phabricator.wikimedia.org/T349619)
[08:24:59] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1019:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[08:26:32] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch sessionstore1005 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991736 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[08:32:26] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host sessionstore1005.eqiad.wmnet
[08:37:09] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2120 (T354336)', diff saved to https://phabricator.wikimedia.org/P54995 and previous config saved to /var/cache/conftool/dbconfig/20240119-083709-marostegui.json
[08:37:11] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance
[08:37:14] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[08:37:24] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2121.codfw.wmnet with reason: Maintenance
[08:37:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2121 (T354336)', diff saved to https://phabricator.wikimedia.org/P54996 and previous config saved to /var/cache/conftool/dbconfig/20240119-083730-marostegui.json
[08:39:30] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host sessionstore1006.eqiad.wmnet
[08:39:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T354336)', diff saved to https://phabricator.wikimedia.org/P54997 and previous config saved to /var/cache/conftool/dbconfig/20240119-083954-marostegui.json
[08:40:57] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch sessionstore1006 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991738 (https://phabricator.wikimedia.org/T349619)
[08:42:07] <wikibugs>	 10SRE, 10serviceops, 10Patch-For-Review: Too many mw versions caused out of disk space on ~30 mw hosts - https://phabricator.wikimedia.org/T355117 (10CodeReviewBot) jnuche merged https://gitlab.wikimedia.org/repos/releng/scap/-/merge_requests/204  prune old inactive branches as first step of staging a train
[08:42:45] <wikibugs>	 (03PS3) 10Ayounsi: Bird: move firewall and default neighbor to module [puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152)
[08:42:47] <wikibugs>	 (03PS10) 10Ayounsi: [WIP] Puppet: Routed Ganeti support [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152)
[08:46:06] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch sessionstore1006 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991738 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[08:47:56] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: 10Ayounsi)
[08:50:41] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host sessionstore1006.eqiad.wmnet
[08:53:14] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host sessionstore2004.codfw.wmnet
[08:55:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P54998 and previous config saved to /var/cache/conftool/dbconfig/20240119-085500-marostegui.json
[08:55:18] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch sessionstore2004 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991739 (https://phabricator.wikimedia.org/T349619)
[08:59:13] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch sessionstore2004 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991739 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[09:03:17] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host sessionstore2004.codfw.wmnet
[09:03:45] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host sessionstore2005.codfw.wmnet
[09:05:04] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch sessionstore2005 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991740 (https://phabricator.wikimedia.org/T349619)
[09:05:38] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/991652 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah)
[09:09:19] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Haven't received the request myself, but looks good to temporarily disable in any case." [puppet] - 10https://gerrit.wikimedia.org/r/991698 (owner: 10Giuseppe Lavagetto)
[09:09:54] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] admin: temporarily revoke legoktm's ssh key [puppet] - 10https://gerrit.wikimedia.org/r/991698 (owner: 10Giuseppe Lavagetto)
[09:10:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121', diff saved to https://phabricator.wikimedia.org/P54999 and previous config saved to /var/cache/conftool/dbconfig/20240119-091007-marostegui.json
[09:10:08] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch sessionstore2005 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991740 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[09:13:53] <wikibugs>	 (03CR) 10David Caro: "Hmmm, I'm not that comfortable changing from skipping to overwriting the creds on every run unconditionally, it's still running on top of " [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah)
[09:14:49] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host sessionstore2005.codfw.wmnet
[09:15:25] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/991654 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah)
[09:15:29] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.puppet.migrate-host for host sessionstore2006.codfw.wmnet
[09:16:08] <wikibugs>	 (03CR) 10Majavah: replica_cnf_api: Do not check for file existence (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991653 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah)
[09:16:46] <wikibugs>	 (03PS1) 10Muehlenhoff: Switch sessionstore2006 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991741 (https://phabricator.wikimedia.org/T349619)
[09:17:03] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] maintain-dbusers: Ignore some account deletion failures [puppet] - 10https://gerrit.wikimedia.org/r/991654 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah)
[09:17:25] <wikibugs>	 (03CR) 10Majavah: [C: 03+2] replica_cnf_api: Reduce code duplication [puppet] - 10https://gerrit.wikimedia.org/r/991652 (https://phabricator.wikimedia.org/T355356) (owner: 10Majavah)
[09:19:28] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Switch sessionstore2006 to Puppet 7 [puppet] - 10https://gerrit.wikimedia.org/r/991741 (https://phabricator.wikimedia.org/T349619) (owner: 10Muehlenhoff)
[09:20:44] <wikibugs>	 (03CR) 10Hashar: [C: 03+2] Update Gerrit to v3.7.6 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/990138 (https://phabricator.wikimedia.org/T354885) (owner: 10Hashar)
[09:21:20] <wikibugs>	 (03Merged) 10jenkins-bot: Update Gerrit to v3.7.6 and rebuild plugins [software/gerrit] (deploy/wmf/stable-3.7) - 10https://gerrit.wikimedia.org/r/990138 (https://phabricator.wikimedia.org/T354885) (owner: 10Hashar)
[09:24:12] <logmsgbot>	 !log jnuche@deploy2002 Installing scap version "4.65.2" for 531 hosts
[09:24:38] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.migrate-host (exit_code=0) for host sessionstore2006.codfw.wmnet
[09:25:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2121 (T354336)', diff saved to https://phabricator.wikimedia.org/P55000 and previous config saved to /var/cache/conftool/dbconfig/20240119-092513-marostegui.json
[09:25:14] <logmsgbot>	 !log jnuche@deploy2002 Installation of scap version "4.65.2" completed for 531 hosts
[09:25:16] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2122.codfw.wmnet with reason: Maintenance
[09:25:19] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[09:25:29] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2122.codfw.wmnet with reason: Maintenance
[09:25:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2122 (T354336)', diff saved to https://phabricator.wikimedia.org/P55001 and previous config saved to /var/cache/conftool/dbconfig/20240119-092535-marostegui.json
[09:26:22] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting server access for MFischer (WMF) - https://phabricator.wikimedia.org/T355395 (10Nahid)
[09:27:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T354336)', diff saved to https://phabricator.wikimedia.org/P55002 and previous config saved to /var/cache/conftool/dbconfig/20240119-092758-marostegui.json
[09:41:09] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting server access for MFischer (WMF) - https://phabricator.wikimedia.org/T355395 (10JanWMF) approved, thank you all :)
[09:43:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P55003 and previous config saved to /var/cache/conftool/dbconfig/20240119-094305-marostegui.json
[09:45:57] <jinxer-wm>	 (ProbeDown) firing: (11) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:46:12] <godog>	 checking
[09:46:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 6.771% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:46:54] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2013 is CRITICAL: PYBAL CRITICAL - CRITICAL - appservers-https_443: Servers mw2313.codfw.wmnet, mw2271.codfw.wmnet, mw2338.codfw.wmnet, mw2409.codfw.wmnet, mw2378.codfw.wmnet, mw2273.codfw.wmnet, mw2415.codfw.wmnet, mw2379.codfw.wmnet, mw2312.codfw.wmnet, mw2375.codfw.wmnet, mw2310.codfw.wmnet, mw2449.codfw.wmnet, mw2413.codfw.wmnet, mw2316.codfw.wmnet, mw2447.codfw.wmnet, mw2325.codfw.wmnet, mw
[09:46:54] <icinga-wm>	 fw.wmnet, mw2386.codfw.wmnet, mw2275.codfw.wmnet, mw2361.codfw.wmnet, mw2369.codfw.wmnet, mw2303.codfw.wmnet, mw2365.codfw.wmnet, mw2406.codfw.wmnet, mw2315.codfw.wmnet, mw2327.codfw.wmnet, mw2433.codfw.wmnet, mw2270.codfw.wmnet, mw2441.codfw.wmnet, mw2339.codfw.wmnet, mw2272.codfw.wmnet, mw2377.codfw.wmnet, mw2385.codfw.wmnet, mw2331.codfw.wmnet, mw2277.codfw.wmnet, mw2384.codfw.wmnet, mw2305.codfw.wmnet, mw2388.codfw.wmnet, mw2337.codfw
[09:46:54] <icinga-wm>	 mw2383.codfw.wmnet, mw2301.codfw.wmnet, mw2336.codfw.wmnet, mw2335.codfw.wmnet, mw2363.codfw.wmnet, mw2432.codfw.wmnet, mw2329.codfw.wmnet, mw2391.codfw.wmnet, mw2387.codfw.wmnet, mw243 https://wikitech.wikimedia.org/wiki/PyBal
[09:47:03] <wikibugs>	 (03PS1) 10Btullis: Revert "varnish: enrich X-Analytics for browser prefetch / prerender / preview" [puppet] - 10https://gerrit.wikimedia.org/r/991563
[09:47:15] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at codfw #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=codfw&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:47:19] * hnowlan here
[09:47:37] <godog>	 ok so appservers in trouble clearly
[09:47:43] <jinxer-wm>	 (VarnishUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[09:47:48] <jinxer-wm>	 (HaproxyUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[09:47:56] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs2014 is CRITICAL: PYBAL CRITICAL - CRITICAL - mw-web_4450: Servers kubernetes2060.codfw.wmnet, kubernetes2056.codfw.wmnet, kubernetes2034.codfw.wmnet, kubernetes2039.codfw.wmnet, kubernetes2012.codfw.wmnet, kubernetes2016.codfw.wmnet, kubernetes2020.codfw.wmnet, kubernetes2024.codfw.wmnet, kubernetes2011.codfw.wmnet, kubernetes2014.codfw.wmnet, kubernetes2048.codfw.wmnet, kubernetes2059.codfw.wm
[09:47:56] <icinga-wm>	 ernetes2010.codfw.wmnet, kubernetes2013.codfw.wmnet, kubernetes2047.codfw.wmnet, kubernetes2036.codfw.wmnet, kubernetes2029.codfw.wmnet, kubernetes2040.codfw.wmnet, kubernetes2019.codfw.wmnet, kubernetes2005.codfw.wmnet, kubernetes2045.codfw.wmnet, kubernetes2057.codfw.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[09:48:15] <jinxer-wm>	 (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[09:48:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "varnish: enrich X-Analytics for browser prefetch / prerender / preview" [puppet] - 10https://gerrit.wikimedia.org/r/991563 (owner: 10Btullis)
[09:48:20] <wikibugs>	 (03PS1) 10Muehlenhoff: Add Arthur Taylor to restricted group [puppet] - 10https://gerrit.wikimedia.org/r/991743 (https://phabricator.wikimedia.org/T354049)
[09:48:24] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2013 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:48:38] <godog>	 checking logs too
[09:49:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw appserver GET/200: 12.899200238415133s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:49:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: codfw api_appserver GET/200: 0.5649918780062225s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyE
[09:49:26] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs2014 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[09:49:56] <godog>	 mmhh looks like we are recovering
[09:50:06] <godog>	 trying to understand what happened
[09:50:23] <wikibugs>	 (03PS2) 10Btullis: Revert "varnish: enrich X-Analytics for browser prefetch / prerender / preview" [puppet] - 10https://gerrit.wikimedia.org/r/991563 (https://phabricator.wikimedia.org/T355391)
[09:50:25] <hnowlan>	 yeah 5xx and response times on the way down 
[09:50:57] <jinxer-wm>	 (ProbeDown) resolved: (13) Service appservers-https:443 has failed probes (http_appservers-https_ip4) #page  - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[09:51:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle PHP-FPM workers for Mediawiki mw-api-ext at codfw: 34.24% idle - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=84&var-dc=codfw%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-ext&var-container_name=All - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:51:29] <jelto>	 wikikube codfw had quite a lot of "context deadline exceeded (Client.Timeout exceeded while awaiting headers)" for around 5 minutes
[09:51:44] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Add Arthur Taylor to restricted group [puppet] - 10https://gerrit.wikimedia.org/r/991743 (https://phabricator.wikimedia.org/T354049) (owner: 10Muehlenhoff)
[09:52:10] <wikibugs>	 (03PS1) 10PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/991462
[09:52:15] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki appserver at codfw #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?from=now-3h&orgId=1&to=now&var-cluster=appserver&var-site=codfw&viewPanel=64 - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[09:52:43] <jinxer-wm>	 (VarnishUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DVarnishUnavailable
[09:52:48] <jinxer-wm>	 (HaproxyUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/HAProxy#HAProxy_for_edge_caching - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyUnavailable
[09:53:15] <jinxer-wm>	 (MediaWikiHighErrorRate) resolved: (3) Elevated rate of MediaWiki errors - appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook  - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[09:54:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw appserver GET/200: 0.5312047506180468s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceede
[09:54:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: codfw api_appserver GET/200: 0.5649918780062225s - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=codfw&var-cluster=api_appserver&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[09:57:13] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] Revert "varnish: enrich X-Analytics for browser prefetch / prerender / preview" [puppet] - 10https://gerrit.wikimedia.org/r/991563 (https://phabricator.wikimedia.org/T355391) (owner: 10Btullis)
[09:58:12] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122', diff saved to https://phabricator.wikimedia.org/P55004 and previous config saved to /var/cache/conftool/dbconfig/20240119-095811-marostegui.json
[10:03:10] <wikibugs>	 (03CR) 10Fabfur: [C: 03+1] "looks good to me!" [puppet] - 10https://gerrit.wikimedia.org/r/991563 (https://phabricator.wikimedia.org/T355391) (owner: 10Btullis)
[10:07:13] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Revert "varnish: enrich X-Analytics for browser prefetch / prerender / preview" [puppet] - 10https://gerrit.wikimedia.org/r/991563 (https://phabricator.wikimedia.org/T355391) (owner: 10Btullis)
[10:07:56] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review, 10User-ItamarWMDE: Requesting access to <restricted> for Arthur Taylor - https://phabricator.wikimedia.org/T354049 (10MoritzMuehlenhoff) 05In progress→03Resolved a:03MoritzMuehlenhoff @ArthurTaylor I've enabled your access. It takes up to 30 minutes u...
[10:13:00] <wikibugs>	 (03Abandoned) 10Jbond: Revert "rsyslog: update to use pki certificates" [puppet] - 10https://gerrit.wikimedia.org/r/961224 (owner: 10Jbond)
[10:13:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2122 (T354336)', diff saved to https://phabricator.wikimedia.org/P55005 and previous config saved to /var/cache/conftool/dbconfig/20240119-101318-marostegui.json
[10:13:20] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance
[10:13:23] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[10:13:34] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2150.codfw.wmnet with reason: Maintenance
[10:13:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2150 (T354336)', diff saved to https://phabricator.wikimedia.org/P55006 and previous config saved to /var/cache/conftool/dbconfig/20240119-101340-marostegui.json
[10:29:52] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting server access for MFischer (WMF) - https://phabricator.wikimedia.org/T355395 (10BTullis) I have reviewed the request and I believe that the most appropriate data access level is that set out here: [[https://wikitech.wikimedia.org/wiki/Analytics/Data_access#ssh_login_to_...
[10:31:44] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting server access for MFischer (WMF) - https://phabricator.wikimedia.org/T355395 (10BTullis)
[10:40:00] <wikibugs>	 (03PS3) 10Muehlenhoff: mediawiki::cgroup: Enanble v1 cgroups on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/991347 (https://phabricator.wikimedia.org/T325228)
[10:40:18] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Skip switch interface if no untagged_vlan when finding bgp peers [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/991619 (https://phabricator.wikimedia.org/T355225) (owner: 10Cathal Mooney)
[10:40:30] <wikibugs>	 (03CR) 10Muehlenhoff: mediawiki::cgroup: Enanble v1 cgroups on bullseye (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/991347 (https://phabricator.wikimedia.org/T325228) (owner: 10Muehlenhoff)
[10:41:30] <wikibugs>	 (03PS1) 10Jelto: miscweb: update design-style-guide to show deprecation notice [deployment-charts] - 10https://gerrit.wikimedia.org/r/991748 (https://phabricator.wikimedia.org/T347754)
[10:42:58] <logmsgbot>	 !log cmooney@cumin1002 START - Cookbook sre.deploy.python-code homer to cumin2002.codfw.wmnet,cumin[1001-1002].eqiad.wmnet with reason: Release v0.6.5 - cmooney@cumin1002
[10:45:24] <logmsgbot>	 !log cmooney@cumin1002 END (PASS) - Cookbook sre.deploy.python-code (exit_code=0) homer to cumin2002.codfw.wmnet,cumin[1001-1002].eqiad.wmnet with reason: Release v0.6.5 - cmooney@cumin1002
[10:56:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T354336)', diff saved to https://phabricator.wikimedia.org/P55007 and previous config saved to /var/cache/conftool/dbconfig/20240119-105621-marostegui.json
[10:56:26] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[10:57:40] <wikibugs>	 (03PS1) 10Muehlenhoff: profile::openstack::codfw1dev::db: Convert ferm::rule into firewall::service [puppet] - 10https://gerrit.wikimedia.org/r/991752
[10:58:13] <wikibugs>	 (03PS4) 10Muehlenhoff: mediawiki::cgroup: Enable v1 cgroups on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/991347 (https://phabricator.wikimedia.org/T325228)
[10:59:05] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T352010)', diff saved to https://phabricator.wikimedia.org/P55008 and previous config saved to /var/cache/conftool/dbconfig/20240119-105904-ladsgroup.json
[10:59:09] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[11:02:17] <wikibugs>	 (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/991752 (owner: 10Muehlenhoff)
[11:10:34] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] miscweb: update design-style-guide to show deprecation notice [deployment-charts] - 10https://gerrit.wikimedia.org/r/991748 (https://phabricator.wikimedia.org/T347754) (owner: 10Jelto)
[11:11:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P55009 and previous config saved to /var/cache/conftool/dbconfig/20240119-111127-marostegui.json
[11:14:12] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P55010 and previous config saved to /var/cache/conftool/dbconfig/20240119-111411-ladsgroup.json
[11:20:38] <wikibugs>	 (03PS1) 10Filippo Giunchedi: o11y: report consumer group for LogstashKafkaConsumerLag [alerts] - 10https://gerrit.wikimedia.org/r/991756
[11:22:27] <wikibugs>	 (03PS3) 10Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665)
[11:25:51] <wikibugs>	 (03CR) 10Andrea Denisse: "Hello, I created the grafana user with the same parameters the postinst script does however, I reserved UID/GID 928 for the grafana sysuse" [puppet] - 10https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: 10Andrea Denisse)
[11:26:34] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150', diff saved to https://phabricator.wikimedia.org/P55011 and previous config saved to /var/cache/conftool/dbconfig/20240119-112634-marostegui.json
[11:27:02] <wikibugs>	 (03CR) 10Andrea Denisse: [C: 03+1] "LGTM, thank you!" [alerts] - 10https://gerrit.wikimedia.org/r/991756 (owner: 10Filippo Giunchedi)
[11:29:18] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1242', diff saved to https://phabricator.wikimedia.org/P55012 and previous config saved to /var/cache/conftool/dbconfig/20240119-112917-ladsgroup.json
[11:41:41] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2150 (T354336)', diff saved to https://phabricator.wikimedia.org/P55013 and previous config saved to /var/cache/conftool/dbconfig/20240119-114140-marostegui.json
[11:41:44] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance
[11:41:46] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[11:41:57] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2159.codfw.wmnet with reason: Maintenance
[11:41:59] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance
[11:42:13] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on db2187.codfw.wmnet with reason: Maintenance
[11:42:19] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2159 (T354336)', diff saved to https://phabricator.wikimedia.org/P55014 and previous config saved to /var/cache/conftool/dbconfig/20240119-114219-marostegui.json
[11:44:25] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1242 (T352010)', diff saved to https://phabricator.wikimedia.org/P55015 and previous config saved to /var/cache/conftool/dbconfig/20240119-114424-ladsgroup.json
[11:44:26] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[11:44:29] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[11:44:40] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1243.eqiad.wmnet with reason: Maintenance
[11:44:42] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T354336)', diff saved to https://phabricator.wikimedia.org/P55016 and previous config saved to /var/cache/conftool/dbconfig/20240119-114442-marostegui.json
[11:44:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1243 (T352010)', diff saved to https://phabricator.wikimedia.org/P55017 and previous config saved to /var/cache/conftool/dbconfig/20240119-114452-ladsgroup.json
[11:55:20] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] o11y: report consumer group for LogstashKafkaConsumerLag [alerts] - 10https://gerrit.wikimedia.org/r/991756 (owner: 10Filippo Giunchedi)
[11:59:49] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P55018 and previous config saved to /var/cache/conftool/dbconfig/20240119-115948-marostegui.json
[12:12:40] <wikibugs>	 (03PS11) 10Ayounsi: Puppet: Routed Ganeti support [puppet] - 10https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152)
[12:14:40] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] miscweb: update design-style-guide to show deprecation notice [deployment-charts] - 10https://gerrit.wikimedia.org/r/991748 (https://phabricator.wikimedia.org/T347754) (owner: 10Jelto)
[12:14:55] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159', diff saved to https://phabricator.wikimedia.org/P55019 and previous config saved to /var/cache/conftool/dbconfig/20240119-121455-marostegui.json
[12:15:46] <wikibugs>	 (Merged) jenkins-bot: miscweb: update design-style-guide to show deprecation notice [deployment-charts] - https://gerrit.wikimedia.org/r/991748 (https://phabricator.wikimedia.org/T347754) (owner: Jelto)
[12:17:31] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "Looks good, one nit inline" [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[12:17:34] <wikibugs>	 (CR) Majavah: [C: +1] profile::openstack::codfw1dev::db: Convert ferm::rule into firewall::service [puppet] - https://gerrit.wikimedia.org/r/991752 (owner: Muehlenhoff)
[12:19:38] <wikibugs>	 (Abandoned) Majavah: P:openstack: expose remaining APIs to the internet [puppet] - https://gerrit.wikimedia.org/r/844457 (https://phabricator.wikimedia.org/T319312) (owner: Majavah)
[12:20:33] <wikibugs>	 (CR) Majavah: [C: +2] P:openstack::cinder: use cloud-private for memcached access [puppet] - https://gerrit.wikimedia.org/r/991592 (owner: Majavah)
[12:22:28] <icinga-wm>	 PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_mw-api-ext_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:25:10] <icinga-wm>	 PROBLEM - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[12:25:13] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1019:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[12:26:58] <wikibugs>	 (CR) Ayounsi: Puppet: Routed Ganeti support (3 comments) [puppet] - https://gerrit.wikimedia.org/r/990968 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[12:27:40] <wikibugs>	 (PS1) Jelto: vrts: test delaying blackbox::check::http [puppet] - https://gerrit.wikimedia.org/r/991765 (https://phabricator.wikimedia.org/T354479)
[12:27:48] <wikibugs>	 SRE, MW-on-K8s, serviceops: Reclaim jobrunner hardware for k8s - https://phabricator.wikimedia.org/T354791 (hnowlan)
[12:28:49] <wikibugs>	 (CR) Ayounsi: "I guess if we want to go one step further/cleaner, we could move:" [puppet] - https://gerrit.wikimedia.org/r/991699 (https://phabricator.wikimedia.org/T300152) (owner: Ayounsi)
[12:29:57] <logmsgbot>	 !log gmodena@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:29:59] <logmsgbot>	 !log gmodena@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:30:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2159 (T354336)', diff saved to https://phabricator.wikimedia.org/P55020 and previous config saved to /var/cache/conftool/dbconfig/20240119-123001-marostegui.json
[12:30:04] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance
[12:30:08] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[12:30:17] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2168.codfw.wmnet with reason: Maintenance
[12:30:23] <wikibugs>	 (CR) Jelto: [V: +1] "PCC SUCCESS (NOOP 1 DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1165/" [puppet] - https://gerrit.wikimedia.org/r/991765 (https://phabricator.wikimedia.org/T354479) (owner: Jelto)
[12:30:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2168:3317 (T354336)', diff saved to https://phabricator.wikimedia.org/P55021 and previous config saved to /var/cache/conftool/dbconfig/20240119-123023-marostegui.json
[12:32:36] <logmsgbot>	 !log gmodena@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[12:32:40] <logmsgbot>	 !log gmodena@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[12:33:48] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T354336)', diff saved to https://phabricator.wikimedia.org/P55022 and previous config saved to /var/cache/conftool/dbconfig/20240119-123347-marostegui.json
[12:36:12] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Migrate IP gateway for public1-a-codfw to spine switches - https://phabricator.wikimedia.org/T351532 (cmooney) p:Medium→Low Going to delay this for now.  We have enough disruptive changes planned not to burden wider SRE with this one in the next few we...
[12:36:16] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Migrate IP gateway for private1-b-codfw to spine switches - https://phabricator.wikimedia.org/T351534 (cmooney) p:Triage→Low Going to delay this for now.  We have enough disruptive changes planned not to burden wider SRE with this one in the next few w...
[12:38:23] <wikibugs>	 (CR) Jelto: [V: +1] "What do you think about trying this new feature on VRTS? I can monitor Prometheus alerts and config and revert after successful test. We c" [puppet] - https://gerrit.wikimedia.org/r/991765 (https://phabricator.wikimedia.org/T354479) (owner: Jelto)
[12:39:27] <wikibugs>	 (PS1) Majavah: P:openstack: glance: use cloud-private for memcached access [puppet] - https://gerrit.wikimedia.org/r/991769 (https://phabricator.wikimedia.org/T355417)
[12:39:29] <wikibugs>	 (PS1) Majavah: P:openstack: placement: use cloud-private for memcached access [puppet] - https://gerrit.wikimedia.org/r/991770 (https://phabricator.wikimedia.org/T355417)
[12:39:31] <wikibugs>	 (PS1) Majavah: P:openstack: keystone: use cloud-private for memcached access [puppet] - https://gerrit.wikimedia.org/r/991771 (https://phabricator.wikimedia.org/T355417)
[12:39:33] <wikibugs>	 (PS1) Majavah: P:openstack: nova: use cloud-private for memcached access [puppet] - https://gerrit.wikimedia.org/r/991772 (https://phabricator.wikimedia.org/T355417)
[12:39:35] <wikibugs>	 (PS1) Majavah: P:openstack: neutron: use cloud-private for memcached access [puppet] - https://gerrit.wikimedia.org/r/991773 (https://phabricator.wikimedia.org/T355417)
[12:41:57] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging] START helmfile.d/services/miscweb: apply
[12:42:23] <logmsgbot>	 !log jelto@deploy2002 helmfile [staging] DONE helmfile.d/services/miscweb: apply
[12:42:45] <wikibugs>	 (CR) Majavah: [V: +1] "PCC SUCCESS (CORE_DIFF 5): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1166/co" [puppet] - https://gerrit.wikimedia.org/r/991773 (https://phabricator.wikimedia.org/T355417) (owner: Majavah)
[12:43:31] <logmsgbot>	 !log jelto@deploy2002 helmfile [eqiad] START helmfile.d/services/miscweb: apply
[12:44:16] <logmsgbot>	 !log jelto@deploy2002 helmfile [eqiad] DONE helmfile.d/services/miscweb: apply
[12:44:52] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] START helmfile.d/services/miscweb: apply
[12:45:15] <logmsgbot>	 !log jelto@deploy2002 helmfile [codfw] DONE helmfile.d/services/miscweb: apply
[12:48:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P55023 and previous config saved to /var/cache/conftool/dbconfig/20240119-124853-marostegui.json
[12:52:45] <wikibugs>	 (CR) Majavah: [C: +2] P:openstack: glance: use cloud-private for memcached access [puppet] - https://gerrit.wikimedia.org/r/991769 (https://phabricator.wikimedia.org/T355417) (owner: Majavah)
[12:52:52] <wikibugs>	 (CR) Majavah: [C: +2] P:openstack: placement: use cloud-private for memcached access [puppet] - https://gerrit.wikimedia.org/r/991770 (https://phabricator.wikimedia.org/T355417) (owner: Majavah)
[12:54:15] <wikibugs>	 SRE, SRE-swift-storage, Thumbor, Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334 (hnowlan) >>! In T345334#9471632, @Midleading wrote: > Thumbor is currently heavily overloaded (T337649). As a result, traffic to thumbor should be reduced a...
[12:58:11] <wikibugs>	 (CR) Muehlenhoff: [C: +1] grafana: temp disable rsync stunnel for puppet7 migration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/991542 (https://phabricator.wikimedia.org/T352665) (owner: Filippo Giunchedi)
[12:58:21] <wikibugs>	 SRE, SRE-swift-storage, Thumbor, Traffic: Cache thumbs in our caching infrastructure (e.g. ATS) - https://phabricator.wikimedia.org/T345334 (taavi) Stalled→Open
[13:01:00] <wikibugs>	 (PS1) Muehlenhoff: Remove Marko from a few groups no longer needed/used [puppet] - https://gerrit.wikimedia.org/r/991774
[13:04:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317', diff saved to https://phabricator.wikimedia.org/P55024 and previous config saved to /var/cache/conftool/dbconfig/20240119-130400-marostegui.json
[13:14:48] <wikibugs>	 (CR) Muehlenhoff: [C: +1] grafana: Create the grafana sysuser with a reserved UID/GID (1 comment) [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[13:18:28] <icinga-wm>	 RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:19:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2168:3317 (T354336)', diff saved to https://phabricator.wikimedia.org/P55026 and previous config saved to /var/cache/conftool/dbconfig/20240119-131906-marostegui.json
[13:19:09] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance
[13:19:23] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[13:19:23] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2169.codfw.wmnet with reason: Maintenance
[13:19:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2169:3317 (T354336)', diff saved to https://phabricator.wikimedia.org/P55027 and previous config saved to /var/cache/conftool/dbconfig/20240119-131929-marostegui.json
[13:21:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T354336)', diff saved to https://phabricator.wikimedia.org/P55028 and previous config saved to /var/cache/conftool/dbconfig/20240119-132153-marostegui.json
[13:28:14] <icinga-wm>	 RECOVERY - Check unit status of httpbb_kubernetes_mw-api-ext_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_mw-api-ext_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[13:32:11] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc1046.eqiad.wmnet
[13:32:17] <logmsgbot>	 !log jiji@cumin1002 START - Cookbook sre.hosts.reboot-single for host mc2046.codfw.wmnet
[13:36:58] <wikibugs>	 SRE, ops-codfw, DBA, DC-Ops: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350 (Marostegui)
[13:37:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P55029 and previous config saved to /var/cache/conftool/dbconfig/20240119-133659-marostegui.json
[13:37:11] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q#:rack/setup/install db2196-db2220 - https://phabricator.wikimedia.org/T355350 (Marostegui)
[13:37:19] <wikibugs>	 (CR) Herron: "good idea thanks for this" [alerts] - https://gerrit.wikimedia.org/r/991756 (owner: Filippo Giunchedi)
[13:38:05] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc1046.eqiad.wmnet
[13:38:14] <logmsgbot>	 !log jiji@cumin1002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host mc2046.codfw.wmnet
[13:39:29] <wikibugs>	 SRE, ops-codfw, DC-Ops, Data-Persistence: Q3:rack/setup/install es[2035-2040] - https://phabricator.wikimedia.org/T355343 (Marostegui)
[13:43:32] <logmsgbot>	 !log gmodena@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[13:45:58] <wikibugs>	 SRE, Infrastructure-Foundations, netops: Codfw row A/B top-of-rack switch refresh - https://phabricator.wikimedia.org/T327938 (cmooney)
[13:46:04] <wikibugs>	 SRE, Infrastructure-Foundations, Traffic, netops: Add new codfw private vlan sub-interfaces to lvs2013 and lvs2014 - https://phabricator.wikimedia.org/T348225 (cmooney) Open→Resolved Done under {{T348218}}
[13:46:17] <logmsgbot>	 !log gmodena@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[13:46:20] <logmsgbot>	 !log gmodena@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[13:46:45] <wikibugs>	 (PS4) Muehlenhoff: failoid: Remove system::role [puppet] - https://gerrit.wikimedia.org/r/983687
[13:47:32] <wikibugs>	 (PS4) Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665)
[13:48:35] <wikibugs>	 (CR) Andrea Denisse: grafana: Create the grafana sysuser with a reserved UID/GID (1 comment) [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[13:52:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317', diff saved to https://phabricator.wikimedia.org/P55030 and previous config saved to /var/cache/conftool/dbconfig/20240119-135206-marostegui.json
[13:55:02] <wikibugs>	 (CR) Andrea Denisse: grafana: temp disable rsync stunnel for puppet7 migration (1 comment) [puppet] - https://gerrit.wikimedia.org/r/991542 (https://phabricator.wikimedia.org/T352665) (owner: Filippo Giunchedi)
[13:55:04] <wikibugs>	 (CR) Effie Mouzeli: "(disclaimer: I have limited understanding of cassandra at WMF and in general)" [deployment-charts] - https://gerrit.wikimedia.org/r/991027 (https://phabricator.wikimedia.org/T350507) (owner: Hnowlan)
[13:57:48] <logmsgbot>	 !log gmodena@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[13:57:52] <logmsgbot>	 !log gmodena@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:00:08] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/983687 (owner: Muehlenhoff)
[14:00:15] <wikibugs>	 (CR) Jgreen: [C: +1] Fix deployment diff issue and clean up presentation [puppet] - https://gerrit.wikimedia.org/r/991681 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[14:00:50] <wikibugs>	 (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[14:02:43] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1103.eqiad.wmnet with OS bullseye
[14:06:52] <logmsgbot>	 !log gmodena@deploy2002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:06:59] <logmsgbot>	 !log gmodena@deploy2002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:07:13] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2169:3317 (T354336)', diff saved to https://phabricator.wikimedia.org/P55031 and previous config saved to /var/cache/conftool/dbconfig/20240119-140712-marostegui.json
[14:07:15] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2182.codfw.wmnet with reason: Maintenance
[14:07:40] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2182.codfw.wmnet with reason: Maintenance
[14:07:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db2182 (T354336)', diff saved to https://phabricator.wikimedia.org/P55032 and previous config saved to /var/cache/conftool/dbconfig/20240119-140746-marostegui.json
[14:07:57] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[14:12:17] <logmsgbot>	 !log gmodena@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:12:20] <logmsgbot>	 !log gmodena@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:13:44] <wikibugs>	 (CR) Muehlenhoff: [C: +2] failoid: Remove system::role [puppet] - https://gerrit.wikimedia.org/r/983687 (owner: Muehlenhoff)
[14:13:59] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic1107.eqiad.wmnet with OS bullseye
[14:14:11] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T354336)', diff saved to https://phabricator.wikimedia.org/P55033 and previous config saved to /var/cache/conftool/dbconfig/20240119-141411-marostegui.json
[14:14:27] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[14:17:37] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1103.eqiad.wmnet with reason: host reimage
[14:18:54] <wikibugs>	 SRE, ops-codfw, serviceops: Broken CPU on mw2394 - https://phabricator.wikimedia.org/T354193 (Jhancock.wm) Open→Resolved @Clement_Goubert I replaced the DIMM and the error has cleared. You should be able to add back.
[14:20:49] <wikibugs>	 (PS1) Gmodena: mw-page-content-change-enrich: increase max.request.size [deployment-charts] - https://gerrit.wikimedia.org/r/991781 (https://phabricator.wikimedia.org/T355426)
[14:20:49] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1103.eqiad.wmnet with reason: host reimage
[14:21:29] <logmsgbot>	 !log gmodena@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:21:33] <logmsgbot>	 !log gmodena@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:22:13] <wikibugs>	 (CR) Brouberol: [C: +1] mw-page-content-change-enrich: increase max.request.size [deployment-charts] - https://gerrit.wikimedia.org/r/991781 (https://phabricator.wikimedia.org/T355426) (owner: Gmodena)
[14:23:14] <wikibugs>	 (CR) Gmodena: [C: +2] mw-page-content-change-enrich: increase max.request.size [deployment-charts] - https://gerrit.wikimedia.org/r/991781 (https://phabricator.wikimedia.org/T355426) (owner: Gmodena)
[14:24:12] <wikibugs>	 (Merged) jenkins-bot: mw-page-content-change-enrich: increase max.request.size [deployment-charts] - https://gerrit.wikimedia.org/r/991781 (https://phabricator.wikimedia.org/T355426) (owner: Gmodena)
[14:27:33] <logmsgbot>	 !log gmodena@deploy2002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
[14:27:36] <logmsgbot>	 !log gmodena@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
[14:29:18] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P55034 and previous config saved to /var/cache/conftool/dbconfig/20240119-142917-marostegui.json
[14:29:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on elastic1107.eqiad.wmnet with reason: host reimage
[14:30:29] <wikibugs>	 (CR) Filippo Giunchedi: "See inline, LGTM overall" [puppet] - https://gerrit.wikimedia.org/r/990795 (https://phabricator.wikimedia.org/T352665) (owner: Andrea Denisse)
[14:31:47] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2088.codfw.wmnet']
[14:33:05] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on elastic1107.eqiad.wmnet with reason: host reimage
[14:34:23] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2088.codfw.wmnet']
[14:34:39] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2088.codfw.wmnet']
[14:34:58] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['elastic2088.codfw.wmnet']
[14:35:19] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2088.codfw.wmnet with OS bullseye
[14:37:37] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1103.eqiad.wmnet with OS bullseye
[14:39:19] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:44:24] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182', diff saved to https://phabricator.wikimedia.org/P55036 and previous config saved to /var/cache/conftool/dbconfig/20240119-144423-marostegui.json
[14:44:51] <wikibugs>	 SRE, SRE-Access-Requests: Requesting server access for MFischer (WMF) - https://phabricator.wikimedia.org/T355395 (MFischer)
[14:45:55] <wikibugs>	 SRE, SRE-Access-Requests: Requesting server access for MFischer (WMF) - https://phabricator.wikimedia.org/T355395 (MFischer) I have added the key, thank you everyone  :)
[14:49:37] <wikibugs>	 (CR) Jelto: [C: +1] "lgtm, PCC diff looks reasonable." [puppet] - https://gerrit.wikimedia.org/r/987135 (https://phabricator.wikimedia.org/T257741) (owner: EoghanGaffney)
[14:50:30] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host elastic1107.eqiad.wmnet with OS bullseye
[14:51:58] <wikibugs>	 (PS1) Ssingh: P:lvs: set monitoring enabled for IPIP-related services [puppet] - https://gerrit.wikimedia.org/r/991785 (https://phabricator.wikimedia.org/T351069)
[14:54:27] <icinga-wm>	 PROBLEM - MariaDB Replica SQL: analytics_meta on db1208 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1064, Errmsg: Error You have an error in your SQL syntax: check the manual that corresponds to your MariaDB server version for the right syntax to use near offset INTEGER) at line 1 on query. Default database: superset_staging. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_
[14:55:44] <wikibugs>	 (PS5) Klausman: helmfile/rbac: Allow deploy users to debug pods in experimental [deployment-charts] - https://gerrit.wikimedia.org/r/991309 (https://phabricator.wikimedia.org/T354516)
[14:56:16] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2094.codfw.wmnet with OS bullseye
[14:57:21] <wikibugs>	 (PS1) Muehlenhoff: Deprecate system::role for IF services (batch one) [puppet] - https://gerrit.wikimedia.org/r/991786
[14:59:19] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:59:31] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db2182 (T354336)', diff saved to https://phabricator.wikimedia.org/P55038 and previous config saved to /var/cache/conftool/dbconfig/20240119-145930-marostegui.json
[14:59:35] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[14:59:41] <wikibugs>	 (PS1) Jgiannelos: mobileapps: Use core page html on staging [deployment-charts] - https://gerrit.wikimedia.org/r/991787
[15:00:02] <wikibugs>	 (PS2) Jgiannelos: mobileapps: Use core page html on staging [deployment-charts] - https://gerrit.wikimedia.org/r/991787 (https://phabricator.wikimedia.org/T339865)
[15:00:39] <wikibugs>	 (CR) Klausman: helmfile/rbac: Allow deploy users to debug pods in experimental (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/991309 (https://phabricator.wikimedia.org/T354516) (owner: Klausman)
[15:00:52] <wikibugs>	 (PS2) Ssingh: P:lvs: set monitoring enabled for IPIP-related services [puppet] - https://gerrit.wikimedia.org/r/991785 (https://phabricator.wikimedia.org/T351069)
[15:01:12] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db2118.codfw.wmnet with reason: Maintenance
[15:01:26] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db2118.codfw.wmnet with reason: Maintenance
[15:01:55] <wikibugs>	 (CR) Ilias Sarantopoulos: [C: +1] "LGTM!" [deployment-charts] - https://gerrit.wikimedia.org/r/991309 (https://phabricator.wikimedia.org/T354516) (owner: Klausman)
[15:02:01] <wikibugs>	 (CR) Ssingh: [V: +1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/label=puppet5-compiler-node/1168/co" [puppet] - https://gerrit.wikimedia.org/r/991785 (https://phabricator.wikimedia.org/T351069) (owner: Ssingh)
[15:03:14] <wikibugs>	 (PS1) Bking: cloudelastic: bring cloudelastic10[07-10] into svc [puppet] - https://gerrit.wikimedia.org/r/991788 (https://phabricator.wikimedia.org/T351354)
[15:03:50] <wikibugs>	 SRE, SRE-Access-Requests: Requesting server access for MFischer (WMF) - https://phabricator.wikimedia.org/T355395 (MoritzMuehlenhoff)
[15:05:40] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[15:05:45] <wikibugs>	 (CR) Hnowlan: modules: add cassandra client module (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/991027 (https://phabricator.wikimedia.org/T350507) (owner: Hnowlan)
[15:05:54] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1140.eqiad.wmnet with reason: Maintenance
[15:09:24] <wikibugs>	 (PS1) Muehlenhoff: Add mfischer to analytics-privatedata-users [puppet] - https://gerrit.wikimedia.org/r/991789 (https://phabricator.wikimedia.org/T355395)
[15:09:39] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[15:09:48] <wikibugs>	 SRE, SRE-Access-Requests, Patch-For-Review: Requesting server access for MFischer (WMF) - https://phabricator.wikimedia.org/T355395 (MoritzMuehlenhoff)
[15:09:53] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1150.eqiad.wmnet with reason: Maintenance
[15:09:57] <wikibugs>	 (CR) Gehel: "It looks like we missing entries in:" [puppet] - https://gerrit.wikimedia.org/r/991788 (https://phabricator.wikimedia.org/T351354) (owner: Bking)
[15:13:54] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[15:14:07] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1166.eqiad.wmnet with reason: Maintenance
[15:14:08] <wikibugs>	 (PS2) Bking: cloudelastic: bring cloudelastic10[07-10] into svc [puppet] - https://gerrit.wikimedia.org/r/991788 (https://phabricator.wikimedia.org/T351354)
[15:14:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1166 (T354336)', diff saved to https://phabricator.wikimedia.org/P55039 and previous config saved to /var/cache/conftool/dbconfig/20240119-151413-marostegui.json
[15:14:18] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[15:16:17] <wikibugs>	 (PS3) Bking: cloudelastic: bring cloudelastic10[07-10] into svc [puppet] - https://gerrit.wikimedia.org/r/991788 (https://phabricator.wikimedia.org/T351354)
[15:16:42] <wikibugs>	 (CR) Bking: cloudelastic: bring cloudelastic10[07-10] into svc (1 comment) [puppet] - https://gerrit.wikimedia.org/r/991788 (https://phabricator.wikimedia.org/T351354) (owner: Bking)
[15:19:40] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T354336)', diff saved to https://phabricator.wikimedia.org/P55040 and previous config saved to /var/cache/conftool/dbconfig/20240119-151940-marostegui.json
[15:19:46] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[15:20:23] <wikibugs>	 (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/991788 (https://phabricator.wikimedia.org/T351354) (owner: Bking)
[15:21:44] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[15:22:04] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Disk (sda) failed in ms-be2072 - https://phabricator.wikimedia.org/T355330 (Jhancock.wm) a:Jhancock.wm @MatthewVernon replaced the drive from stock.  leaving ticket open until we get replacement drive from Dell.   please @ me or papaul if any new err...
[15:28:38] <wikibugs>	 (CR) Btullis: [C: +2] Add data1.usrdm1.scatter.red to rsync config for dumps [puppet] - https://gerrit.wikimedia.org/r/989217 (https://phabricator.wikimedia.org/T354679) (owner: Xcollazo)
[15:29:10] <wikibugs>	 (CR) Bking: [C: +2] cloudelastic: bring cloudelastic10[07-10] into svc [puppet] - https://gerrit.wikimedia.org/r/991788 (https://phabricator.wikimedia.org/T351354) (owner: Bking)
[15:34:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P55041 and previous config saved to /var/cache/conftool/dbconfig/20240119-153446-marostegui.json
[15:37:29] <wikibugs>	 (PS1) Muehlenhoff: udp2log: Replace ferm rules with firewall::service [puppet] - https://gerrit.wikimedia.org/r/991793
[15:38:18] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Replica SQL: analytics_meta on db1208 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1064, Errmsg: Error You have an error in your SQL syntax: check the manual that corresponds to your MariaDB server version for the right syntax to use near offset INTEGER) at line 1 on query. Default database: superset_staging. [Query snipped] Marostegui https://phabricator.wikimedia.org/T355435 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:38:19] <wikibugs>	 SRE, SRE-swift-storage, ops-codfw, DC-Ops: Disk (sda) failed in ms-be2072 - https://phabricator.wikimedia.org/T355330 (Jhancock.wm) SR183661764
[15:38:38] <wikibugs>	 (CR) CI reject: [V: -1] udp2log: Replace ferm rules with firewall::service [puppet] - https://gerrit.wikimedia.org/r/991793 (owner: Muehlenhoff)
[15:39:25] <jinxer-wm>	 (SystemdUnitFailed) firing: nginx.service Failed on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:39:45] <wikibugs>	 (PS2) Muehlenhoff: udp2log: Replace ferm rules with firewall::service [puppet] - https://gerrit.wikimedia.org/r/991793
[15:40:18] <wikibugs>	 (CR) JHathaway: [C: +1] "looks good, thanks" [puppet] - https://gerrit.wikimedia.org/r/991786 (owner: Muehlenhoff)
[15:42:36] <wikibugs>	 (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/991793 (owner: Muehlenhoff)
[15:45:27] <wikibugs>	 ops-codfw: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (Jhancock.wm)
[15:46:16] <logmsgbot>	 !log gmodena@deploy2002 Started deploy [airflow-dags/analytics@f32c06e]: (no justification provided)
[15:46:47] <logmsgbot>	 !log gmodena@deploy2002 Finished deploy [airflow-dags/analytics@f32c06e]: (no justification provided) (duration: 00m 30s)
[15:47:00] <wikibugs>	 (PS1) Xcollazo: Update public_mirrors.html with new mirror info as per change id: I6b720348be1600ce5a706dbacee4e0af2673139c. [puppet] - https://gerrit.wikimedia.org/r/991794 (https://phabricator.wikimedia.org/T354679)
[15:48:11] <wikibugs>	 (CR) CI reject: [V: -1] Update public_mirrors.html with new mirror info as per change id: I6b720348be1600ce5a706dbacee4e0af2673139c. [puppet] - https://gerrit.wikimedia.org/r/991794 (https://phabricator.wikimedia.org/T354679) (owner: Xcollazo)
[15:48:29] <wikibugs>	 ops-codfw: Relocating servers out of A1 in codfw - https://phabricator.wikimedia.org/T355437 (Jhancock.wm)
[15:49:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) nginx.service Failed on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[15:49:42] <wikibugs>	 (PS2) Xcollazo: Update public_mirrors.html with new mirror info. [puppet] - https://gerrit.wikimedia.org/r/991794 (https://phabricator.wikimedia.org/T354679)
[15:49:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P55042 and previous config saved to /var/cache/conftool/dbconfig/20240119-154953-marostegui.json
[15:54:25] <icinga-wm>	 RECOVERY - MariaDB Replica SQL: analytics_meta on db1208 is OK: OK slave_sql_state Slave_SQL_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[15:57:24] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2088.codfw.wmnet with OS bullseye
[15:58:33] <wikibugs>	 (PS1) PipelineBot: mobileapps: pipeline bot promote [deployment-charts] - https://gerrit.wikimedia.org/r/991806
[15:59:25] <icinga-wm>	 PROBLEM - Check systemd state on cloudelastic1010 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:59:39] <icinga-wm>	 PROBLEM - Check systemd state on cloudelastic1008 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:59:59] <icinga-wm>	 PROBLEM - Check systemd state on cloudelastic1007 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:03:41] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:03:41] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:03:41] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:04:41] <wikibugs>	 (CR) Alexandros Kosiaris: [C: -1] helmfile/rbac: Allow deploy users to debug pods in experimental (2 comments) [deployment-charts] - https://gerrit.wikimedia.org/r/991309 (https://phabricator.wikimedia.org/T354516) (owner: Klausman)
[16:05:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T354336)', diff saved to https://phabricator.wikimedia.org/P55043 and previous config saved to /var/cache/conftool/dbconfig/20240119-160459-marostegui.json
[16:05:02] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[16:05:09] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[16:05:15] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1175.eqiad.wmnet with reason: Maintenance
[16:05:22] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1175 (T354336)', diff saved to https://phabricator.wikimedia.org/P55044 and previous config saved to /var/cache/conftool/dbconfig/20240119-160521-marostegui.json
[16:06:05] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:06:05] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:06:05] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:08:27] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:10:29] <inflatador>	 ^^ looking at this one now
[16:10:46] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T354336)', diff saved to https://phabricator.wikimedia.org/P55045 and previous config saved to /var/cache/conftool/dbconfig/20240119-161046-marostegui.json
[16:10:55] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:10:56] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[16:11:45] <jinxer-wm>	 (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads  - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[16:13:23] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-chi-eqiad on cloudelastic1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:13:23] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1008 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:15:49] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:16:33] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2094.codfw.wmnet with OS bullseye
[16:18:05] <icinga-wm>	 PROBLEM - Check systemd state on cloudelastic1009 is CRITICAL: CRITICAL - degraded: The following units failed: nginx.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:18:15] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:20:15] <wikibugs>	 (PS1) Bking: cloudelastic: allow new hosts to request TLS certs [puppet] - https://gerrit.wikimedia.org/r/991797 (https://phabricator.wikimedia.org/T351354)
[16:20:19] <wikibugs>	 (CR) Filippo Giunchedi: [C: +1] "LGTM! Much nicer" [puppet] - https://gerrit.wikimedia.org/r/991793 (owner: Muehlenhoff)
[16:20:41] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:22:26] <wikibugs>	 (PS3) BCornwall: dns: Don't disable puppet/bird on restarting wdns [cookbooks] - https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779)
[16:23:09] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:23:28] <wikibugs>	 (PS4) BCornwall: dns: Don't disable puppet/bird on restarting wdns [cookbooks] - https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779)
[16:25:13] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) firing: wdqs1019:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[16:25:35] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1009 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:25:35] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-chi-eqiad on cloudelastic1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:25:35] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-chi-eqiad on cloudelastic1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:25:53] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P55046 and previous config saved to /var/cache/conftool/dbconfig/20240119-162552-marostegui.json
[16:26:34] <wikibugs>	 (CR) Andrew Bogott: [C: +1] "looks good, thanks!" [puppet] - https://gerrit.wikimedia.org/r/991797 (https://phabricator.wikimedia.org/T351354) (owner: Bking)
[16:27:17] <wikibugs>	 (CR) DCausse: [C: +1] cloudelastic: allow new hosts to request TLS certs [puppet] - https://gerrit.wikimedia.org/r/991797 (https://phabricator.wikimedia.org/T351354) (owner: Bking)
[16:27:26] <wikibugs>	 (CR) Bking: [C: +2] cloudelastic: allow new hosts to request TLS certs [puppet] - https://gerrit.wikimedia.org/r/991797 (https://phabricator.wikimedia.org/T351354) (owner: Bking)
[16:28:03] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:28:05] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:30:33] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:30:33] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:31:33] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1009 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[16:31:40] <Emperor>	 !log mark new drive as non-RAID, mount, restore to service with puppet ms-be2072 T355330
[16:31:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:31:51] <stashbot>	 T355330: Disk (sda) failed in ms-be2072 - https://phabricator.wikimedia.org/T355330
[16:31:57] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1009 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[16:32:07] <wikibugs>	 (PS5) BCornwall: dns: Don't disable puppet/bird on restarting wdns [cookbooks] - https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779)
[16:32:15] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1009 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[16:32:15] <icinga-wm>	 RECOVERY - Check systemd state on cloudelastic1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:32:27] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1009 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[16:32:33] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1009 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[16:32:33] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1008 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[16:32:33] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-chi-eqiad on cloudelastic1009 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[16:32:49] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1008 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[16:32:49] <icinga-wm>	 RECOVERY - Check systemd state on cloudelastic1008 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:32:55] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1007 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:32:57] <icinga-wm>	 PROBLEM - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1010 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Search
[16:33:21] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1008 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[16:33:30] <inflatador>	 ^^ these should clear very soon
[16:33:39] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1008 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[16:33:55] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1008 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[16:34:02] <wikibugs>	 (CR) Ssingh: [C: +1] "I am not sure if we need a post_action here but I will leave that to volans for review." [cookbooks] - https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779) (owner: BCornwall)
[16:34:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) nginx.service Failed on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:34:41] <wikibugs>	 (CR) BCornwall: [C: +2] pybal: Disable Pint promql/series checks [alerts] - https://gerrit.wikimedia.org/r/987499 (https://phabricator.wikimedia.org/T353760) (owner: BCornwall)
[16:34:51] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1007 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[16:34:53] <icinga-wm>	 RECOVERY - Check systemd state on cloudelastic1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[16:35:05] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1007 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[16:35:37] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1007 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[16:35:49] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1007 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[16:35:57] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1007 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[16:36:05] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-chi-eqiad on cloudelastic1007 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[16:38:20] <wikibugs>	 (PS1) Filippo Giunchedi: icinga: remove legacy check_nagios_paging [puppet] - https://gerrit.wikimedia.org/r/991801 (https://phabricator.wikimedia.org/T321808)
[16:38:42] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2088.codfw.wmnet with OS bullseye
[16:38:55] <wikibugs>	 (PS6) BCornwall: dns: Don't disable puppet on restarting wdns [cookbooks] - https://gerrit.wikimedia.org/r/991637 (https://phabricator.wikimedia.org/T353779)
[16:39:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (4) nginx.service Failed on cloudelastic1007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[16:40:59] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P55047 and previous config saved to /var/cache/conftool/dbconfig/20240119-164058-marostegui.json
[16:41:34] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T352010)', diff saved to https://phabricator.wikimedia.org/P55048 and previous config saved to /var/cache/conftool/dbconfig/20240119-164133-ladsgroup.json
[16:41:43] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[16:41:49] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2094.codfw.wmnet with OS bullseye
[16:43:49] <wikibugs>	 SRE-swift-storage, UploadWizard: Problem with uploading large files (2 GB) - https://phabricator.wikimedia.org/T355433 (Bugreporter)
[16:45:51] <wikibugs>	 (PS2) Dwisehaupt: Fix deployment diff issue and clean up presentation [puppet] - https://gerrit.wikimedia.org/r/991681 (https://phabricator.wikimedia.org/T343486)
[16:49:12] <wikibugs>	 (CR) Jgreen: [C: +1] Fix deployment diff issue and clean up presentation [puppet] - https://gerrit.wikimedia.org/r/991681 (https://phabricator.wikimedia.org/T343486) (owner: Dwisehaupt)
[16:50:00] <jinxer-wm>	 (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: wdqs1019:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[16:56:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T354336)', diff saved to https://phabricator.wikimedia.org/P55049 and previous config saved to /var/cache/conftool/dbconfig/20240119-165605-marostegui.json
[16:56:07] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[16:56:10] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[16:56:21] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1189.eqiad.wmnet with reason: Maintenance
[16:56:28] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1189 (T354336)', diff saved to https://phabricator.wikimedia.org/P55050 and previous config saved to /var/cache/conftool/dbconfig/20240119-165627-marostegui.json
[16:56:40] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P55051 and previous config saved to /var/cache/conftool/dbconfig/20240119-165639-ladsgroup.json
[17:01:46] <jinxer-wm>	 (SwiftTooManyMediaUploads) resolved: Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=eqiad - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
[17:01:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T354336)', diff saved to https://phabricator.wikimedia.org/P55052 and previous config saved to /var/cache/conftool/dbconfig/20240119-170154-marostegui.json
[17:01:59] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[17:04:28] <logmsgbot>	 !log bking@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host elastic2088.codfw.wmnet with OS bullseye
[17:06:52] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2088.codfw.wmnet with OS bullseye
[17:11:47] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1243', diff saved to https://phabricator.wikimedia.org/P55053 and previous config saved to /var/cache/conftool/dbconfig/20240119-171146-ladsgroup.json
[17:17:01] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P55054 and previous config saved to /var/cache/conftool/dbconfig/20240119-171700-marostegui.json
[17:17:49] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on cloudelastic1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[17:22:28] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cloudelastic1007.wikimedia.org
[17:23:04] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cloudelastic1008.wikimedia.org
[17:23:11] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cloudelastic1009.wikimedia.org
[17:23:18] <logmsgbot>	 !log bking@cumin2002 conftool action : set/pooled=yes; selector: name=cloudelastic1010.wikimedia.org
[17:25:42] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.downtime for 4:00:00 on cloudelastic1010.wikimedia.org with reason: need to fix regex certs
[17:25:57] <logmsgbot>	 !log bking@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on cloudelastic1010.wikimedia.org with reason: need to fix regex certs
[17:26:53] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1243 (T352010)', diff saved to https://phabricator.wikimedia.org/P55055 and previous config saved to /var/cache/conftool/dbconfig/20240119-172652-ladsgroup.json
[17:26:55] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1247.eqiad.wmnet with reason: Maintenance
[17:26:57] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[17:27:09] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1247.eqiad.wmnet with reason: Maintenance
[17:27:16] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1247 (T352010)', diff saved to https://phabricator.wikimedia.org/P55056 and previous config saved to /var/cache/conftool/dbconfig/20240119-172715-ladsgroup.json
[17:32:07] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P55057 and previous config saved to /var/cache/conftool/dbconfig/20240119-173207-marostegui.json
[17:39:06] <wikibugs>	 (CR) Dzahn: [C: +1] vrts: test delaying blackbox::check::http [puppet] - https://gerrit.wikimedia.org/r/991765 (https://phabricator.wikimedia.org/T354479) (owner: Jelto)
[17:39:32] <wikibugs>	 (CR) Dzahn: [C: +1] "I like the idea of testing it with vrts1002" [puppet] - https://gerrit.wikimedia.org/r/991765 (https://phabricator.wikimedia.org/T354479) (owner: Jelto)
[17:44:46] <wikibugs>	 (PS2) Dzahn: phabricator: repo-sync test, use a machine in other DC [puppet] - https://gerrit.wikimedia.org/r/991677 (https://phabricator.wikimedia.org/T334519)
[17:47:14] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T354336)', diff saved to https://phabricator.wikimedia.org/P55058 and previous config saved to /var/cache/conftool/dbconfig/20240119-174713-marostegui.json
[17:47:16] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[17:47:19] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[17:47:29] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1198.eqiad.wmnet with reason: Maintenance
[17:47:36] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1198 (T354336)', diff saved to https://phabricator.wikimedia.org/P55059 and previous config saved to /var/cache/conftool/dbconfig/20240119-174735-marostegui.json
[17:53:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T354336)', diff saved to https://phabricator.wikimedia.org/P55060 and previous config saved to /var/cache/conftool/dbconfig/20240119-175301-marostegui.json
[17:53:13] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[18:02:06] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2094.codfw.wmnet with OS bullseye
[18:08:08] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P55061 and previous config saved to /var/cache/conftool/dbconfig/20240119-180808-marostegui.json
[18:20:45] <wikibugs>	 (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/output/991677/1170/gitlab2003.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/991677 (https://phabricator.wikimedia.org/T334519) (owner: Dzahn)
[18:21:18] <wikibugs>	 (CR) Dzahn: [C: +2] phabricator: repo-sync test, use a machine in other DC [puppet] - https://gerrit.wikimedia.org/r/991677 (https://phabricator.wikimedia.org/T334519) (owner: Dzahn)
[18:23:15] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P55062 and previous config saved to /var/cache/conftool/dbconfig/20240119-182314-marostegui.json
[18:25:58] <wikibugs>	 (PS1) Dzahn: phabricator: clean up repo sync class after test [puppet] - https://gerrit.wikimedia.org/r/991829 (https://phabricator.wikimedia.org/T334519)
[18:27:09] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host elastic2088.codfw.wmnet with OS bullseye
[18:29:06] <wikibugs>	 (PS1) Dzahn: phabricator: delete unused repo sync class after test [puppet] - https://gerrit.wikimedia.org/r/991830 (https://phabricator.wikimedia.org/T334519)
[18:38:21] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T354336)', diff saved to https://phabricator.wikimedia.org/P55063 and previous config saved to /var/cache/conftool/dbconfig/20240119-183821-marostegui.json
[18:38:24] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[18:38:30] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[18:38:38] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1212.eqiad.wmnet with reason: Maintenance
[18:38:39] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[18:38:56] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 16:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
[18:39:02] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1212 (T354336)', diff saved to https://phabricator.wikimedia.org/P55064 and previous config saved to /var/cache/conftool/dbconfig/20240119-183902-marostegui.json
[18:44:47] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T354336)', diff saved to https://phabricator.wikimedia.org/P55065 and previous config saved to /var/cache/conftool/dbconfig/20240119-184446-marostegui.json
[18:44:58] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[18:54:01] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:54:27] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:55:21] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 51305 bytes in 0.071 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:55:47] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.270 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[18:59:54] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P55066 and previous config saved to /var/cache/conftool/dbconfig/20240119-185953-marostegui.json
[19:08:08] <wikibugs>	 (03PS2) 10Eevans: sessionstore: provision sessionstore1004 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989628 (https://phabricator.wikimedia.org/T353402)
[19:08:10] <wikibugs>	 (03PS2) 10Eevans: sessionstore: provision sessionstore1005 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989629 (https://phabricator.wikimedia.org/T353402)
[19:08:12] <wikibugs>	 (03PS2) 10Eevans: sessionstore: provision sessionstore1006 (new) [puppet] - 10https://gerrit.wikimedia.org/r/989630 (https://phabricator.wikimedia.org/T353402)
[19:08:14] <wikibugs>	 (03PS2) 10Eevans: sessionstore: configure new hosts to reuse /srv [puppet] - 10https://gerrit.wikimedia.org/r/989631 (https://phabricator.wikimedia.org/T353402)
[19:15:00] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P55067 and previous config saved to /var/cache/conftool/dbconfig/20240119-191459-marostegui.json
[19:30:06] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T354336)', diff saved to https://phabricator.wikimedia.org/P55068 and previous config saved to /var/cache/conftool/dbconfig/20240119-193006-marostegui.json
[19:30:08] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[19:30:22] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on db1223.eqiad.wmnet with reason: Maintenance
[19:30:29] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Depooling db1223 (T354336)', diff saved to https://phabricator.wikimedia.org/P55069 and previous config saved to /var/cache/conftool/dbconfig/20240119-193028-marostegui.json
[19:30:29] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[19:31:15] <wikibugs>	 (03CR) 10JHathaway: [C: 03+2] Fix deployment diff issue and clean up presentation [puppet] - 10https://gerrit.wikimedia.org/r/991681 (https://phabricator.wikimedia.org/T343486) (owner: 10Dwisehaupt)
[19:36:10] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T354336)', diff saved to https://phabricator.wikimedia.org/P55070 and previous config saved to /var/cache/conftool/dbconfig/20240119-193610-marostegui.json
[19:36:19] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[19:38:02] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['elastic2088.codfw.wmnet']
[19:43:37] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['elastic2088.codfw.wmnet']
[19:45:44] <logmsgbot>	 !log bking@cumin2002 START - Cookbook sre.hosts.reimage for host elastic2088.codfw.wmnet with OS bullseye
[19:51:17] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P55071 and previous config saved to /var/cache/conftool/dbconfig/20240119-195116-marostegui.json
[20:06:23] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P55072 and previous config saved to /var/cache/conftool/dbconfig/20240119-200622-marostegui.json
[20:21:30] <logmsgbot>	 !log marostegui@cumin1002 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T354336)', diff saved to https://phabricator.wikimedia.org/P55073 and previous config saved to /var/cache/conftool/dbconfig/20240119-202129-marostegui.json
[20:21:31] <logmsgbot>	 !log marostegui@cumin1002 START - Cookbook sre.hosts.downtime for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[20:21:35] <stashbot>	 T354336: Add columns cul_result_id and cul_result_plaintext_id to cu_log - https://phabricator.wikimedia.org/T354336
[20:21:45] <logmsgbot>	 !log marostegui@cumin1002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
[20:39:17] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 226, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:01:46] <wikibugs>	 (03PS1) 10Andrew Bogott: Galera: switch codfw1dev clustering to private IPs [puppet] - 10https://gerrit.wikimedia.org/r/991839 (https://phabricator.wikimedia.org/T355418)
[21:08:05] <wikibugs>	 (03PS2) 10Andrew Bogott: Galera: switch codfw1dev clustering to private IPs [puppet] - 10https://gerrit.wikimedia.org/r/991839 (https://phabricator.wikimedia.org/T355418)
[21:10:38] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/991839 (https://phabricator.wikimedia.org/T355418) (owner: 10Andrew Bogott)
[21:12:10] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Galera: switch codfw1dev clustering to private IPs [puppet] - 10https://gerrit.wikimedia.org/r/991839 (https://phabricator.wikimedia.org/T355418) (owner: 10Andrew Bogott)
[21:13:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) firing: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:15:35] <wikibugs>	 (03PS3) 10Andrew Bogott: Galera: switch codfw1dev clustering to private IPs [puppet] - 10https://gerrit.wikimedia.org/r/991839 (https://phabricator.wikimedia.org/T355418)
[21:18:15] <jinxer-wm>	 (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad mw-api-int (k8s) - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/U7JT--knk/mw-on-k8s?orgId=1&viewPanel=55&var-dc=eqiad%20prometheus/k8s&var-service=mediawiki&var-namespace=mw-api-int - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
[21:22:38] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Galera: switch codfw1dev clustering to private IPs [puppet] - 10https://gerrit.wikimedia.org/r/991839 (https://phabricator.wikimedia.org/T355418) (owner: 10Andrew Bogott)
[21:27:49] <jinxer-wm>	 (PuppetFailure) firing: Puppet has failed on cloudelastic1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure
[21:29:25] <jinxer-wm>	 (SystemdUnitFailed) firing: nginx.service Failed on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[21:50:48] <wikibugs>	 (03PS1) 10Bking: cloudelastic: cleanup allowed_regexes [puppet] - 10https://gerrit.wikimedia.org/r/991845 (https://phabricator.wikimedia.org/T351354)
[22:05:14] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop: https://puppet-compiler.wmflabs.org/output/991651/1172/" [puppet] - 10https://gerrit.wikimedia.org/r/991651 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn)
[22:05:48] <ryankemper>	 !log [WDQS] Repooled `wdqs10[19,20]` (caught up on lag)
[22:05:51] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:06:12] <wikibugs>	 (03CR) 10Dwisehaupt: [C: 03+2] "Looks good. Verified the matching with a test in regexr." [puppet] - 10https://gerrit.wikimedia.org/r/991845 (https://phabricator.wikimedia.org/T351354) (owner: 10Bking)
[22:06:57] <wikibugs>	 (03PS1) 10Andrew Bogott: Galera: switch codfw1dev nodes to replicate on private address [puppet] - 10https://gerrit.wikimedia.org/r/991866 (https://phabricator.wikimedia.org/T355418)
[22:08:01] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/991866 (https://phabricator.wikimedia.org/T355418) (owner: 10Andrew Bogott)
[22:08:53] <wikibugs>	 (03CR) 10Dzahn: [V: 03+1 C: 03+2] "confirmed it's a noop on clouddumps1001/1002" [puppet] - 10https://gerrit.wikimedia.org/r/991651 (https://phabricator.wikimedia.org/T354221) (owner: 10Dzahn)
[22:13:24] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T352010)', diff saved to https://phabricator.wikimedia.org/P55074 and previous config saved to /var/cache/conftool/dbconfig/20240119-221324-ladsgroup.json
[22:13:35] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[22:14:49] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator: clean up repo sync class after test [puppet] - 10https://gerrit.wikimedia.org/r/991829 (https://phabricator.wikimedia.org/T334519) (owner: 10Dzahn)
[22:19:15] <wikibugs>	 (03PS2) 10Andrew Bogott: Galera: switch codfw1dev nodes to replicate on private address [puppet] - 10https://gerrit.wikimedia.org/r/991866 (https://phabricator.wikimedia.org/T355418)
[22:20:49] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/991866 (https://phabricator.wikimedia.org/T355418) (owner: 10Andrew Bogott)
[22:20:58] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] phabricator: delete unused repo sync class after test [puppet] - 10https://gerrit.wikimedia.org/r/991830 (https://phabricator.wikimedia.org/T334519) (owner: 10Dzahn)
[22:21:04] <wikibugs>	 (03PS2) 10Dzahn: phabricator: delete unused repo sync class after test [puppet] - 10https://gerrit.wikimedia.org/r/991830 (https://phabricator.wikimedia.org/T334519)
[22:27:55] <wikibugs>	 (03PS3) 10Andrew Bogott: Galera: switch codfw1dev nodes to replicate on private address [puppet] - 10https://gerrit.wikimedia.org/r/991866 (https://phabricator.wikimedia.org/T355418)
[22:28:31] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P55075 and previous config saved to /var/cache/conftool/dbconfig/20240119-222830-ladsgroup.json
[22:37:38] <wikibugs>	 (03PS4) 10Andrew Bogott: Galera: switch codfw1dev nodes to replicate on private address [puppet] - 10https://gerrit.wikimedia.org/r/991866 (https://phabricator.wikimedia.org/T355418)
[22:37:48] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/991866 (https://phabricator.wikimedia.org/T355418) (owner: 10Andrew Bogott)
[22:42:45] <wikibugs>	 (03PS5) 10Andrew Bogott: Galera: switch codfw1dev nodes to replicate on private address [puppet] - 10https://gerrit.wikimedia.org/r/991866 (https://phabricator.wikimedia.org/T355418)
[22:43:24] <wikibugs>	 (03CR) 10Andrew Bogott: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/991866 (https://phabricator.wikimedia.org/T355418) (owner: 10Andrew Bogott)
[22:43:38] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1247', diff saved to https://phabricator.wikimedia.org/P55076 and previous config saved to /var/cache/conftool/dbconfig/20240119-224337-ladsgroup.json
[22:45:01] <wikibugs>	 (03CR) 10Dzahn: phabricator: use same db server regardless of DC of phab server (1 comment) [puppet] - 10https://gerrit.wikimedia.org/r/989537 (owner: 10Dzahn)
[22:55:56] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Galera: switch codfw1dev nodes to replicate on private address [puppet] - 10https://gerrit.wikimedia.org/r/991866 (https://phabricator.wikimedia.org/T355418) (owner: 10Andrew Bogott)
[22:58:44] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1247 (T352010)', diff saved to https://phabricator.wikimedia.org/P55077 and previous config saved to /var/cache/conftool/dbconfig/20240119-225844-ladsgroup.json
[22:58:46] <logmsgbot>	 !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1248.eqiad.wmnet with reason: Maintenance
[22:58:52] <stashbot>	 T352010: Gradually drop old pagelinks columns - https://phabricator.wikimedia.org/T352010
[22:59:00] <logmsgbot>	 !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1248.eqiad.wmnet with reason: Maintenance
[22:59:07] <logmsgbot>	 !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1248 (T352010)', diff saved to https://phabricator.wikimedia.org/P55078 and previous config saved to /var/cache/conftool/dbconfig/20240119-225906-ladsgroup.json
[23:22:48] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting server access for MFischer (WMF) - https://phabricator.wikimedia.org/T355395 (10odimitrijevic) Approved
[23:25:56] <wikibugs>	 10SRE, 10SRE Observability: smart-data-dump should fail loudly when it can't gather metrics - https://phabricator.wikimedia.org/T267135 (10colewhite)
[23:27:10] <wikibugs>	 10SRE, 10SRE Observability: smart-data-dump should fail loudly when it can't gather metrics - https://phabricator.wikimedia.org/T267135 (10colewhite)
[23:27:16] <wikibugs>	 10SRE, 10Observability-Alerting, 10Epic: Monitor and alarm on SMART attributes [tracking] - https://phabricator.wikimedia.org/T86552 (10colewhite)
[23:35:35] <icinga-wm>	 RECOVERY - Check systemd state on cloudelastic1010 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[23:35:45] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad on cloudelastic1010 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[23:35:53] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad-ro on cloudelastic1010 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[23:36:03] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-chi-eqiad on cloudelastic1010 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[23:36:15] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-psi-eqiad-ro on cloudelastic1010 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[23:36:31] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-omega-eqiad on cloudelastic1010 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[23:37:05] <icinga-wm>	 RECOVERY - Elasticsearch HTTPS for cloudelastic-chi-eqiad-ro on cloudelastic1010 is OK: SSL OK - Certificate cloudelastic.wikimedia.org valid until 2024-04-18 14:47:51 +0000 (expires in 89 days) https://wikitech.wikimedia.org/wiki/Search
[23:39:25] <jinxer-wm>	 (SystemdUnitFailed) resolved: nginx.service Failed on cloudelastic1010:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[23:57:48] <jinxer-wm>	 (PuppetFailure) resolved: Puppet has failed on cloudelastic1010:9100 - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet - https://alerts.wikimedia.org/?q=alertname%3DPuppetFailure