[00:03:47] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P47711 and previous config saved to /var/cache/conftool/dbconfig/20230505-000346-ladsgroup.json
[00:08:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P47712 and previous config saved to /var/cache/conftool/dbconfig/20230505-000832-ladsgroup.json
[00:11:24] (CR) Dzahn: [V: +1 C: +2] gerrit: disable replication from gerrit1003 [puppet] - https://gerrit.wikimedia.org/r/915830 (https://phabricator.wikimedia.org/T335730) (owner: Dzahn)
[00:16:48] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:17:15] (CR) Dzahn: [V: +1 C: +2] "noop on 1001 and 2002, on 1003 the remote section stays removed, "onStartup" is set to true again but nothing happens in replication_log o" [puppet] - https://gerrit.wikimedia.org/r/915830 (https://phabricator.wikimedia.org/T335730) (owner: Dzahn)
[00:18:53] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129', diff saved to https://phabricator.wikimedia.org/P47713 and previous config saved to /var/cache/conftool/dbconfig/20230505-001853-ladsgroup.json
[00:21:00] (CR) Andrea Denisse: [C: +2] prometheus: Add label to prometheus5002 data blocks to prevent data duplication [puppet] - https://gerrit.wikimedia.org/r/912385 (https://phabricator.wikimedia.org/T335406) (owner: Andrea Denisse)
[00:23:39] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125', diff saved to https://phabricator.wikimedia.org/P47714 and previous config saved to /var/cache/conftool/dbconfig/20230505-002339-ladsgroup.json
[00:23:49] (PS2) Andrea Denisse: prometheus: Failover DNS from prometheus5001 to prometheus5002 in eqsin [dns] - https://gerrit.wikimedia.org/r/913196 (https://phabricator.wikimedia.org/T309979)
[00:25:26] (CR) Andrea Denisse: [C: +2] prometheus: Add label to prometheus6002 data blocks to prevent data duplication [puppet] - https://gerrit.wikimedia.org/r/912409 (https://phabricator.wikimedia.org/T335406) (owner: Andrea Denisse)
[00:27:04] RECOVERY - puppet last run on prometheus5002 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:27:33] (CR) Andrea Denisse: [C: +2] prometheus: Failover DNS from prometheus5001 to prometheus5002 in eqsin [dns] - https://gerrit.wikimedia.org/r/913196 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse)
[00:29:14] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:33:40] RECOVERY - puppet last run on prometheus6002 is OK: OK: Puppet is currently disabled (Disabling Puppet, Prometheus, and Thanos sidecar on the Buster host to migrate Prometheus hosts to Bullseye - T309979 - denisse), not alerting. Last run 40 seconds ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[00:34:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1129 (T335845)', diff saved to https://phabricator.wikimedia.org/P47715 and previous config saved to /var/cache/conftool/dbconfig/20230505-003359-ladsgroup.json
[00:34:04] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[00:34:17] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1139.eqiad.wmnet with reason: Maintenance
[00:37:25] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[00:37:38] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1156.eqiad.wmnet with reason: Maintenance
[00:37:40] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[00:37:44] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1014,1018,1021].eqiad.wmnet,db1155.eqiad.wmnet with reason: Maintenance
[00:37:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1156 (T335845)', diff saved to https://phabricator.wikimedia.org/P47716 and previous config saved to /var/cache/conftool/dbconfig/20230505-003749-ladsgroup.json
[00:38:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2125 (T335845)', diff saved to https://phabricator.wikimedia.org/P47717 and previous config saved to /var/cache/conftool/dbconfig/20230505-003845-ladsgroup.json
[00:38:52] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance
[00:39:05] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2126.codfw.wmnet with reason: Maintenance
[00:39:06] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[00:39:08] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2187.codfw.wmnet with reason: Maintenance
[00:39:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2126 (T335845)', diff saved to https://phabricator.wikimedia.org/P47718 and previous config saved to /var/cache/conftool/dbconfig/20230505-003914-ladsgroup.json
[00:39:24] (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/915787
[00:39:28] (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/915787 (owner: TrainBranchBot)
[00:44:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T335845)', diff saved to https://phabricator.wikimedia.org/P47719 and previous config saved to /var/cache/conftool/dbconfig/20230505-004408-ladsgroup.json
[00:46:30] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:46:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T335845)', diff saved to https://phabricator.wikimedia.org/P47720 and previous config saved to /var/cache/conftool/dbconfig/20230505-004648-ladsgroup.json
[00:54:59] (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/915787 (owner: TrainBranchBot)
[00:59:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P47721 and previous config saved to /var/cache/conftool/dbconfig/20230505-005914-ladsgroup.json
[00:59:18] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:01:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P47722 and previous config saved to /var/cache/conftool/dbconfig/20230505-010154-ladsgroup.json
[01:02:57] (PS2) Andrea Denisse: prometheus: Failover DNS from prometheus6001 to prometheus6002 in drmrs [dns] - https://gerrit.wikimedia.org/r/913198 (https://phabricator.wikimedia.org/T309979)
[01:03:04] (CR) Andrea Denisse: [C: +2] prometheus: Failover DNS from prometheus6001 to prometheus6002 in drmrs [dns] - https://gerrit.wikimedia.org/r/913198 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse)
[01:03:06] (CR) Andrea Denisse: [V: +2 C: +2] prometheus: Failover DNS from prometheus6001 to prometheus6002 in drmrs [dns] - https://gerrit.wikimedia.org/r/913198 (https://phabricator.wikimedia.org/T309979) (owner: Andrea Denisse)
[01:14:21] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156', diff saved to https://phabricator.wikimedia.org/P47723 and previous config saved to /var/cache/conftool/dbconfig/20230505-011421-ladsgroup.json
[01:16:32] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:16:55] !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1005.eqiad.wmnet
[01:17:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126', diff saved to https://phabricator.wikimedia.org/P47724 and previous config saved to /var/cache/conftool/dbconfig/20230505-011700-ladsgroup.json
[01:18:57] !log denisse@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM prometheus3002.esams.wmnet
[01:20:58] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1005.eqiad.wmnet
[01:21:30] !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2005.codfw.wmnet
[01:25:17] !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM prometheus3002.esams.wmnet
[01:26:03] !log denisse@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM prometheus4002.ulsfo.wmnet
[01:26:10] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[01:28:06] PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:28:16] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: monitor_refine_event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:28:58] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:29:27] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1156 (T335845)', diff saved to https://phabricator.wikimedia.org/P47725 and previous config saved to /var/cache/conftool/dbconfig/20230505-012927-ladsgroup.json
[01:29:32] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[01:29:45] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1170.eqiad.wmnet with reason: Maintenance
[01:29:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3312 (T335845)', diff saved to https://phabricator.wikimedia.org/P47726 and previous config saved to /var/cache/conftool/dbconfig/20230505-012950-ladsgroup.json
[01:30:47] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2005.codfw.wmnet
[01:31:05] !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus1006.eqiad.wmnet
[01:31:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1170:3317 (T335845)', diff saved to https://phabricator.wikimedia.org/P47727 and previous config saved to /var/cache/conftool/dbconfig/20230505-013108-ladsgroup.json
[01:32:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2126 (T335845)', diff saved to https://phabricator.wikimedia.org/P47728 and previous config saved to /var/cache/conftool/dbconfig/20230505-013206-ladsgroup.json
[01:32:11] !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM prometheus4002.ulsfo.wmnet
[01:32:14] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance
[01:32:27] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2148.codfw.wmnet with reason: Maintenance
[01:32:33] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2148 (T335845)', diff saved to https://phabricator.wikimedia.org/P47729 and previous config saved to /var/cache/conftool/dbconfig/20230505-013232-ladsgroup.json
[01:32:37] !log denisse@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM prometheus5002.eqsin.wmnet
[01:33:13] (ThanosSidecarNoConnectionToStartedPrometheus) firing: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[01:39:03] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T335845)', diff saved to https://phabricator.wikimedia.org/P47730 and previous config saved to /var/cache/conftool/dbconfig/20230505-013903-ladsgroup.json
[01:39:04] !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM prometheus5002.eqsin.wmnet
[01:39:22] !log denisse@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM prometheus6002.drmrs.wmnet
[01:40:21] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus1006.eqiad.wmnet
[01:41:18] !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host prometheus2006.codfw.wmnet
[01:43:13] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (2) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[01:45:37] !log denisse@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM prometheus6002.drmrs.wmnet
[01:47:30] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:48:30] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:49:54] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49992 bytes in 0.105 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[01:49:58] !log denisse@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host prometheus2006.codfw.wmnet
[01:54:09] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P47731 and previous config saved to /var/cache/conftool/dbconfig/20230505-015409-ladsgroup.json
[01:58:22] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:07:54] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[02:09:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148', diff saved to https://phabricator.wikimedia.org/P47732 and previous config saved to /var/cache/conftool/dbconfig/20230505-020915-ladsgroup.json
[02:13:48] PROBLEM - Check systemd state on gitlab1003 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:14:56] PROBLEM - Check systemd state on gitlab1004 is CRITICAL: CRITICAL - degraded: The following units failed: backup-restore.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:17:02] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:24:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2148 (T335845)', diff saved to https://phabricator.wikimedia.org/P47733 and previous config saved to /var/cache/conftool/dbconfig/20230505-022421-ladsgroup.json
[02:24:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance
[02:24:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2170.codfw.wmnet with reason: Maintenance
[02:24:46] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3311 (T335845)', diff saved to https://phabricator.wikimedia.org/P47734 and previous config saved to /var/cache/conftool/dbconfig/20230505-022446-ladsgroup.json
[02:25:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2170:3312 (T335845)', diff saved to https://phabricator.wikimedia.org/P47735 and previous config saved to /var/cache/conftool/dbconfig/20230505-022510-ladsgroup.json
[02:27:52] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:29:33] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T335835
[02:29:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[02:30:43] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.reboot
[02:30:43] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
[02:30:50] !log ryankemper@cumin1001 START - Cookbook sre.wdqs.reboot
[02:31:18] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T335845)', diff saved to https://phabricator.wikimedia.org/P47736 and previous config saved to /var/cache/conftool/dbconfig/20230505-023118-ladsgroup.json
[02:39:56] PROBLEM - Check systemd state on elastic1087 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:40:24] PROBLEM - Check systemd state on elastic1059 is CRITICAL: CRITICAL - starting: Late bootup, before the job queue becomes idle for the first time, or one of the rescue targets are reached. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:41:13] (SystemdUnitFailed) firing: (7) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:41:27] (SystemdUnitFailed) firing: (7) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:41:28] RECOVERY - Check systemd state on elastic1087 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:41:31] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T335835
[02:41:56] RECOVERY - Check systemd state on elastic1059 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:42:39] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.wdqs.reboot (exit_code=0)
[02:46:13] (SystemdUnitFailed) resolved: (8) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1059:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:46:24] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[02:46:25] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P47737 and previous config saved to /var/cache/conftool/dbconfig/20230505-024624-ladsgroup.json
[02:54:18] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T335835
[02:58:58] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:01:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311', diff saved to https://phabricator.wikimedia.org/P47738 and previous config saved to /var/cache/conftool/dbconfig/20230505-030130-ladsgroup.json
[03:06:20] PROBLEM - Check systemd state on elastic1080 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:06:26] PROBLEM - Check systemd state on elastic1088 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:07:00] PROBLEM - Check systemd state on elastic1058 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:07:54] RECOVERY - Check systemd state on elastic1080 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:07:58] RECOVERY - Check systemd state on elastic1088 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:08:31] (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[03:08:34] RECOVERY - Check systemd state on elastic1058 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:15:56] PROBLEM - Check systemd state on elastic1085 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:16:13] (SystemdUnitFailed) firing: (6) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1077:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:16:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3311 (T335845)', diff saved to https://phabricator.wikimedia.org/P47739 and previous config saved to /var/cache/conftool/dbconfig/20230505-031637-ladsgroup.json
[03:17:30] RECOVERY - Check systemd state on elastic1085 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:17:42] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:19:18] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 11.1 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[03:21:13] (SystemdUnitFailed) resolved: (6) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1077:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:22:54] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T335845)', diff saved to https://phabricator.wikimedia.org/P47740 and previous config saved to /var/cache/conftool/dbconfig/20230505-032253-ladsgroup.json
[03:27:20] PROBLEM - Check systemd state on elastic1056 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:28:38] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:28:54] RECOVERY - Check systemd state on elastic1056 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:31:46] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 15.09 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[03:34:52] RECOVERY - Persistent high iowait on clouddumps1001 is OK: (C)10 ge (W)5 ge 2.9 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[03:38:00] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P47741 and previous config saved to /var/cache/conftool/dbconfig/20230505-033800-ladsgroup.json
[03:47:16] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:48:12] PROBLEM - Check systemd state on elastic1061 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:48:32] PROBLEM - Check systemd state on elastic1063 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:49:46] RECOVERY - Check systemd state on elastic1061 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:50:06] RECOVERY - Check systemd state on elastic1063 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:53:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312', diff saved to https://phabricator.wikimedia.org/P47742 and previous config saved to /var/cache/conftool/dbconfig/20230505-035306-ladsgroup.json
[03:58:22] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 16.83 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[03:58:22] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:02:34] PROBLEM - Check systemd state on elastic1062 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:04:06] RECOVERY - Check systemd state on elastic1062 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:04:08] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T335835
[04:08:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2170:3312 (T335845)', diff saved to https://phabricator.wikimedia.org/P47743 and previous config saved to /var/cache/conftool/dbconfig/20230505-040812-ladsgroup.json
[04:08:18] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance
[04:08:31] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2175.codfw.wmnet with reason: Maintenance
[04:08:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2175 (T335845)', diff saved to https://phabricator.wikimedia.org/P47744 and previous config saved to /var/cache/conftool/dbconfig/20230505-040837-ladsgroup.json
[04:14:48] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T335845)', diff saved to https://phabricator.wikimedia.org/P47745 and previous config saved to /var/cache/conftool/dbconfig/20230505-041448-ladsgroup.json
[04:16:54] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:17:29] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T335835
[04:18:13] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (2) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[04:18:24] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 15.17 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[04:18:40] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T335835
[04:20:12] RECOVERY - Query Service HTTP Port on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.039 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service
[04:20:16] RECOVERY - WDQS SPARQL on wdqs1007 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.066 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
[04:21:47] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T335835
[04:23:13] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (2) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[04:29:32] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:29:55] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P47746 and previous config saved to /var/cache/conftool/dbconfig/20230505-042954-ladsgroup.json
[04:37:22] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 11.55 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[04:40:40] PROBLEM - Check systemd state on elastic1093 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:40:48] PROBLEM - Check systemd state on elastic1094 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:41:13] (SystemdUnitFailed) firing: (6) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1093:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:42:16] RECOVERY - Check systemd state on elastic1093 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:42:24] RECOVERY - Check systemd state on elastic1094 is OK: OK - running:
The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:44:21] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "Good job, doing something along these lines was in my plans to complete the transition to the new scaffolding." [deployment-charts] - 10https://gerrit.wikimedia.org/r/915593 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[04:45:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175', diff saved to https://phabricator.wikimedia.org/P47747 and previous config saved to /var/cache/conftool/dbconfig/20230505-044500-ladsgroup.json
[04:45:46] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "If you are just upping a patch version, then your change should be backwards compatible and you can just move the file when changing it." [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[04:46:13] (SystemdUnitFailed) resolved: (6) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1093:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:47:02] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:47:51] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "We had a file configured here as we assumed that we wouldn't want to send to logstash all the requests for metrics from prometheus, which " [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T303231) (owner: 10JMeybohm)
[04:49:33] (03CR) 10Giuseppe Lavagetto: [C: 03+1] mesh/configuration: Refactor common_http_protocol_options [deployment-charts] - 10https://gerrit.wikimedia.org/r/915672 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[04:50:10] PROBLEM - Check systemd state on elastic1074 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:50:34] PROBLEM - Check systemd state on elastic1075 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:51:13] (SystemdUnitFailed) firing: (12) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1074:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:51:46] RECOVERY - Check systemd state on elastic1074 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:52:10] RECOVERY - Check systemd state on elastic1075 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:54:48] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.97 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[04:56:13] (SystemdUnitFailed) resolved: (12) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1074:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[04:59:32] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.06 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[04:59:40] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:00:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2175 (T335845)', diff saved to https://phabricator.wikimedia.org/P47748 and previous config saved to /var/cache/conftool/dbconfig/20230505-050007-ladsgroup.json
[05:01:13] (SystemdUnitFailed) firing: (18) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1057:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:01:14] PROBLEM - Check systemd state on elastic1057 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:02:48] RECOVERY - Check systemd state on elastic1057 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:04:12] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 10.04 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[05:06:13] (SystemdUnitFailed) resolved: (12) prometheus-wmf-elasticsearch-exporter-9200.service Failed on elastic1057:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:08:56] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 13.41 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore
https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[05:12:02] PROBLEM - Check systemd state on elastic1068 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:12:04] PROBLEM - Check systemd state on elastic1054 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:12:14] PROBLEM - Check systemd state on elastic1073 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9600.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:13:36] RECOVERY - Check systemd state on elastic1068 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:13:36] RECOVERY - Check systemd state on elastic1054 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:13:46] RECOVERY - Check systemd state on elastic1073 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:16:56] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:17:46] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 140, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:18:08] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 241, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:18:18] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 18.11 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[05:21:56] PROBLEM - Check systemd state on elastic1098 is CRITICAL: CRITICAL - degraded: The following units failed: prometheus-wmf-elasticsearch-exporter-9200.service,prometheus-wmf-elasticsearch-exporter-9400.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:22:22] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (3 nodes at a time) for ElasticSearch cluster search_eqiad: eqiad cluster reboot - ryankemper@cumin1001 - T335835
[05:23:30] RECOVERY - Check systemd state on elastic1098 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:26:25] (ThanosRuleHighRuleEvaluationFailures) firing: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[05:27:50] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:36:56] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:38:10] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 141, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[05:46:34] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:47:54] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 13.28 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[05:48:13] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (2) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[05:49:28] RECOVERY - Persistent high iowait on clouddumps1001 is OK: (C)10 ge (W)5 ge 2.736 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[05:58:13] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (2) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[05:59:02] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:00:04] Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230505T0600)
[06:08:06] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 18.96 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[06:08:31] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:08:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[06:15:43] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1003.wikimedia.org
[06:17:42] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:19:30] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1003.wikimedia.org
[06:23:09] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader1004.wikimedia.org
[06:23:10] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "I initially looked at the puppet implementation too, but I have a more fundamental problem with this patch - I think we're mixing stuff tha" [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[06:25:28] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:26:57] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader1004.wikimedia.org
[06:27:36] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2003.wikimedia.org
[06:28:20] RECOVERY - Persistent high iowait on clouddumps1001 is OK: (C)10 ge (W)5 ge 4.691 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[06:30:04] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[06:32:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2003.wikimedia.org
[06:32:27] 10SRE, 10Infrastructure-Foundations, 10observability, 10User-MoritzMuehlenhoff: ipmiseld not running reliably - https://phabricator.wikimedia.org/T305147 (10Marostegui) @herron we've seen this alert flapping on db2180 a lot lately: ` [08:15:42] (SystemdUnitFailed) firing: ipmiseld.servi...
[06:35:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.makevm for new host netflow2003.codfw.wmnet
[06:35:41] !log jmm@cumin2002 START - Cookbook sre.dns.netbox
[06:37:10] !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 136907
[06:38:31] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow2003.codfw.wmnet - jmm@cumin2002"
[06:38:42] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host urldownloader2004.wikimedia.org
[06:39:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM netflow2003.codfw.wmnet - jmm@cumin2002"
[06:39:35] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[06:39:35] !log jmm@cumin2002 START - Cookbook sre.dns.wipe-cache netflow2003.codfw.wmnet on all recursors
[06:39:38] !log jmm@cumin2002 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) netflow2003.codfw.wmnet on all recursors
[06:39:42] !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 136907
[06:43:08] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host urldownloader2004.wikimedia.org
[06:43:56] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 23.52 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[06:44:16] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:44:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2004.codfw.wmnet
[06:46:46] (03PS1) 10Marostegui: wmnet: Failover m2-master [dns] - 10https://gerrit.wikimedia.org/r/916091
[06:48:13] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2004.codfw.wmnet
[06:49:23] !log jmm@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow2003.codfw.wmnet - jmm@cumin2002"
[06:50:12] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host testvm2005.codfw.wmnet
[06:50:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM netflow2003.codfw.wmnet - jmm@cumin2002"
[06:50:27] !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host netflow2003.codfw.wmnet
[06:51:40] !log jmm@cumin2002 START - Cookbook sre.ganeti.reimage for host netflow2003.codfw.wmnet with OS bookworm
[06:51:44] 10SRE-tools, 10Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage was started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm
[06:53:56] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host testvm2005.codfw.wmnet
[06:56:02] (03CR) 10Ayounsi: Adjust Netbox PuppetDB import script to set bridge dev and vlan tags (031 comment) [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/822439 (https://phabricator.wikimedia.org/T296832) (owner: 10Cathal Mooney)
[06:58:43] (03PS1) 10Muehlenhoff: fastnetmon: Remove absented Icinga resource [puppet] - 10https://gerrit.wikimedia.org/r/916096
[06:59:48] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:00:06] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230505T0700)
[07:08:31] (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[07:16:48] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:24:48] RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:29:26] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:31:49] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12 days, 12:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[07:31:51] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12 days, 12:00:00 on db1132.eqiad.wmnet with reason: Maintenance
[07:31:57] !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 12 days, 12:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[07:32:10] !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12 days, 12:00:00 on db1106.eqiad.wmnet with reason: Maintenance
[07:34:06] (03PS1) 10Marostegui: install_server: Do not reimage db1218 [puppet] - 10https://gerrit.wikimedia.org/r/916110
[07:34:44] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db1218 [puppet] - 10https://gerrit.wikimedia.org/r/916110 (owner: 10Marostegui)
[07:36:49] (03CR) 10Ayounsi: [C: 03+1] fastnetmon: Remove absented Icinga resource [puppet] - 10https://gerrit.wikimedia.org/r/916096 (owner: 10Muehlenhoff)
[07:37:53] (03CR) 10Marostegui: [C: 03+1] Define dummy pass for passwords::excimer_ui_server (031 comment) [labs/private] - 10https://gerrit.wikimedia.org/r/910842 (https://phabricator.wikimedia.org/T291015) (owner: 10Krinkle)
[07:38:36] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41059/console" [puppet] - 10https://gerrit.wikimedia.org/r/911920 (owner: 10Dzahn)
[07:46:46] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:48:40] (03CR) 10Jelto: [V: 03+1 C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/911920 (owner: 10Dzahn)
[07:49:10] (03CR) 10Ayounsi: [C: 03+1] "To be extra safe it might be worth disabling puppet on all the hosts that use this profile and testing the change on one or two hosts befo" [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez)
[07:52:05] (03CR) 10JMeybohm: [V: 03+1] Make kubernetes::clusters the central place for k8s config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909687 (https://phabricator.wikimedia.org/T325268) (owner: 10JMeybohm)
[07:56:18] (03CR) 10Filippo Giunchedi: [C: 03+1] fastnetmon: Remove absented Icinga resource [puppet] - 10https://gerrit.wikimedia.org/r/916096 (owner: 10Muehlenhoff)
[07:59:18] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:59:31] (03PS1) 10David Caro: labstore: avoid paging for iowait/throughput alerts [puppet] - 10https://gerrit.wikimedia.org/r/916158
[07:59:55] (03CR) 10CI reject: [V: 04-1] labstore: avoid paging for iowait/throughput alerts [puppet] - 10https://gerrit.wikimedia.org/r/916158 (owner: 10David Caro)
[08:01:38] (03PS2) 10David Caro: labstore: avoid paging for iowait/throughput alerts [puppet] - 10https://gerrit.wikimedia.org/r/916158
[08:04:59] !log hashar@deploy1002 Started deploy [integration/docroot@78e6f40]: build: Updating eslint-config-wikimedia to 0.25.0
[08:05:06] PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 15.81 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[08:05:12] !log hashar@deploy1002 Finished deploy [integration/docroot@78e6f40]: build: Updating eslint-config-wikimedia to 0.25.0 (duration: 00m 13s)
[08:05:55] (03CR) 10Jelto: "Thanks, that would be a great addition! Two comments in line." [puppet] - 10https://gerrit.wikimedia.org/r/913199 (owner: 10EoghanGaffney)
[08:07:27] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.reimage (exit_code=99) for host netflow2003.codfw.wmnet with OS bookworm
[08:07:31] 10SRE-tools, 10Infrastructure-Foundations: Upgrade Fastnetmon to 1.2.4 - https://phabricator.wikimedia.org/T330884 (10ops-monitoring-bot) Cookbook cookbooks.sre.ganeti.reimage started by jmm@cumin2002 for host netflow2003.codfw.wmnet with OS bookworm executed with errors: - netflow2003 (**FAIL**) - Removed f...
[08:07:36] RECOVERY - Check systemd state on gitlab1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:13:13] (ThanosSidecarNoConnectionToStartedPrometheus) firing: (2) Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[08:15:19] !log delete wal and chunks_head from prometheus5002 and prometheus4002 to let prometheus start back up and not crashloop - T309979
[08:15:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:15:23] T309979: Upgrade Prometheus VMs in PoPs to Bullseye - https://phabricator.wikimedia.org/T309979
[08:16:24] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:18:13] (ThanosSidecarNoConnectionToStartedPrometheus) resolved: Thanos Sidecar cannot access Prometheus, even though Prometheus seems healthy and has reloaded WAL. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarNoConnectionToStartedPrometheus
[08:18:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wcqs2002:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[08:20:26] (03CR) 10Filippo Giunchedi: "LGTM, though please port the alerts to alerts.git (see also https://phabricator.wikimedia.org/T309011)" [puppet] - 10https://gerrit.wikimedia.org/r/916158 (owner: 10David Caro)
[08:20:38] (03CR) 10Filippo Giunchedi: [C: 03+1] labstore: avoid paging for iowait/throughput alerts [puppet] - 10https://gerrit.wikimedia.org/r/916158 (owner: 10David Caro)
[08:21:10] (ThanosRuleHighRuleEvaluationFailures) resolved: (2) Thanos Rule is failing to evaluate rules. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/35da848f5f92b2dc612e0c3a0577b8a1/thanos-rule - https://alerts.wikimedia.org/?q=alertname%3DThanosRuleHighRuleEvaluationFailures
[08:23:08] PROBLEM - haproxy failover on dbproxy1012 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy
[08:23:11] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Warmup script does not warm memcached enough - https://phabricator.wikimedia.org/T331981 (10Clement_Goubert) 05Open→03Resolved a:03Clement_Goubert I'm good with that.
[08:23:17] 10SRE, 10Data-Persistence, 10serviceops: 14 March 2023 eqiad Service repooling - https://phabricator.wikimedia.org/T331541 (10Clement_Goubert)
[08:23:25] 10SRE, 10Data-Persistence, 10serviceops, 10Datacenter-Switchover: Post March 2023 Datacenter Switchover Tasks - https://phabricator.wikimedia.org/T328907 (10Clement_Goubert)
[08:23:50] (03CR) 10Jelto: wdqs/wcqs: change discovery name of backends for GUIs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/915737 (owner: 10Dzahn)
[08:27:38] (03CR) 10David Caro: [C: 03+2] labstore: avoid paging for iowait/throughput alerts (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/916158 (owner: 10David Caro)
[08:28:13] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "I don't think we need this at this point." [puppet] - 10https://gerrit.wikimedia.org/r/908214 (owner: 10Hashar)
[08:28:44] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:30:42] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "please correct me if I'm missing something, but this is a worse version of the general alert coming from pybal and registered in icinga."
[alerts] - 10https://gerrit.wikimedia.org/r/908830 (owner: 10Clément Goubert) [08:31:11] (03CR) 10Giuseppe Lavagetto: [C: 03+1] service::node: Remove use_nodejs10 [puppet] - 10https://gerrit.wikimedia.org/r/908226 (owner: 10Muehlenhoff) [08:31:14] RECOVERY - Check systemd state on gitlab1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:32:18] (03CR) 10Giuseppe Lavagetto: service: add comment for spicerack field addition (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/909605 (owner: 10Clément Goubert) [08:32:28] (03Abandoned) 10Clément Goubert: team-sre: add alert on mediawiki pooled percentage [alerts] - 10https://gerrit.wikimedia.org/r/908830 (owner: 10Clément Goubert) [08:36:08] (03CR) 10Muehlenhoff: [C: 03+2] fastnetmon: Remove absented Icinga resource [puppet] - 10https://gerrit.wikimedia.org/r/916096 (owner: 10Muehlenhoff) [08:36:16] 10SRE, 10Infrastructure-Foundations, 10observability, 10User-MoritzMuehlenhoff: ipmiseld not running reliably - https://phabricator.wikimedia.org/T305147 (10MoritzMuehlenhoff) >>! In T305147#8828793, @Marostegui wrote: > @herron we've seen this alert being flapping on db2180 a lot lately: > ` > [08:15:42]... [08:36:35] (03CR) 10Giuseppe Lavagetto: "LGTM overall, you should remove the yaml anchors from values.yaml IMHO." [deployment-charts] - 10https://gerrit.wikimedia.org/r/905947 (https://phabricator.wikimedia.org/T334065) (owner: 10Clément Goubert) [08:36:49] (03Abandoned) 10Muehlenhoff: Failover urldownloaders for reboots [dns] - 10https://gerrit.wikimedia.org/r/915741 (owner: 10Muehlenhoff) [08:37:11] 10SRE, 10Infrastructure-Foundations, 10observability, 10User-MoritzMuehlenhoff: ipmiseld not running reliably - https://phabricator.wikimedia.org/T305147 (10Marostegui) Thanks Moritz, I will work with DCOps on that. 
[08:38:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] cli: remove ms from datefmt [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/910472 (owner: 10Clément Goubert)
[08:38:19] 10ops-codfw, 10DBA: Update firmware for db2180 - https://phabricator.wikimedia.org/T336031 (10Marostegui)
[08:38:31] 10ops-codfw, 10DBA: Update firmware for db2180 - https://phabricator.wikimedia.org/T336031 (10Marostegui) p:05Triage→03Medium
[08:40:19] (03CR) 10Giuseppe Lavagetto: [C: 03+1] "LGTM; do you have a date in mind to make a release?" [puppet] - 10https://gerrit.wikimedia.org/r/910882 (https://phabricator.wikimedia.org/T253547) (owner: 10Krinkle)
[08:40:42] (03Merged) 10jenkins-bot: cli: remove ms from datefmt [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/910472 (owner: 10Clément Goubert)
[08:41:02] (03CR) 10Arturo Borrero Gonzalez: profile::bird::anycast: allow setting the BGP IP address from the profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez)
[08:43:49] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Handle Canonical URL for EntitySchemas [puppet] - 10https://gerrit.wikimedia.org/r/912327 (https://phabricator.wikimedia.org/T225778) (owner: 10Michael Große)
[08:43:58] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Handle Canonical URL for EntitySchemas [deployment-charts] - 10https://gerrit.wikimedia.org/r/912326 (https://phabricator.wikimedia.org/T225778) (owner: 10Michael Große)
[08:45:02] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:48:31] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[08:53:51] PROBLEM - puppet last run on gitlab1003 is CRITICAL: CRITICAL: Puppet last ran 7 hours ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[08:58:41] !log deploy CR914772 on all hosts running Bird
[08:58:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:27] (03CR) 10Ayounsi: [C: 03+2] profile::bird::anycast: allow setting the BGP IP address from the profile [puppet] - 10https://gerrit.wikimedia.org/r/914772 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez)
[09:05:43] (03CR) 10Ladsgroup: [C: 03+1] wmnet: Failover m2-master [dns] - 10https://gerrit.wikimedia.org/r/916091 (owner: 10Marostegui)
[09:06:48] !log hnowlan@deploy1002 Started deploy [restbase/deploy@8aba801]: deploying to host missing from configs
[09:08:11] !log hnowlan@deploy1002 Finished deploy [restbase/deploy@8aba801]: deploying to host missing from configs (duration: 01m 22s)
[09:09:34] (03CR) 10Marostegui: [C: 03+2] wmnet: Failover m2-master [dns] - 10https://gerrit.wikimedia.org/r/916091 (owner: 10Marostegui)
[09:09:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/915810 (https://phabricator.wikimedia.org/T335529) (owner: 10Effie Mouzeli)
[09:10:51] !log Failover m2-master from dbproxy1013 to dbproxy1015
[09:10:52] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/915806 (https://phabricator.wikimedia.org/T335438) (owner: 10Effie Mouzeli)
[09:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:13:38] (03CR) 10Jcrespo: [C: 03+1] "both point to db1195/db1217" [dns] - 10https://gerrit.wikimedia.org/r/916091 (owner: 10Marostegui)
[09:14:54] !log power cycled db1170
[09:14:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:15:37] (03PS11) 10Winston Sung: Add DEPRECATED_LANGUAGE_CODE_MAPPING to wgInterlanguageLinkCodeMap [mediawiki-config] - 10https://gerrit.wikimedia.org/r/558052 (https://phabricator.wikimedia.org/T248352) (owner: 10Fomafix)
[09:21:15] RECOVERY - puppet last run on gitlab1003 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun
[09:22:50] (03PS1) 10Muehlenhoff: Extend changelog [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/916412
[09:24:14] (03PS1) 10Marostegui: db1170: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/916414 (https://phabricator.wikimedia.org/T336033)
[09:24:56] (03CR) 10Marostegui: [C: 03+2] db1170: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/916414 (https://phabricator.wikimedia.org/T336033) (owner: 10Marostegui)
[09:25:41] 10ops-eqiad, 10DBA, 10Patch-For-Review: db1170 is not coming back online - https://phabricator.wikimedia.org/T336033 (10Marostegui)
[09:26:45] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Extend changelog [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/916412 (owner: 10Muehlenhoff)
[09:28:03] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:28:03] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[09:28:09] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on db1170.eqiad.wmnet with reason: Host sad (T336033)
[09:28:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on db1170.eqiad.wmnet with reason: Host sad (T336033)
[09:28:13] T336033: db1170 is not coming back online - https://phabricator.wikimedia.org/T336033
[09:30:43] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (10Clement_Goubert)
[09:34:24] 10SRE, 10ops-eqiad, 10DBA: db1170 is not coming back online - https://phabricator.wikimedia.org/T336033 (10Marostegui) hard reset seems to have worked :)
[09:34:42] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops: Add configuration file support to mw-on-k8s.lua ATS script - https://phabricator.wikimedia.org/T336037 (10Clement_Goubert)
[09:34:53] (03CR) 10David Caro: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/915771 (https://phabricator.wikimedia.org/T325163) (owner: 10Raymond Ndibe)
[09:34:58] 10SRE, 10ops-eqiad, 10DBA: db1170 is not coming back online - https://phabricator.wikimedia.org/T336033 (10Marostegui) 05Open→03Resolved a:03Marostegui
[09:35:35] (03PS1) 10Marostegui: Revert "db1170: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/915712
[09:36:58] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops, and 2 others: Add traffic sampling support to mw-on-k8s.lua ATS script - https://phabricator.wikimedia.org/T336038 (10Clement_Goubert)
[09:37:17] 10SRE, 10MW-on-K8s, 10Traffic, 10serviceops: Add traffic sampling support to mw-on-k8s.lua ATS script - https://phabricator.wikimedia.org/T336038 (10Clement_Goubert)
[09:37:51] (03PS2) 10Giuseppe Lavagetto: scaffold: add support for periodic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/915127
[09:40:39] RECOVERY - haproxy failover on dbproxy1012 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[09:40:43] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb: fix BGP IP address [puppet] - 10https://gerrit.wikimedia.org/r/915476 (https://phabricator.wikimedia.org/T335760) (owner: 10Arturo Borrero Gonzalez)
[09:46:25] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] "FYI, I just uploaded an updated 0.5.6 package with this change to apt.wikimedia.org" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/893708 (owner: 10Lucas Werkmeister (WMDE))
[09:46:45] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:48:33] 10SRE, 10Cloud-Services, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: Modify Bird module to allow source IP to be passed to template - https://phabricator.wikimedia.org/T335760 (10aborrero) 05Open→03Resolved a:03aborrero all 3 cloudlb servers should be using the right IP for BGP now:...
[09:55:20] (03CR) 10Jbond: [C: 03+2] utils: rm hiera_lookup (replaced by puppet lookup) [puppet] - 10https://gerrit.wikimedia.org/r/908214 (owner: 10Hashar)
[09:58:52] (03PS1) 10Majavah: prometheus::server: fix target and rule purge rules [puppet] - 10https://gerrit.wikimedia.org/r/916422
[10:01:58] (03PS1) 10Majavah: P:lvs::configuration: replace labsproject with wmcs_project [puppet] - 10https://gerrit.wikimedia.org/r/916424
[10:02:00] (03PS1) 10Majavah: realm: stop setting labsproject [puppet] - 10https://gerrit.wikimedia.org/r/916425
[10:04:26] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM, I'll merge on Monday" [puppet] - 10https://gerrit.wikimedia.org/r/916422 (owner: 10Majavah)
[10:06:19] RECOVERY - Persistent high iowait on clouddumps1001 is OK: (C)1e+04 ge (W)5 ge 4.474 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Shared_storage%23Dumps https://grafana.wikimedia.org/d/000000568/wmcs-dumps-general-view
[10:08:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[10:10:32] (03PS1) 10Jbond: openstack::codfw1dev: move defaults to common location in hiera [puppet] - 10https://gerrit.wikimedia.org/r/916431
[10:11:34] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41060/console" [puppet] - 10https://gerrit.wikimedia.org/r/916431 (owner: 10Jbond)
[10:11:36] (03CR) 10Jbond: openstack::codfw1dev: move defaults to common location in hiera [puppet] - 10https://gerrit.wikimedia.org/r/916431 (owner: 10Jbond)
[10:12:29] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] openstack::codfw1dev: move defaults to common location in hiera [puppet] - 10https://gerrit.wikimedia.org/r/916431 (owner: 10Jbond)
[10:12:50] (03CR) 10Jbond: [C: 03+2] openstack::codfw1dev: move defaults to common location in hiera [puppet] - 10https://gerrit.wikimedia.org/r/916431 (owner: 10Jbond)
[10:25:23] (03PS1) 10Muehlenhoff: Move duplicity check for apt keyrings to !defined [puppet] - 10https://gerrit.wikimedia.org/r/916434 (https://phabricator.wikimedia.org/T330495)
[10:27:53] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:28:08] (03CR) 10Muehlenhoff: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/916434 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[10:34:50] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[10:35:01] (03PS7) 10JMeybohm: Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T300324)
[10:35:03] (03PS3) 10JMeybohm: mesh/configuration: Refactor common_http_protocol_options [deployment-charts] - 10https://gerrit.wikimedia.org/r/915672 (https://phabricator.wikimedia.org/T300324)
[10:35:49] (03Abandoned) 10JMeybohm: Copy configuration_1.2.0 to configuration_1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/915457 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[10:36:14] (03CR) 10JMeybohm: Refactor envoy access_log_path to access loggers (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm)
[10:38:09] (03PS8) 10JMeybohm: Refactor envoy access_log_path to access loggers [deployment-charts] - 10https://gerrit.wikimedia.org/r/915458 (https://phabricator.wikimedia.org/T300324)
[10:38:11] (03PS4) 10JMeybohm: mesh/configuration: Refactor common_http_protocol_options [deployment-charts] - 10https://gerrit.wikimedia.org/r/915672 (https://phabricator.wikimedia.org/T300324)
[10:40:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:40:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P47750 and previous config saved to /var/cache/conftool/dbconfig/20230505-104050-ladsgroup.json
[10:41:12] !log installing wireshark security updates
[10:41:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:41:36] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P47751 and previous config saved to /var/cache/conftool/dbconfig/20230505-104135-ladsgroup.json
[10:42:12] (03PS2) 10Ladsgroup: Revert "db1170: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/915712 (owner: 10Marostegui)
[10:42:19] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Revert "db1170: Disable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/915712 (owner: 10Marostegui)
[10:45:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[10:46:20] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:55:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P47752 and previous config saved to /var/cache/conftool/dbconfig/20230505-105555-ladsgroup.json
[10:56:33] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/915780 (https://phabricator.wikimedia.org/T335101) (owner: 10Effie Mouzeli)
[10:56:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P47753 and previous config saved to /var/cache/conftool/dbconfig/20230505-105640-ladsgroup.json
[10:56:57] (03CR) 10Jbond: [C: 03+1] data.yaml: Add Ellen Rayfield to restricted [puppet] - 10https://gerrit.wikimedia.org/r/915806 (https://phabricator.wikimedia.org/T335438) (owner: 10Effie Mouzeli)
[10:57:04] (03CR) 10Jbond: [C: 03+1] data.yaml: Add Julia Kieserman to restricted [puppet] - 10https://gerrit.wikimedia.org/r/915810 (https://phabricator.wikimedia.org/T335529) (owner: 10Effie Mouzeli)
[10:57:14] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Connect - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:58:58] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[10:59:36] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/916434 (https://phabricator.wikimedia.org/T330495) (owner: 10Muehlenhoff)
[11:00:20] (03PS2) 10Jelto: gitlab: enable and run partial backups daily [puppet] - 10https://gerrit.wikimedia.org/r/912791 (https://phabricator.wikimedia.org/T316935)
[11:01:07] (03PS3) 10Jelto: gitlab: enable and run partial backups daily [puppet] - 10https://gerrit.wikimedia.org/r/912791 (https://phabricator.wikimedia.org/T316935)
[11:02:34] (03CR) 10Jelto: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41061/console" [puppet] - 10https://gerrit.wikimedia.org/r/912791 (https://phabricator.wikimedia.org/T316935) (owner: 10Jelto)
[11:04:53] (03PS4) 10Arturo Borrero Gonzalez: templates: add 20.172.in-addr.arpa [dns] - 10https://gerrit.wikimedia.org/r/914751 (https://phabricator.wikimedia.org/T335759)
[11:07:25] (03CR) 10Jelto: [V: 03+1] gitlab: enable and run partial backups daily (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/912791 (https://phabricator.wikimedia.org/T316935) (owner: 10Jelto)
[11:08:31] (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[11:10:24] (03PS1) 10Jbond: idp: add gitlab oidc entry [puppet] - 10https://gerrit.wikimedia.org/r/916455 (https://phabricator.wikimedia.org/T320390)
[11:10:56] (03CR) 10Clément Goubert: team-sre: add alert on mediawiki pooled percentage (031 comment) [alerts] - 10https://gerrit.wikimedia.org/r/908830 (owner: 10Clément Goubert)
[11:11:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P47754 and previous config saved to /var/cache/conftool/dbconfig/20230505-111100-ladsgroup.json
[11:11:16] (03CR) 10Jbond: [C: 03+2] idp: add gitlab oidc entry [puppet] - 10https://gerrit.wikimedia.org/r/916455 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond)
[11:11:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P47755 and previous config saved to /var/cache/conftool/dbconfig/20230505-111145-ladsgroup.json
[11:12:42] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 6, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[11:14:05] (03CR) 10Jelto: "two comments in-line" [puppet] - 10https://gerrit.wikimedia.org/r/916455 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond)
[11:15:58] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:26:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1170:3312 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P47756 and previous config saved to /var/cache/conftool/dbconfig/20230505-112605-ladsgroup.json
[11:26:26] (03PS1) 10Arturo Borrero Gonzalez: Revert "Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m""" [puppet] - 10https://gerrit.wikimedia.org/r/915713
[11:26:50] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1170:3317 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P47757 and previous config saved to /var/cache/conftool/dbconfig/20230505-112649-ladsgroup.json
[11:28:24] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:29:53] (03PS2) 10Arturo Borrero Gonzalez: Revert "Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m""" [puppet] - 10https://gerrit.wikimedia.org/r/915713 (https://phabricator.wikimedia.org/T335943)
[11:32:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "Revert "Revert "P:wmcs::prometheus: set openstack scrape_interval to 4m""" [puppet] - 10https://gerrit.wikimedia.org/r/915713 (https://phabricator.wikimedia.org/T335943) (owner: 10Arturo Borrero Gonzalez)
[11:51:07] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[11:51:20] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1182.eqiad.wmnet with reason: Maintenance
[11:51:26] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1182 (T335845)', diff saved to https://phabricator.wikimedia.org/P47758 and previous config saved to /var/cache/conftool/dbconfig/20230505-115126-ladsgroup.json
[11:52:05] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-db1002.eqiad.wmnet
[11:52:34] (03PS1) 10Arturo Borrero Gonzalez: clod_private_subnet: fix BGP neighbors [puppet] - 10https://gerrit.wikimedia.org/r/916464 (https://phabricator.wikimedia.org/T324992)
[11:53:22] (03PS2) 10Arturo Borrero Gonzalez: clod_private_subnet: fix BGP neighbors [puppet] - 10https://gerrit.wikimedia.org/r/916464 (https://phabricator.wikimedia.org/T324992)
[11:53:56] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:56:29] (03CR) 10Jcrespo: [C: 03+1] "ok on the implementation, but I haven't checked it works, just to be transparent." [puppet] - 10https://gerrit.wikimedia.org/r/912791 (https://phabricator.wikimedia.org/T316935) (owner: 10Jelto)
[11:58:22] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:58:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T335845)', diff saved to https://phabricator.wikimedia.org/P47759 and previous config saved to /var/cache/conftool/dbconfig/20230505-115830-ladsgroup.json
[11:58:48] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-db1002.eqiad.wmnet
[11:59:55] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-mariadb1001.eqiad.wmnet
[12:05:44] PROBLEM - SSH on cloudbackup2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:05:58] (03CR) 10Effie Mouzeli: [C: 03+2] data.yaml: Add Hasan Akgün (WMDE) to restricted [puppet] - 10https://gerrit.wikimedia.org/r/915780 (https://phabricator.wikimedia.org/T335101) (owner: 10Effie Mouzeli)
[12:06:43] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-mariadb1001.eqiad.wmnet
[12:06:45] (03CR) 10Effie Mouzeli: [C: 03+2] data.yaml: Add Ellen Rayfield to restricted [puppet] - 10https://gerrit.wikimedia.org/r/915806 (https://phabricator.wikimedia.org/T335438) (owner: 10Effie Mouzeli)
[12:06:54] (03PS3) 10Effie Mouzeli: data.yaml: Add Ellen Rayfield to restricted [puppet] - 10https://gerrit.wikimedia.org/r/915806 (https://phabricator.wikimedia.org/T335438)
[12:07:08] there will be now a temporary bacula job prometheus failure (job only has 1 backend)
[12:07:14] RECOVERY - SSH on cloudbackup2001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:07:15] (03PS2) 10Effie Mouzeli: data.yaml: Add Julia Kieserman to restricted [puppet] - 10https://gerrit.wikimedia.org/r/915810 (https://phabricator.wikimedia.org/T335529)
[12:10:01] (03CR) 10Effie Mouzeli: [C: 03+2] data.yaml: Add Julia Kieserman to restricted [puppet] - 10https://gerrit.wikimedia.org/r/915810 (https://phabricator.wikimedia.org/T335529) (owner: 10Effie Mouzeli)
[12:11:54] PROBLEM - SSH on cloudbackup2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
[12:13:37] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P47760 and previous config saved to /var/cache/conftool/dbconfig/20230505-121336-ladsgroup.json
[12:16:52] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:18:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wcqs2002:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag
[12:24:32] (03CR) 10Cathal Mooney: [C: 03+1] "LGTM I think that should do it thanks :)" [puppet] - 10https://gerrit.wikimedia.org/r/916464 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[12:24:43] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-mariadb1002.eqiad.wmnet
[12:24:48] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] clod_private_subnet: fix BGP neighbors [puppet] - 10https://gerrit.wikimedia.org/r/916464 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[12:27:40] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:28:43] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182', diff saved to https://phabricator.wikimedia.org/P47761 and previous config saved to /var/cache/conftool/dbconfig/20230505-122843-ladsgroup.json
[12:31:29] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-mariadb1002.eqiad.wmnet
[12:33:14] 10SRE, 10Wikidata, 10wdwb-tech, 10Patch-For-Review, and 4 others: [ES-M2]: Define canonical URI for EntitySchemas - https://phabricator.wikimedia.org/T225778 (10Michael) The overall functionality, that is expected after this is deployed, can be tested by going to `/entity/E123` and being redirected...
[12:42:02] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10BTullis)
[12:43:49] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1182 (T335845)', diff saved to https://phabricator.wikimedia.org/P47762 and previous config saved to /var/cache/conftool/dbconfig/20230505-124349-ladsgroup.json
[12:43:54] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[12:44:07] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1188.eqiad.wmnet with reason: Maintenance
[12:44:13] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1188 (T335845)', diff saved to https://phabricator.wikimedia.org/P47763 and previous config saved to /var/cache/conftool/dbconfig/20230505-124412-ladsgroup.json
[12:46:14] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:46:53] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host cephosd1001.eqiad.wmnet
[12:50:38] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T335845)', diff saved to https://phabricator.wikimedia.org/P47764 and previous config saved to /var/cache/conftool/dbconfig/20230505-125038-ladsgroup.json
[12:56:06] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1001.eqiad.wmnet
[12:56:43] ping Lucas_WMDE as you're not in -dev, if you wanna chat here instead of on that patch :p
[12:56:55] * Lucas_WMDE hasn’t heard of that channel
[12:57:17] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host cephosd1002.eqiad.wmnet
[12:57:18] #wikimedia-dev...? o.O
[12:57:19] (03PS4) 10EoghanGaffney: [gitlab/failover] Rename host flags [cookbooks] - 10https://gerrit.wikimedia.org/r/911951 (https://phabricator.wikimedia.org/T330771)
[12:57:42] 10SRE, 10Wikidata, 10wdwb-tech, 10Patch-For-Review, and 4 others: [ES-M2]: Define canonical URI for EntitySchemas - https://phabricator.wikimedia.org/T225778 (10Arian_Bozorg)
[12:58:03] aka the 'there is only wikibugs' channel
[12:58:38] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:02:37] I’m already in #mediawiki-core and #wikimedia-tech, how many more channels do you want :P
[13:05:26] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to}/{provider} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX
[13:05:38] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1002.eqiad.wmnet
[13:05:40] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host cephosd1003.eqiad.wmnet
[13:05:44] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P47765 and previous config saved to /var/cache/conftool/dbconfig/20230505-130544-ladsgroup.json
[13:06:52] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX
[13:08:38] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10BTullis) I've confirmed information related to [[https://gerrit.wikimedia.org/g/analytics/datahub|analytics/datahub]] above. Whilst we could move it...
[13:09:51] (03CR) 10EoghanGaffney: [C: 03+1] gitlab: enable and run partial backups daily [puppet] - 10https://gerrit.wikimedia.org/r/912791 (https://phabricator.wikimedia.org/T316935) (owner: 10Jelto)
[13:13:13] !log rebooting cloudbackup2001.codfw.wmnet, unresponsive
[13:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:14:44] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1003.eqiad.wmnet
[13:14:46] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host cephosd1004.eqiad.wmnet
[13:16:38] PROBLEM - Host cloudbackup2001 is DOWN: PING CRITICAL - Packet loss = 100%
[13:17:14] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:17:49] (03CR) 10JHathaway: [C: 03+1] utils: rm hiera_lookup (replaced by puppet lookup) [puppet] - 10https://gerrit.wikimedia.org/r/908214 (owner: 10Hashar)
[13:18:33] (03PS1) 10AikoChou: Add mediawiki.page_outlink_topic_prediction_change stream [mediawiki-config] - 10https://gerrit.wikimedia.org/r/915789 (https://phabricator.wikimedia.org/T328899)
[13:20:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188', diff saved to https://phabricator.wikimedia.org/P47766 and previous config saved to /var/cache/conftool/dbconfig/20230505-132050-ladsgroup.json
[13:21:38] RECOVERY - SSH on cloudbackup2001 is OK: SSH OK - OpenSSH_8.4p1 Debian-5+deb11u1 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
[13:21:40] RECOVERY - Host cloudbackup2001 is UP: PING OK - Packet loss = 0%, RTA = 31.64 ms
[13:23:42] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx2001.wikimedia.org with reason: New kernel, T335835
[13:23:55] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx2001.wikimedia.org with reason: New kernel, T335835
[13:24:10] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1004.eqiad.wmnet
[13:24:12] !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host cephosd1005.eqiad.wmnet
[13:25:51] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: New kernel, T335835
[13:26:04] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: New kernel, T335835
[13:26:58] PROBLEM - Check systemd state on cloudbackup2001 is CRITICAL: CRITICAL - degraded: The following units failed: dm-event.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:28:04] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:28:25] (03PS1) 10Jbond: gitlab: refactor omniauth providers to a data structure [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390)
[13:29:05] (03CR) 10CI reject: [V: 04-1] gitlab: refactor omniauth providers to a data structure [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond)
[13:29:54] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41063/console" [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond)
[13:30:42] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mirror1001.wikimedia.org with reason: New kernel, T335835
[13:30:55] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mirror1001.wikimedia.org with reason: New kernel, T335835
[13:31:19] 10SRE, 10SRE-Access-Requests: Requesting access to restricted for HasanAkgun_WMDE - https://phabricator.wikimedia.org/T335101 (10jijiki) 05In progress→03Resolved
[13:31:48] 10SRE, 10SRE-Access-Requests: Requesting access to bastions and mwmaint servers for erayfield - https://phabricator.wikimedia.org/T335438 (10jijiki) 05Open→03Resolved a:03jijiki
[13:31:50] 10SRE, 10SRE-Access-Requests: Requesting access to bastions and mwmaint for jkieserman - https://phabricator.wikimedia.org/T335529 (10jijiki) 05Open→03Resolved a:03jijiki
[13:33:35] !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cephosd1005.eqiad.wmnet
[13:35:33] 10SRE, 10SRE-Access-Requests: Requesting Analytics access for Surbhi Gupta - https://phabricator.wikimedia.org/T335657 (10jijiki)
[13:35:57] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1188 (T335845)', diff saved to https://phabricator.wikimedia.org/P47767 and previous config saved to /var/cache/conftool/dbconfig/20230505-133556-ladsgroup.json
[13:36:01] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[13:36:25] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1197.eqiad.wmnet with reason: Maintenance
[13:36:31] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1197 (T335845)', diff saved to https://phabricator.wikimedia.org/P47768 and previous config saved to /var/cache/conftool/dbconfig/20230505-133631-ladsgroup.json
[13:39:40] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on lists1003.wikimedia.org with reason: New kernel, T335835
[13:39:53] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on lists1003.wikimedia.org with reason: New kernel, T335835
[13:43:58] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T335845)', diff saved to https://phabricator.wikimedia.org/P47769 and previous config saved to /var/cache/conftool/dbconfig/20230505-134358-ladsgroup.json
[13:46:48] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:47:00] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-ctrl1001.eqiad.wmnet with reason: New kernel, T335835
[13:47:13] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-ctrl1001.eqiad.wmnet with reason: New kernel, T335835
[13:47:14] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-ctrl1002.eqiad.wmnet with reason: New kernel, T335835
[13:47:27] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-ctrl1002.eqiad.wmnet with reason: New kernel, T335835
[13:47:29] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-etcd1001.eqiad.wmnet with reason: New kernel, T335835
[13:47:42] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-etcd1001.eqiad.wmnet with reason: New kernel, T335835
[13:47:43] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-etcd1002.eqiad.wmnet with reason: New kernel, T335835
[13:47:56] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-etcd1002.eqiad.wmnet with reason: New kernel, T335835
[13:47:58] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-etcd1003.eqiad.wmnet with reason: New kernel, T335835
[13:48:11] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-etcd1003.eqiad.wmnet with reason: New kernel, T335835
[13:48:12] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-worker1001.eqiad.wmnet with reason: New kernel, T335835
[13:48:25] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-worker1001.eqiad.wmnet with reason: New kernel, T335835
[13:48:27] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on aux-k8s-worker1002.eqiad.wmnet with reason: New kernel, T335835
[13:48:39] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on aux-k8s-worker1002.eqiad.wmnet with reason: New kernel, T335835
[13:52:22] (03PS1) 10Effie Mouzeli: data.yaml: Add Surbhi Gupta [puppet] - 10https://gerrit.wikimedia.org/r/916491 (https://phabricator.wikimedia.org/T335657)
[13:55:14] jouncebot: nowandnext
[13:55:14] For the next 17 hour(s) and 4 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230505T0700)
[13:55:14] In 17 hour(s) and 4 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230506T0700)
[13:55:19] cool
[13:56:13] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye
[13:56:22] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye
[13:56:24] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye
[13:56:31] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed w...
[13:57:52] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[13:57:59] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye
[13:58:09] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye
[13:59:05] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P47770 and previous config saved to /var/cache/conftool/dbconfig/20230505-135904-ladsgroup.json
[13:59:49] 10SRE, 10LDAP-Access-Requests: Grant Access to ldap/wmde for lojo - https://phabricator.wikimedia.org/T335858 (10jijiki) @lojo Hi! I am the clinic duty person, can you help me understand what is the "Wikibase Suite Team development" service you refer to in the description?
[14:00:29] (03PS2) 10Jbond: gitlab: refactor omniauth providers to a data structure [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390)
[14:01:50] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41064/console" [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond)
[14:04:02] (03PS3) 10Jbond: gitlab: refactor omniauth providers to a data structure [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390)
[14:04:09] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.dhcp (exit_code=99) for host lvs2011.codfw.wmnet
[14:04:12] !log sukhe@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye
[14:04:21] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed w...
[14:05:10] !log sukhe@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye
[14:05:21] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by sukhe@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye
[14:05:25] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41065/console" [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond)
[14:08:00] (03PS4) 10Jbond: gitlab: refactor omniauth providers to a data structure [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390)
[14:08:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps
[14:08:41] (03CR) 10CI reject: [V: 04-1] gitlab: refactor omniauth providers to a data structure [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond)
[14:09:52] (03PS5) 10Jbond: gitlab: refactor omniauth providers to a data structure [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390)
[14:11:09] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41067/console" [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond)
[14:11:48] 10SRE, 10ops-codfw, 10Cloud-Services, 10cloud-services-team: cloudbackup2001 lockup on 2023-05-05 - https://phabricator.wikimedia.org/T336060 (10Andrew)
[14:12:35] (03PS1) 10Muehlenhoff: os-updates: Generate an additional overview page with a breakdown per SRE sub team [puppet] - 10https://gerrit.wikimedia.org/r/916493
[14:14:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197', diff saved to https://phabricator.wikimedia.org/P47771 and previous config saved to /var/cache/conftool/dbconfig/20230505-141410-ladsgroup.json
[14:14:45] (03CR) 10CI reject: [V: 04-1] os-updates: Generate an additional overview page with a breakdown per SRE sub team [puppet] - 10https://gerrit.wikimedia.org/r/916493 (owner: 10Muehlenhoff)
[14:16:34] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:21:20] (03PS1) 10JMeybohm: envoy: Move upstream HTTP config into the new HttpProtocolOptions message [puppet] - 10https://gerrit.wikimedia.org/r/916498 (https://phabricator.wikimedia.org/T303230)
[14:21:24] (03PS1) 10JMeybohm: envoyproxy: Add python 3.11 to tox [puppet] - 10https://gerrit.wikimedia.org/r/916499 (https://phabricator.wikimedia.org/T300324)
[14:26:13] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cuminunpriv1001.eqiad.wmnet
[14:26:27] !log btullis@cumin1001 START - Cookbook sre.presto.reboot-workers for Presto analytics cluster: Reboot Presto nodes
[14:29:12] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:29:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1197 (T335845)', diff saved to https://phabricator.wikimedia.org/P47772 and previous config saved to /var/cache/conftool/dbconfig/20230505-142917-ladsgroup.json
[14:29:22] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[14:29:34] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1222.eqiad.wmnet with reason: Maintenance
[14:29:41] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1222 (T335845)', diff saved to https://phabricator.wikimedia.org/P47773 and previous config saved to /var/cache/conftool/dbconfig/20230505-142940-ladsgroup.json
[14:30:05] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cuminunpriv1001.eqiad.wmnet
[14:31:42] (03PS1) 10Hnowlan: thumbor: haproxy timeout changes, block /metrics [deployment-charts] - 10https://gerrit.wikimedia.org/r/916506 (https://phabricator.wikimedia.org/T334488)
[14:33:45] (03PS3) 10Giuseppe Lavagetto: scaffold: add support for periodic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/915127
[14:34:50] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[14:35:30] (03CR) 10CI reject: [V: 04-1] scaffold: add support for periodic jobs [deployment-charts] - 10https://gerrit.wikimedia.org/r/915127 (owner: 10Giuseppe Lavagetto)
[14:37:04] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T335845)', diff saved to https://phabricator.wikimedia.org/P47774 and previous config saved to /var/cache/conftool/dbconfig/20230505-143703-ladsgroup.json
[14:37:54] (03CR) 10Muehlenhoff: data.yaml: Add Surbhi Gupta (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/916491 (https://phabricator.wikimedia.org/T335657) (owner: 10Effie Mouzeli)
[14:38:05] 10ops-codfw, 10Traffic, 10Patch-For-Review: Q4:rack/decom codfw unified decommission task - https://phabricator.wikimedia.org/T335777 (10Jhancock.wm)
[14:43:48] (03PS2) 10Muehlenhoff: os-updates: Generate an additional overview page with a breakdown per SRE sub team [puppet] - 10https://gerrit.wikimedia.org/r/916493
[14:46:01] (03CR) 10CI reject: [V: 04-1] os-updates: Generate an additional overview page with a breakdown per SRE sub team [puppet] - 10https://gerrit.wikimedia.org/r/916493 (owner: 10Muehlenhoff)
[14:46:34] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:47:44] PROBLEM - Check systemd state on aux-k8s-ctrl1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:48:00] PROBLEM - Check systemd state on aux-k8s-ctrl1001 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:51:04] (03PS3) 10Muehlenhoff: os-updates: Generate an additional overview page with a breakdown per SRE sub team [puppet] - 10https://gerrit.wikimedia.org/r/916493
[14:52:01] (03PS3) 10EoghanGaffney: [gitlab/runner] Add basic pool/depool commands [puppet] - 10https://gerrit.wikimedia.org/r/913199
[14:52:10] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P47776 and previous config saved to /var/cache/conftool/dbconfig/20230505-145209-ladsgroup.json
[14:52:16] (03PS1) 10Jbond: Gitlab: Support OIDC alongside CAS for OmniAuth in Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390)
[14:53:20] (03CR) 10CI reject: [V: 04-1] os-updates: Generate an additional overview page with a breakdown per SRE sub team [puppet] - 10https://gerrit.wikimedia.org/r/916493 (owner: 10Muehlenhoff)
[14:53:49] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41068/console" [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond)
[14:54:37] (03PS4) 10Muehlenhoff: os-updates: Generate an additional overview page with a breakdown per SRE team [puppet] - 10https://gerrit.wikimedia.org/r/916493
[14:56:12] (03PS2) 10Jbond: Gitlab: Support OIDC alongside CAS for OmniAuth in Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390)
[14:58:58] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:59:12] PROBLEM - Check whether ferm is active by checking the default input chain on aux-k8s-ctrl1001 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:04:55] (03PS1) 10Muehlenhoff: Add new url downloaders to ACLs [deployment-charts] - 10https://gerrit.wikimedia.org/r/916512
[15:05:48] (03PS2) 10Muehlenhoff: Add new url downloaders to ACLs [deployment-charts] - 10https://gerrit.wikimedia.org/r/916512 (https://phabricator.wikimedia.org/T329945)
[15:06:15] !log mforns@deploy1002 Started deploy [airflow-dags/analytics@11fa4e1]: (no justification provided)
[15:06:29] !log mforns@deploy1002 Finished deploy [airflow-dags/analytics@11fa4e1]: (no justification provided) (duration: 00m 13s)
[15:07:16] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222', diff saved to https://phabricator.wikimedia.org/P47777 and previous config saved to /var/cache/conftool/dbconfig/20230505-150716-ladsgroup.json
[15:08:19] (03PS3) 10Hashar: ci: daily update all git cache repositories [puppet] - 10https://gerrit.wikimedia.org/r/914710
[15:08:21] (03PS2) 10Hashar: ci: add a couple extensions to git cache [puppet] - 10https://gerrit.wikimedia.org/r/914711
[15:08:23] (03PS1) 10Hashar: ci: use an array to manage gitcache repos [puppet] - 10https://gerrit.wikimedia.org/r/916514
[15:08:25] (03PS1) 10Hashar: ci: rm gitcache absented timers [puppet] - 10https://gerrit.wikimedia.org/r/916515
[15:08:31] (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[15:08:56] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/916514 (owner: 10Hashar)
[15:08:58] (03CR) 10CI reject: [V: 04-1] ci: add a couple extensions to git cache [puppet] - 10https://gerrit.wikimedia.org/r/914711 (owner: 10Hashar)
[15:09:04] (03CR) 10CI reject: [V: 04-1] ci: use an array to manage gitcache repos [puppet] - 10https://gerrit.wikimedia.org/r/916514 (owner: 10Hashar)
[15:09:06] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/914710 (owner: 10Hashar)
[15:09:09] GRRRRRR
[15:09:16] (03CR) 10CI reject: [V: 04-1] ci: rm gitcache absented timers [puppet] - 10https://gerrit.wikimedia.org/r/916515 (owner: 10Hashar)
[15:11:24] PROBLEM - BGP status on cloudsw1-b1-codfw.mgmt is CRITICAL: BGP CRITICAL - AS64605/IPv4: Active - Anycast https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:12:22] PROBLEM - Bird Internet Routing Daemon on cloudlb2001-dev is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[15:12:24] (03PS6) 10Jbond: gitlab: refactor omniauth providers to a data structure [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390)
[15:12:26] PROBLEM - Check systemd state on cloudlb2001-dev is CRITICAL: CRITICAL - degraded: The following units failed: bird.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:12:26] (03PS3) 10Jbond: Gitlab: Support OIDC alongside CAS for OmniAuth in Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390)
[15:12:43] (03PS2) 10Hashar: ci: use an array to manage gitcache repos [puppet] - 10https://gerrit.wikimedia.org/r/916514
[15:12:45] (03PS3) 10Hashar: ci: add a couple extensions to git cache [puppet] - 10https://gerrit.wikimedia.org/r/914711
[15:12:47] (03PS2) 10Hashar: ci: rm gitcache absented timers [puppet] - 10https://gerrit.wikimedia.org/r/916515
[15:15:34] (KubernetesAPILatency) firing: High Kubernetes API latency (GET pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:16:06] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:16:22] PROBLEM - Check whether ferm is active by checking the default input chain on aux-k8s-ctrl1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
[15:19:24] PROBLEM - Check if anycast-healthchecker and all configured threads are running on cloudlb2001-dev is CRITICAL: CRITICAL: anycast-healthchecker could be down as pid file /var/run/anycast-healthchecker/anycast-healthchecker.pid doesnt exist https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[15:20:34] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[15:22:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1222 (T335845)', diff saved to https://phabricator.wikimedia.org/P47778 and previous config saved to /var/cache/conftool/dbconfig/20230505-152222-ladsgroup.json
[15:22:27] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[15:22:40] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1225.eqiad.wmnet with reason: Maintenance
[15:27:01] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1020.eqiad.wmnet
[15:27:04] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1019.eqiad.wmnet
[15:28:32] 10SRE, 10Gerrit, 10Release-Engineering-Team, 10serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (10hashar) I wasn't aware about this task until yesterday (via T335730). I'd like the new host to be added first as a replica rather than an entirely new pri...
[15:28:36] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:29:29] (03PS4) 10EoghanGaffney: [gitlab/runner] Add basic pool/depool commands [puppet] - 10https://gerrit.wikimedia.org/r/913199
[15:31:07] (03CR) 10Jbond: "before the secret" [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond)
[15:31:21] (03PS7) 10Jbond: gitlab: refactor omniauth providers to a data structure [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390)
[15:31:23] (03PS4) 10Jbond: Gitlab: Support OIDC alongside CAS for OmniAuth in Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390)
[15:31:25] (03PS1) 10Jbond: gitlab: omniauth_sync_profile_attributes shuold be a list [puppet] - 10https://gerrit.wikimedia.org/r/916516 (https://phabricator.wikimedia.org/T320390)
[15:32:23] (03CR) 10CI reject: [V: 04-1] gitlab: omniauth_sync_profile_attributes shuold be a list [puppet] - 10https://gerrit.wikimedia.org/r/916516 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond)
[15:33:31] (03PS1) 10Andrew Bogott: Remove references to cloudvirt1019 and 1020 [puppet] - 10https://gerrit.wikimedia.org/r/916517 (https://phabricator.wikimedia.org/T336063)
[15:33:57] (03CR) 10Jbond: "thanks for the PS, i started commenting on the change but then i thought it would probably be better to give the code a bit of a refactor " [puppet] - 10https://gerrit.wikimedia.org/r/915701 (https://phabricator.wikimedia.org/T320390) (owner: 10Chad)
[15:34:28] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10isarantopoulos) Wow, sorry for the noise by referencing wrong ticket :)
[15:35:01] (03PS2) 10Jbond: gitlab: omniauth_sync_profile_attributes shuold be a list [puppet] - 10https://gerrit.wikimedia.org/r/916516 (https://phabricator.wikimedia.org/T320390)
[15:36:22] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41070/console" [puppet] - 10https://gerrit.wikimedia.org/r/916516 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond)
[15:36:43] (03PS8) 10Jbond: gitlab: refactor omniauth providers to a data structure [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390)
[15:37:44] !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[15:38:04] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41071/console" [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond)
[15:38:55] (03CR) 10Andrew Bogott: [C: 03+2] Remove references to cloudvirt1019 and 1020 [puppet] - 10https://gerrit.wikimedia.org/r/916517 (https://phabricator.wikimedia.org/T336063) (owner: 10Andrew Bogott)
[15:39:26] !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[15:39:44] RECOVERY - Check if anycast-healthchecker and all configured threads are running on cloudlb2001-dev is OK: OK: UP (pid=744000) and all threads (1) are running https://wikitech.wikimedia.org/wiki/Anycast%23Anycast_healthchecker_not_running
[15:40:09] !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1020.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
[15:40:21] (03PS4) 10Hashar: ci: daily update all git cache repositories [puppet] - 10https://gerrit.wikimedia.org/r/914710
[15:40:23] (03PS3) 10Hashar: ci: use an array to manage gitcache repos [puppet] - 10https://gerrit.wikimedia.org/r/916514
[15:40:25] (03PS4) 10Hashar: ci: add a couple extensions to git cache [puppet] - 10https://gerrit.wikimedia.org/r/914711
[15:40:27] (03PS3) 10Hashar: ci: rm gitcache absented timers [puppet] - 10https://gerrit.wikimedia.org/r/916515
[15:40:39] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:40:40] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1019.eqiad.wmnet
[15:41:20] !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1020.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001"
[15:41:20] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:41:21] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudvirt1020.eqiad.wmnet
[15:41:37] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki
[15:41:38] (03CR) 10Hashar: [C: 04-1] "The Puppet compiler does not have access to the CI agents on WMCS. I will look at deploying the chain of patches on the CI Puppet master a" [puppet] - 10https://gerrit.wikimedia.org/r/914710 (owner: 10Hashar)
[15:41:54] (03PS5) 10Jbond: Gitlab: Support OIDC alongside CAS for OmniAuth in Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390)
[15:42:13] (03PS1) 10Arturo Borrero Gonzalez: cloudlb: update BGP anycast-healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/916519 (https://phabricator.wikimedia.org/T324992)
[15:42:38] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1020.eqiad.wmnet
[15:43:13] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/41072/console" [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond)
[15:44:52] (03CR) 10Cathal Mooney: [C: 03+1] "This will always be true as the VIP is on the local loopback IP. But it will get the BGP announcement going. In terms of longer term ope" [puppet] - 10https://gerrit.wikimedia.org/r/916519 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[15:45:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudlb: update BGP anycast-healthcheck [puppet] - 10https://gerrit.wikimedia.org/r/916519 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[15:46:25] (03CR) 10CI reject: [V: 04-1] Gitlab: Support OIDC alongside CAS for OmniAuth in Gitlab [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond)
[15:46:44] RECOVERY - Bird Internet Routing Daemon on cloudlb2001-dev is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[15:46:46] RECOVERY - Check systemd state on cloudlb2001-dev is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:47:18] RECOVERY - BGP status on cloudsw1-b1-codfw.mgmt is OK: BGP OK - up: 7, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[15:47:20] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:48:19] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "noop, just moves keys around. https://puppet-compiler.wmflabs.org/output/911920/41069/" [puppet] - 10https://gerrit.wikimedia.org/r/911920 (owner: 10Dzahn)
[15:48:26] !log btullis@cumin1001 END (PASS) - Cookbook sre.presto.reboot-workers (exit_code=0) for Presto analytics cluster: Reboot Presto nodes
[15:50:22] !log andrew@cumin1001 START - Cookbook sre.dns.netbox
[15:50:37] (03PS1) 10Jbond: gitlab: sync all configured providers [puppet] - 10https://gerrit.wikimedia.org/r/916522 (https://phabricator.wikimedia.org/T320390)
[15:51:33] (03CR) 10Dzahn: [V: 03+1 C: 03+2] gerrit: move hieradata from role/common to common/profile [puppet] - 10https://gerrit.wikimedia.org/r/911920 (owner: 10Dzahn)
[15:51:36] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:51:37] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.decommission (exit_code=1) for hosts cloudvirt1020.eqiad.wmnet
[15:52:07] (03CR) 10Jbond: [V: 03+1] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/916509 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond)
[15:53:02] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "deployed first on non-prod machine with puppet disabled on prod machine, confirming noop, re-enabling puppet on prod" [puppet] - 10https://gerrit.wikimedia.org/r/911920 (owner: 10Dzahn)
[15:54:13] 10ops-eqiad, 10cloud-services-team, 10decommission-hardware, 10Patch-For-Review: decommission cloudvirt1019 and cloudvirt1020 - https://phabricator.wikimedia.org/T336063 (10Andrew) a:05Andrew→03Jclark-ctr
[15:55:15] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "noop confirmed everywhere in prod. (cloud currently doesn't have a running instance). nothing changed." [puppet] - 10https://gerrit.wikimedia.org/r/911920 (owner: 10Dzahn)
[15:58:10] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:58:24] (03CR) 10Jbond: [V: 03+1] gitlab: refactor omniauth providers to a data structure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/916489 (https://phabricator.wikimedia.org/T320390) (owner: 10Jbond)
[15:59:41] (03PS1) 10Andrew Bogott: remove references to cloudvirt1023 and cloudvirt1024 [puppet] - 10https://gerrit.wikimedia.org/r/916523 (https://phabricator.wikimedia.org/T336064)
[15:59:53] (03CR) 10CI reject: [V: 04-1] remove references to cloudvirt1023 and cloudvirt1024 [puppet] - 10https://gerrit.wikimedia.org/r/916523 (https://phabricator.wikimedia.org/T336064) (owner: 10Andrew Bogott)
[16:00:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye
[16:00:23] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye
[16:00:24] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye
[16:00:33] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed...
[16:03:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [16:03:34] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [16:03:34] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye [16:03:43] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed... [16:06:28] !log btullis@cumin1001 Added views for new wiki: zhwiki T334041 [16:06:28] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [16:06:32] T334041: Run maintain-views on zhwiki, newiki - https://phabricator.wikimedia.org/T334041 [16:07:35] (03PS2) 10Andrew Bogott: remove references to cloudvirt1023 and cloudvirt1024 [puppet] - 10https://gerrit.wikimedia.org/r/916523 (https://phabricator.wikimedia.org/T336064) [16:08:12] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1023.eqiad.wmnet [16:08:31] (03CR) 10Andrew Bogott: [C: 03+2] remove references to cloudvirt1023 and cloudvirt1024 [puppet] - 10https://gerrit.wikimedia.org/r/916523 (https://phabricator.wikimedia.org/T336064) (owner: 10Andrew Bogott) [16:10:09] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [16:10:19] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye [16:10:19] 10SRE, 10ops-codfw, 10DC-Ops, 10Traffic, 10Patch-For-Review: 
Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [16:10:28] SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed... [16:10:35] !log andrew@cumin1001 START - Cookbook sre.hosts.decommission for hosts cloudvirt1024.eqiad.wmnet [16:10:59] !log btullis@cumin1001 START - Cookbook sre.wikireplicas.add-wiki [16:13:37] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [16:13:47] SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [16:15:32] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [16:16:47] !log andrew@cumin1001 START - Cookbook sre.dns.netbox [16:17:57] !log andrew@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1023.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [16:17:57] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:17:58] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1024.eqiad.wmnet [16:18:02] ops-eqiad, cloud-services-team, decommission-hardware: decommission cloudvirt1019 and cloudvirt1020 - https://phabricator.wikimedia.org/T336063 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by 
andrew@cumin1001 for hosts: `cloudvirt1024.eqiad.wmnet` - cloudvirt1024.eqiad.wmnet (**WARN... [16:18:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (2) wcqs2002:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [16:20:02] !log andrew@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: cloudvirt1023.eqiad.wmnet decommissioned, removing all IPs except the asset tag one - andrew@cumin1001" [16:20:02] !log andrew@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:20:02] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts cloudvirt1023.eqiad.wmnet [16:20:06] ops-eqiad, cloud-services-team, decommission-hardware: decommission cloudvirt1019 and cloudvirt1020 - https://phabricator.wikimedia.org/T336063 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by andrew@cumin1001 for hosts: `cloudvirt1023.eqiad.wmnet` - cloudvirt1023.eqiad.wmnet (**WARN... 
[16:20:38] ops-eqiad, cloud-services-team, decommission-hardware: decommission cloudvirt1023 and cloudvirt1024 - https://phabricator.wikimedia.org/T336064 (Andrew) a:Andrew→Jclark-ctr [16:21:16] (PS1) Jbond: idp: use unique id [puppet] - https://gerrit.wikimedia.org/r/916525 [16:24:36] (CR) Jbond: [C: +2] idp: add gitlab oidc entry (2 comments) [puppet] - https://gerrit.wikimedia.org/r/916455 (https://phabricator.wikimedia.org/T320390) (owner: Jbond) [16:24:42] (CR) Jbond: [C: +2] idp: use unique id [puppet] - https://gerrit.wikimedia.org/r/916525 (owner: Jbond) [16:25:56] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:28:19] !log pt1979@cumin2002 END (ERROR) - Cookbook sre.hosts.reimage (exit_code=97) for host lvs2011.codfw.wmnet with OS bullseye [16:28:29] SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed... 
[16:28:58] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:29:25] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [16:29:40] SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [16:35:51] !log btullis@cumin1001 Added views for new wiki: newiki T334041 [16:35:51] !log btullis@cumin1001 END (PASS) - Cookbook sre.wikireplicas.add-wiki (exit_code=0) [16:35:55] T334041: Run maintain-views on zhwiki, newiki - https://phabricator.wikimedia.org/T334041 [16:37:21] (CR) Dzahn: "hold with the reviews until Monday night or so. WIP, thanks" [puppet] - https://gerrit.wikimedia.org/r/915737 (owner: Dzahn) [16:38:31] (CR) Dzahn: "@Jelto, ancient change I had sitting since over a year ago when I wanted to move 15.wp.org to k8s.. then the tests need to move.. so this " [puppet] - https://gerrit.wikimedia.org/r/761063 (owner: Dzahn) [16:39:16] (CR) Dzahn: "same for this, it's been sitting in gerrit since 2022 but actually does make sense still for now" [puppet] - https://gerrit.wikimedia.org/r/761062 (owner: Dzahn) [16:40:00] (CR) Dzahn: "This one probably outdated since now we want to move both together? But worth chatting about on Monday." 
[puppet] - https://gerrit.wikimedia.org/r/761060 (owner: Dzahn) [16:40:25] PROBLEM - Host mw2363 is DOWN: PING CRITICAL - Packet loss = 100% [16:40:29] (CR) Dzahn: "finally this for the ingress and certs" [deployment-charts] - https://gerrit.wikimedia.org/r/766875 (https://phabricator.wikimedia.org/T300171) (owner: Dzahn) [16:41:17] (CR) Dzahn: "I imagine we also merge this on Monday in meeting." [puppet] - https://gerrit.wikimedia.org/r/914881 (owner: Dzahn) [16:42:05] RECOVERY - Host mw2363 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [16:42:21] (CR) Dzahn: "let's have a chat some time if this still makes sense to do in the future" [puppet] - https://gerrit.wikimedia.org/r/616917 (https://phabricator.wikimedia.org/T238593) (owner: Dzahn) [16:44:16] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2011.codfw.wmnet with reason: host reimage [16:44:46] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [16:44:56] SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [16:45:25] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye [16:45:35] SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed... 
[16:45:40] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [16:45:50] SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [16:46:21] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:46:47] (CR) Dzahn: [C: +1] miscweb: add annualreport release to miscweb (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/915673 (https://phabricator.wikimedia.org/T300171) (owner: Jelto) [16:47:28] PROBLEM - Host logstash2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:48:06] RECOVERY - Host logstash2002 is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms [16:49:01] PROBLEM - Host db2165 #page is DOWN: PING CRITICAL - Packet loss = 100% [16:50:12] PROBLEM - MariaDB Replica IO: s8 on db2152 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2165.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2165.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:50:26] PROBLEM - MariaDB Replica IO: s8 on db2181 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2165.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2165.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:50:36] PROBLEM - MariaDB Replica IO: s8 on db2167 is CRITICAL: CRITICAL slave_io_state 
Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2165.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2165.codfw.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:50:44] PROBLEM - MariaDB Replica IO: s8 on db2161 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2165.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2165.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:50:46] something up with rack C5? [16:50:59] RECOVERY - Host db2165 #page is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [16:51:04] PROBLEM - MariaDB Replica IO: s8 on db2166 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2165.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2165.codfw.wmnet (113 No route to host) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:51:08] PROBLEM - MariaDB Replica IO: s8 on db2154 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2165.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2165.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:51:25] cwhite: see -sre [16:52:02] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [16:52:11] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye [16:52:12] SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: 
Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [16:52:23] SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed... [16:52:39] (PS1) Arturo Borrero Gonzalez: cloud_private_subnet: add support for VRF if using BGP [puppet] - https://gerrit.wikimedia.org/r/916528 (https://phabricator.wikimedia.org/T336071) [16:52:54] PROBLEM - MariaDB Replica IO: s8 on db2100 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2165.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2165.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:53:14] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [16:53:20] RECOVERY - MariaDB Replica IO: s8 on db2154 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:53:24] SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [16:53:28] RECOVERY - MariaDB Replica IO: s8 on db2152 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:53:42] RECOVERY - MariaDB Replica IO: s8 on 
db2181 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:53:54] RECOVERY - MariaDB Replica IO: s8 on db2167 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:54:02] RECOVERY - MariaDB Replica IO: s8 on db2100 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:54:04] RECOVERY - MariaDB Replica IO: s8 on db2161 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:54:26] RECOVERY - MariaDB Replica IO: s8 on db2166 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:54:53] need help? [16:55:46] (PS2) Arturo Borrero Gonzalez: cloud_private_subnet: add support for VRF if using BGP [puppet] - https://gerrit.wikimedia.org/r/916528 (https://phabricator.wikimedia.org/T336071) [16:55:48] db2165 should not be set in read-write anyway, as it is codfw [16:56:31] jynus: co-ordination is in -sre [16:58:48] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:03:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:06:15] (PS1) Gerrit maintenance bot: mariadb: Promote db2161 to s8 master [puppet] - https://gerrit.wikimedia.org/r/915800 (https://phabricator.wikimedia.org/T336075) [17:07:10] PROBLEM - MegaRAID on an-worker1088 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy 
WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:08:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:08:59] !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host lvs2011.codfw.wmnet with OS bullseye [17:09:19] SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye executed... 
[17:09:30] !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host lvs2011.codfw.wmnet with OS bullseye [17:09:40] SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye [17:15:13] ACKNOWLEDGEMENT - MegaRAID on an-worker1088 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis T336077 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [17:16:14] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:24:12] (PS1) Umherirrender: Revert "api: Use RevisionStore::newRevisionsFromBatch to fetch revision records" [core] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/915719 (https://phabricator.wikimedia.org/T336008) [17:24:20] !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on lvs2011.codfw.wmnet with reason: host reimage [17:26:14] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:27:17] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 1.028 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [17:27:31] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on lvs2011.codfw.wmnet with reason: host reimage [17:28:40] (CR) Ottomata: [C: +1] "Can deploy this for you Monday :)" [mediawiki-config] - 
https://gerrit.wikimedia.org/r/915789 (https://phabricator.wikimedia.org/T328899) (owner: AikoChou) [17:28:41] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:28:54] (CR) Ottomata: [C: +1] "Or you can schedule it for deployment during a backport window https://wikitech.wikimedia.org/wiki/Backport_windows#How_to_submit_a_patch_" [mediawiki-config] - https://gerrit.wikimedia.org/r/915789 (https://phabricator.wikimedia.org/T328899) (owner: AikoChou) [17:42:30] (CR) CI reject: [V: -1] Revert "api: Use RevisionStore::newRevisionsFromBatch to fetch revision records" [core] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/915719 (https://phabricator.wikimedia.org/T336008) (owner: Umherirrender) [17:42:55] !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host lvs2011.codfw.wmnet with OS bullseye [17:43:05] SRE, ops-codfw, DC-Ops, Traffic, Patch-For-Review: Q4:rack/setup/install lvs2011, lvs2012, lvs2013, lvs2014 - https://phabricator.wikimedia.org/T326767 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host lvs2011.codfw.wmnet with OS bullseye completed... 
[17:43:31] (CR) Umherirrender: "recheck" [core] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/915719 (https://phabricator.wikimedia.org/T336008) (owner: Umherirrender) [17:44:24] (CR) Ahmon Dancy: wmf-update-known-hosts-production: Automatically download DNS (1 comment) [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/893708 (owner: Lucas Werkmeister (WMDE)) [17:45:59] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:55:45] SRE, ops-eqiad, decommission-hardware: decommission db1114.eqiad.wmnet - https://phabricator.wikimedia.org/T335837 (Jclark-ctr) [17:56:24] SRE, ops-eqiad, decommission-hardware: decommission db1114.eqiad.wmnet - https://phabricator.wikimedia.org/T335837 (Jclark-ctr) Open→Resolved Removed server and used netbox offline script [17:57:08] SRE, ops-eqiad, decommission-hardware: decommission db1114.eqiad.wmnet - https://phabricator.wikimedia.org/T335837 (Jclark-ctr) [17:57:45] SRE, ops-eqiad, decommission-hardware: decommission db1114.eqiad.wmnet - https://phabricator.wikimedia.org/T335837 (Jclark-ctr) Resolved→Open commented on wrong ticket this has not been removed from rack of offline script done [17:57:59] PROBLEM - Check systemd state on thumbor1006 is CRITICAL: CRITICAL - degraded: The following units failed: puppet-agent-timer.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:58:37] SRE, API Platform, Anti-Harassment, Cloud-Services, and 19 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (SDunlap) [17:58:49] SRE, ops-eqiad, cloud-services-team, decommission-hardware: decommission cloudvirt1019 and cloudvirt1020 - https://phabricator.wikimedia.org/T336063 (Jclark-ctr) [17:59:07] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system 
is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [17:59:20] SRE, ops-eqiad, cloud-services-team, decommission-hardware: decommission cloudvirt1019 and cloudvirt1020 - https://phabricator.wikimedia.org/T336063 (Jclark-ctr) Open→Resolved Removed server and used netbox offline script [17:59:23] SRE, ops-eqiad, decommission-hardware: decommission db1111.eqiad.wmnet - https://phabricator.wikimedia.org/T335836 (Jclark-ctr) [17:59:54] SRE, ops-eqiad, decommission-hardware: decommission db1111.eqiad.wmnet - https://phabricator.wikimedia.org/T335836 (Jclark-ctr) Open→Resolved Removed server and used netbox offline script [18:00:28] SRE, ops-eqiad, decommission-hardware: decommission db1117.eqiad.wmnet - https://phabricator.wikimedia.org/T335017 (Jclark-ctr) Removed server and used netbox offline script [18:00:43] SRE, ops-eqiad, decommission-hardware: decommission db1117.eqiad.wmnet - https://phabricator.wikimedia.org/T335017 (Jclark-ctr) [18:00:52] SRE, ops-eqiad, decommission-hardware: decommission db1117.eqiad.wmnet - https://phabricator.wikimedia.org/T335017 (Jclark-ctr) Open→Resolved [18:01:28] SRE, ops-eqiad, cloud-services-team, decommission-hardware: decommission cloudvirt1023 and cloudvirt1024 - https://phabricator.wikimedia.org/T336064 (Jclark-ctr) [18:01:55] SRE, ops-eqiad, cloud-services-team, decommission-hardware: decommission cloudvirt1023 and cloudvirt1024 - https://phabricator.wikimedia.org/T336064 (Jclark-ctr) Open→Resolved Removed server and used netbox offline script [18:08:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [18:16:19] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:18:17] RECOVERY - Check systemd state on thumbor1006 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:24:15] (CR) Brennen Bearnes: [C: +2] "Let's give this a shot on group0,group1. If it drops memcached stuff to normal levels, we can roll forward." [core] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/915719 (https://phabricator.wikimedia.org/T336008) (owner: Umherirrender) [18:25:20] !log train 1.41.0-wmf.7 (T330213): trying revert for T336008, T336022 [18:25:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:26] T330213: 1.41.0-wmf.7 deployment blockers - https://phabricator.wikimedia.org/T330213 [18:25:27] T336022: 1.41.0-wmf.7 increases Memcached call rate by +100% - https://phabricator.wikimedia.org/T336022 [18:25:27] T336008: MWException: Internal error in ApiQueryRevisionsBase::getRevisionRecords: RevisionStore does not return record for [n] - https://phabricator.wikimedia.org/T336008 [18:26:22] (CR) TrainBranchBot: [C: +2] "Approved by brennen@deploy1002 using scap backport" [core] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/915719 (https://phabricator.wikimedia.org/T336008) (owner: Umherirrender) [18:26:36] (forgot to just use scap backport for this.) [18:27:02] this is the way. [18:28:15] I'm around btw, I'm planning to debug it ASAP [18:28:24] <3 [18:28:47] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:30:04] we figured we'd go to mwdebug and see if it fixes the api issue, assume it will. seems like if this is the culprit for the memcached traffic we'll need to roll it out to check. [18:30:58] well. checking that a handful of the revisions api calls don't explode on mwdebug anyway. 
[18:34:50] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [18:36:02] ops-codfw: PowerSupplyFailure - https://phabricator.wikimedia.org/T336082 (phaultfinder) [18:36:19] SRE, Gerrit, Release-Engineering-Team, serviceops-collab: gerrit1003 service implementation task - https://phabricator.wikimedia.org/T326368 (Dzahn) [18:40:59] RECOVERY - MegaRAID on an-worker1088 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [18:41:50] (Merged) jenkins-bot: Revert "api: Use RevisionStore::newRevisionsFromBatch to fetch revision records" [core] (wmf/1.41.0-wmf.7) - https://gerrit.wikimedia.org/r/915719 (https://phabricator.wikimedia.org/T336008) (owner: Umherirrender) [18:42:44] !log brennen@deploy1002 Started scap: Backport for [[gerrit:915719|Revert "api: Use RevisionStore::newRevisionsFromBatch to fetch revision records" (T336008 T336022)]] [18:42:49] T336022: 1.41.0-wmf.7 increases Memcached call rate by +100% - https://phabricator.wikimedia.org/T336022 [18:42:49] T336008: MWException: Internal error in ApiQueryRevisionsBase::getRevisionRecords: RevisionStore does not return record for [n] - https://phabricator.wikimedia.org/T336008 [18:44:17] !log brennen@deploy1002 umherirrender and brennen: Backport for [[gerrit:915719|Revert "api: Use RevisionStore::newRevisionsFromBatch to fetch revision records" (T336008 T336022)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [18:47:35] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:49:27] RECOVERY - Check systemd state on cloudbackup2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:51:09] looks like it's fine checking a handful of api requests? Let's see what happens... [18:57:05] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:915719|Revert "api: Use RevisionStore::newRevisionsFromBatch to fetch revision records" (T336008 T336022)]] (duration: 14m 21s) [18:57:10] T336022: 1.41.0-wmf.7 increases Memcached call rate by +100% - https://phabricator.wikimedia.org/T336022 [18:57:10] T336008: MWException: Internal error in ApiQueryRevisionsBase::getRevisionRecords: RevisionStore does not return record for [n] - https://phabricator.wikimedia.org/T336008 [18:57:34] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:58:27] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:02:34] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (PUT deployments) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [19:04:50] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [19:04:58] 
(RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:08:28] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (6) wdqs2004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:08:31] (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [19:08:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wcqs2001:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [19:11:43] (PS4) Eevans: (WIP) cassandra: add support for version 4.1.1 [puppet] - https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [19:12:13] (CR) CI reject: [V: -1] (WIP) cassandra: add support for version 4.1.1 [puppet] - https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: Eevans) [19:12:38] (CR) Eevans: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: Eevans) [19:16:57] PROBLEM - Check systemd state on 
db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:18:07] (03PS3) 10CDanis: add tunnelencabulator [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/902826 (https://phabricator.wikimedia.org/T266784) [19:18:52] (03CR) 10CDanis: "thanks for the review!" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/902826 (https://phabricator.wikimedia.org/T266784) (owner: 10CDanis) [19:27:45] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:30:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [19:35:21] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v2/translate/{from}/{to} (Machine translate an HTML fragment using TestClient, adapt the links to target language wiki.) 
timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [19:35:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [19:35:54] (03PS5) 10Eevans: (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) [19:36:27] (03CR) 10CI reject: [V: 04-1] (WIP) cassandra: add support for version 4.1.1 [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [19:36:44] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/913265 (https://phabricator.wikimedia.org/T313814) (owner: 10Eevans) [19:38:25] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [19:39:07] (03PS1) 10Andrew Bogott: graphite archive-instances.py: use mwopenstackclient for openstack access [puppet] - 10https://gerrit.wikimedia.org/r/916586 (https://phabricator.wikimedia.org/T330759) [19:39:09] (03PS1) 10Andrew Bogott: nfs-exportd service: subscribe to clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/916587 (https://phabricator.wikimedia.org/T330759) [19:39:11] (03PS1) 10Andrew Bogott: grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) [19:39:13] (03PS1) 10Andrew Bogott: toolforge k8s: permit access to clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/916589 (https://phabricator.wikimedia.org/T330759) [19:39:15] (03PS1) 10Andrew Bogott: wmcs prometheus: include 'OPENSTACK->CLOUD' in prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/916590 
(https://phabricator.wikimedia.org/T330759) [19:39:51] (03CR) 10CI reject: [V: 04-1] grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [19:40:41] PROBLEM - Host mw2448 is DOWN: PING CRITICAL - Packet loss = 100% [19:44:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:46:47] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:47:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [19:49:16] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - api_appserver - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [19:52:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [19:59:19] RECOVERY - Check systemd state on db2180 is OK: 
OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:23] (03Abandoned) 10Ladsgroup: mariadb: Promote db2161 to s8 master [puppet] - 10https://gerrit.wikimedia.org/r/915800 (https://phabricator.wikimedia.org/T336075) (owner: 10Gerrit maintenance bot) [20:16:31] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:23:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [20:28:59] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:33:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [20:45:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [20:46:07] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[20:50:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [20:58:39] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:58:58] (RdfStreamingUpdaterHighConsumerUpdateLag) resolved: (3) wcqs2001:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [21:03:05] (03PS1) 10Dwisehaupt: Add monitoring for new fr-tech hosts [puppet] - 10https://gerrit.wikimedia.org/r/916617 (https://phabricator.wikimedia.org/T334505) [21:03:40] (03CR) 10Dwisehaupt: "For when we are ready to enable monitoring next week." [puppet] - 10https://gerrit.wikimedia.org/r/916617 (https://phabricator.wikimedia.org/T334505) (owner: 10Dwisehaupt) [21:03:42] (03CR) 10CI reject: [V: 04-1] Add monitoring for new fr-tech hosts [puppet] - 10https://gerrit.wikimedia.org/r/916617 (https://phabricator.wikimedia.org/T334505) (owner: 10Dwisehaupt) [21:04:29] 10SRE, 10RESTBase, 10RESTBase-API, 10Traffic: REST API is not invalidating caches after template and/or module changes - https://phabricator.wikimedia.org/T335770 (10Bawolff) >>! In T335770#8825240, @Brycehughes wrote: > @akosiaris if I curl it from San Francisco (VPN) I don't see it. If I curl it from Bel... 
[21:07:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [21:17:35] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:25:54] 10SRE, 10RESTBase, 10RESTBase-API, 10Traffic: REST API is not invalidating caches after template and/or module changes - https://phabricator.wikimedia.org/T335770 (10Brycehughes) @Bawolff IIRC, those are sent to standard error. Sorry, should have redirected to stdout. We may be stuck for now, since I seem... [21:28:47] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:05] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:46:54] 10SRE, 10RESTBase, 10RESTBase-API, 10Traffic: REST API is not invalidating caches after template and/or module changes - https://phabricator.wikimedia.org/T335770 (10Platonides) I also hope it's not something doing MITM into the https connection without you being aware. It seems unlikely, though, and listi... 
[21:50:40] (03PS1) 10Krinkle: ResourceLoader: Log when MAXAGE_RECOVER is detected [core] (wmf/1.41.0-wmf.7) - 10https://gerrit.wikimedia.org/r/915720 (https://phabricator.wikimedia.org/T321394) [21:56:39] 10SRE-swift-storage, 10MediaWiki-File-management, 10MediaWiki-Page-rename: Pagemove broke: file neither in old nor new location - https://phabricator.wikimedia.org/T336086 (10RhinosF1) [21:57:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [21:58:31] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:08:31] (Access port speed <= 100Mbps) firing: Alert for device asw2-b-eqiad.mgmt.eqiad.wmnet - Access port speed <= 100Mbps - https://alerts.wikimedia.org/?q=alertname%3DAccess+port+speed+%3C%3D+100Mbps [22:17:33] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:18:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [22:28:33] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:33:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - 
https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [22:45:41] (03PS1) 10Dzahn: lower TTL for gerrit.wikimedia.org reverse lookups [dns] - 10https://gerrit.wikimedia.org/r/916637 (https://phabricator.wikimedia.org/T334524) [22:45:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [22:47:31] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:50:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [22:54:06] (03PS1) 10Dzahn: gerrit: switch service name, turn new into current and current into old [dns] - 10https://gerrit.wikimedia.org/r/916639 (https://phabricator.wikimedia.org/T334524) [22:55:16] (03CR) 10Dzahn: "I think we can do it this way so that you have old gerrit accessible under gerrit-old for a short grace period. 
Since on the netbox side t" [dns] - 10https://gerrit.wikimedia.org/r/916639 (https://phabricator.wikimedia.org/T334524) (owner: 10Dzahn) [22:55:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [22:55:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [22:56:36] (03PS2) 10Dzahn: gerrit: switch service name, turn new into current and current into old [dns] - 10https://gerrit.wikimedia.org/r/916639 (https://phabricator.wikimedia.org/T334524) [22:59:22] (03PS3) 10Dzahn: gerrit: switch service name, turn new into current and current into old [dns] - 10https://gerrit.wikimedia.org/r/916639 (https://phabricator.wikimedia.org/T326368) [22:59:28] (03PS2) 10Dzahn: lower TTL for gerrit.wikimedia.org reverse lookups [dns] - 10https://gerrit.wikimedia.org/r/916637 (https://phabricator.wikimedia.org/T326368) [23:00:03] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:01:51] (03PS2) 10Andrew Bogott: nfs-exportd service: subscribe to clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/916587 (https://phabricator.wikimedia.org/T330759) [23:01:53] (03PS2) 10Andrew Bogott: grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) [23:01:55] (03PS2) 10Andrew Bogott: toolforge k8s: 
permit access to clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/916589 (https://phabricator.wikimedia.org/T330759) [23:01:57] (03PS2) 10Andrew Bogott: wmcs prometheus: include 'OPENSTACK->CLOUD' in prometheus config [puppet] - 10https://gerrit.wikimedia.org/r/916590 (https://phabricator.wikimedia.org/T330759) [23:02:28] (03CR) 10CI reject: [V: 04-1] grid_configurator: use mwopenstackclients library [puppet] - 10https://gerrit.wikimedia.org/r/916588 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [23:02:50] (03CR) 10Andrew Bogott: [C: 03+2] nfs-exportd service: subscribe to clouds.yaml [puppet] - 10https://gerrit.wikimedia.org/r/916587 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [23:03:22] (03CR) 10Andrew Bogott: [C: 03+2] graphite archive-instances.py: use mwopenstackclient for openstack access [puppet] - 10https://gerrit.wikimedia.org/r/916586 (https://phabricator.wikimedia.org/T330759) (owner: 10Andrew Bogott) [23:04:58] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (3) wdqs2005:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [23:08:28] (RdfStreamingUpdaterHighConsumerUpdateLag) firing: (6) wdqs2004:9101 has fallen behind applying updates from the RDF Streaming Updater - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterHighConsumerUpdateLag [23:08:46] (NodeTextfileStale) firing: Stale textfile for lists1003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - 
https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale [23:11:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [23:14:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [23:16:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [23:17:13] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:24:03] !log removing emails from 230 users per self-requests [23:24:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:24:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [23:28:09] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:31:47] (03CR) 10Thcipriani: [C: 03+1] lower TTL for gerrit.wikimedia.org reverse lookups [dns] - 10https://gerrit.wikimedia.org/r/916637 (https://phabricator.wikimedia.org/T326368) (owner: 10Dzahn) [23:36:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [23:41:49] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [23:42:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [23:47:03] PROBLEM - Check systemd state on db2180 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:47:49] (RdfStreamingUpdaterFlinkJobUnstable) resolved: WDQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [23:57:13] 10SRE, 10ops-eqiad, 10DBA: db1170 is not coming back online - 
https://phabricator.wikimedia.org/T336033 (10Peachey88) [23:57:59] RECOVERY - Check systemd state on db2180 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state