[00:00:20] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:14] PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:05:38] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:40] (03CR) 10Cwhite: rsyslog: allow specifying a hiera-defined certfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [00:13:12] (03PS2) 10Andrea Denisse: librenms: Increase the TTL for LibreNMS [dns] - 10https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) [00:14:36] (03CR) 10Andrea Denisse: librenms: Increase the TTL for LibreNMS (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [00:17:53] (03CR) 10Cwhite: [C: 03+1] "Change LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/865731 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan) [00:18:27] (03CR) 10Cwhite: [C: 03+1] "Thanks!" 
[dns] - 10https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [00:30:17] !log eevans@cumin1001 START - Cookbook sre.hosts.decommission for hosts restbase-dev2002.codfw.wmnet [00:30:56] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [00:32:45] (03CR) 10Zabe: librenms: Increase the TTL for LibreNMS (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [00:34:52] !log eevans@cumin1001 START - Cookbook sre.dns.netbox [00:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [00:37:02] !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase-dev2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001" [00:38:29] !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase-dev2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001" [00:38:29] !log eevans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:38:29] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase-dev2002.codfw.wmnet [00:39:06] !log eevans@cumin1001 START - Cookbook sre.hosts.decommission for hosts restbase-dev2003.codfw.wmnet [00:43:32] !log eevans@cumin1001 START - Cookbook sre.dns.netbox [00:45:32] !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase-dev2003.codfw.wmnet decommissioned, removing all IPs except 
the asset tag one - eevans@cumin1001" [00:46:54] !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase-dev2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001" [00:46:54] !log eevans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [00:46:55] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase-dev2003.codfw.wmnet [00:54:18] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [00:56:10] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [01:01:50] !log eevans@cumin1001 START - Cookbook sre.dns.netbox [01:04:13] !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename restbase-dev200x hosts to cassandra-dev200x - eevans@cumin1001" [01:05:17] !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename restbase-dev200x hosts to cassandra-dev200x - eevans@cumin1001" [01:05:17] !log eevans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [01:05:47] !log eevans@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cassandra-dev2002 [01:06:22] !log eevans@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cassandra-dev2002 [01:06:27] !log eevans@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cassandra-dev2003 [01:07:02] !log eevans@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cassandra-dev2003 [01:11:30] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host cassandra-dev2002.codfw.wmnet with OS buster [01:30:33] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage [01:33:37] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage [01:41:46] (JobUnavailable) firing: (9) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - 
https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:47:31] !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001" [01:48:46] !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001" [01:48:47] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2002.codfw.wmnet with OS buster [01:49:37] !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host cassandra-dev2003.codfw.wmnet with OS buster [01:51:46] (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:56:46] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:04:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [02:05:40] I fired the klaxon for T324801 [02:05:42] T324801: REST API serving content of current revision for old revisions - https://phabricator.wikimedia.org/T324801 [02:06:46] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:20] !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2003.codfw.wmnet with reason: host reimage
[02:08:56] legoktm: hello
[02:09:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:09:13] hi! long-time no outage ;)
[02:09:30] so, restbase is serving the current revision as old revisions
[02:09:57] I don't know if it's a MW change or restbase, TheresNoTime is poking at it, and I was talking to Arlo and Subbu out of band
[02:10:15] noting that there were some MW core rest API changes this week: https://github.com/wikimedia/mediawiki/commits/master/includes/Rest
[02:10:48] my money is starting to be on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/864138 fwiw
[02:10:56] I'm out of the loop on who should be responding to this / what the remediation should be
[02:11:17] TheresNoTime: that's not deployed AFAIS?
[02:11:23] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2003.codfw.wmnet with reason: host reimage
[02:11:25] or are you saying the lack of it is the issue?
[02:11:56] nope, ignore me
[02:12:53] ok, cscott is also suggesting a train rollback: https://phabricator.wikimedia.org/T324801#8455865
[02:13:22] * cwhite looks for rollback instructions
[02:15:38] cwhite: https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Rollback
[02:17:24] ok, here we go
[02:20:43] I'm peeking at deploy1002, neat, had missed that the container build process is integrated with scap now
[02:21:29] hehe yeah that's the wait.
[02:21:46] (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:53] I'm not sure "To rollback a wikiversion change, it should be pretty quick." is true any more. This scap triggers my "that command is taking too long" sense.
[02:23:06] o/
[02:23:35] I wonder if it has to rebuild the l10n cache since we're adding an old MW version back in
[02:23:43] !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001"
[02:27:04] !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001"
[02:27:05] !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2003.codfw.wmnet with OS buster
[02:28:29] well, it's at the image push step
[02:29:14] on docker_pull_k8s now
[02:29:45] my theory about rebuilding the l10n cache is probably wrong, this image is roughly the same size as the older ones
[02:30:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:30:06] note to self: ask someone if 10 minute `scap sync-wikiversions` is normal
[02:31:05] I will take notes as not-really-IC :p
[02:32:18] 50% complete
[02:33:45] cwhite: 10m is quite high for that stage..
[02:34:21] Can you hold off for a bit before you actually roll back the train?
[02:34:31] We are discussing if there is something on the train that cannot be rolled back.
[02:34:36] subbu: rollback is already in flight
[02:34:40] i see.
[02:35:03] :v
[02:35:03] (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:37:23] well I've broken my local restbase so (:
[02:39:20] !log cwhite@deploy1002 rebuilt and synchronized wikiversions files: Revert "group2 wikis to 1.40.0-wmf.13"
[02:39:29] if VE editing breaks with rollback, we'll have to roll forward the train again (and fix the REST API issue or roll back a specific patch causing the issue).
[02:39:49] rollback complete
[02:40:37] ok .. will test now.
[02:40:54] looks like the right revision is being served now
[02:41:31] (on group 2 wikis)
[02:41:55] seems like editing is not broken.
[02:42:21] but, will test a bit more.
[02:43:16] hmm, I can't git push origin from deploy1002
[02:46:08] (PS1) Cwhite: Revert "group2 wikis to 1.40.0-wmf.13" [mediawiki-config] - https://gerrit.wikimedia.org/r/866525 (https://phabricator.wikimedia.org/T324801)
[02:46:10] (CR) Cwhite: [C: +2] Revert "group2 wikis to 1.40.0-wmf.13" [mediawiki-config] - https://gerrit.wikimedia.org/r/866525 (https://phabricator.wikimedia.org/T324801) (owner: Cwhite)
[02:46:17] there we go
[02:46:46] (Merged) jenkins-bot: Revert "group2 wikis to 1.40.0-wmf.13" [mediawiki-config] - https://gerrit.wikimedia.org/r/866525 (https://phabricator.wikimedia.org/T324801) (owner: Cwhite)
[02:47:19] oh, so, editing seems to be working fine after the rollback on enwiki.
[02:48:11] * cwhite resolves klaxon page
[02:48:32] TheresNoTime, editing the not-current revision in VE is not a common use case .. so, I think we can probably wait to fix this and roll forward again tomorrow.
[02:48:51] Might need daniel to look tomorrow .. but I'll start poking around at the deployed patches in core.
[02:49:32] Ping me here or via klaxon if you need me. I'm going to step away and grab a bite but will stay nearby.
[02:50:11] thanks! Ya, I should go eat my dinner as well ... was at a restaurant and had just ordered food ... got it packed up and came home. :)
[02:50:53] but, i'll hang around in case anyone reports anything else. thanks legoktm for stepping in as well.
[02:58:31] thanks cwhite!
[03:00:17] Thank you, legoktm!
Good to see you and I hope you're doing well :) [03:00:26] :D [03:25:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [03:27:10] (03PS1) 10Andrea Denisse: netmon: Remove the netmon1002 instance as passive node [puppet] - 10https://gerrit.wikimedia.org/r/866526 (https://phabricator.wikimedia.org/T322321) [03:28:31] (03CR) 10Andrea Denisse: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38658/console" [puppet] - 10https://gerrit.wikimedia.org/r/866526 (https://phabricator.wikimedia.org/T322321) (owner: 10Andrea Denisse) [03:30:11] (03CR) 10Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/866526/" [puppet] - 10https://gerrit.wikimedia.org/r/866526 (https://phabricator.wikimedia.org/T322321) (owner: 10Andrea Denisse) [03:46:14] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 106 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:46:36] PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [03:48:06] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:51:39] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: search_eqiad elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [03:51:46] T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776 [03:52:21] !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on 50 hosts with reason: Rolling restart in progress [03:52:54] !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 50 hosts with reason: Rolling restart in progress [04:00:26] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:09:00] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: search_eqiad elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [04:09:06] T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776 [04:24:44] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [04:57:37] I uploaded a fix for the UBN .. 
hopefully daniel can review early tomorrow and test and we can roll the train forward again. [05:00:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:03:21] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: search_eqiad elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [05:03:28] T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776 [05:10:08] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:10:51] !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: search_eqiad elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [05:10:58] T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776 [05:13:33] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: search_eqiad elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [05:21:18] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:24:18] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:28:29] !log 
ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: search_eqiad elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [05:28:36] T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776 [05:41:20] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:52:34] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:01:08] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:18:56] (03PS1) 10Marostegui: db1206: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/866529 [06:20:05] (03CR) 10Marostegui: [C: 03+2] db1206: Enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/866529 (owner: 10Marostegui) [06:20:27] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 1%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42655 and previous config saved to /var/cache/conftool/dbconfig/20221209-062027-root.json [06:22:00] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:25:24] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: 
eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:35:32] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 5%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42656 and previous config saved to /var/cache/conftool/dbconfig/20221209-063532-root.json [06:50:37] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 10%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42657 and previous config saved to /var/cache/conftool/dbconfig/20221209-065037-root.json [06:55:24] !log Deploy schema change on s6 T324797 [06:55:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:55:29] T324797: Add primary key and drop unique index on securepoll_msgs on wmf wikis - https://phabricator.wikimedia.org/T324797 [06:57:16] !log Deploy schema change on s8 T324797 [06:57:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:58:42] !log Deploy schema change on s7 T324797 [06:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:15] !log Deploy schema change on s4 T324797 [07:00:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:05:42] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42658 and previous config saved to /var/cache/conftool/dbconfig/20221209-070542-root.json [07:16:00] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:19:48] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 
handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [07:20:47] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42659 and previous config saved to /var/cache/conftool/dbconfig/20221209-072047-root.json [07:21:20] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [07:21:42] 10SRE, 10Infrastructure-Foundations, 10vm-requests: CODFW: 1 VM requested for test of reimaging cookbook - https://phabricator.wikimedia.org/T324744 (10SLyngshede-WMF) 05Open→03Resolved [07:23:32] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:28:53] !log Deploy schema change on s2 T324797 [07:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:59] T324797: Add primary key and drop unique index on securepoll_msgs on wmf wikis - https://phabricator.wikimedia.org/T324797 [07:29:21] !log dbmaint schema change on s2 T324797 [07:29:22] !log dbmaint schema change on s4 T324797 [07:29:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:24] !log dbmaint 
schema change on s7 T324797 [07:29:26] !log dbmaint schema change on s8 T324797 [07:29:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:29:33] !log dbmaint schema change on s6 T324797 [07:29:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:31:26] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [07:35:02] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [07:35:52] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42660 and previous config saved to /var/cache/conftool/dbconfig/20221209-073552-root.json [07:36:01] !log dbmaint schema change on s1 T324797 [07:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:36:05] T324797: Add primary key and drop unique index on securepoll_msgs on wmf wikis - https://phabricator.wikimedia.org/T324797 [07:36:51] !log dbmaint schema change on s5 T324797 [07:36:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:34] (03PS4) 10Slyngshede: WIP: Signup and LDAP flow. [software/bitu] - 10https://gerrit.wikimedia.org/r/860021 [07:45:58] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:50:57] !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42661 and previous config saved to /var/cache/conftool/dbconfig/20221209-075057-root.json [07:56:46] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221209T0800) [08:00:50] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:02:16] !log dbmaint schema change on s3 T324797 [08:02:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:02:22] T324797: Add primary key and drop unique index on securepoll_msgs on wmf wikis - https://phabricator.wikimedia.org/T324797 [08:05:40] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [08:11:04] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [08:16:32] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [08:24:24] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The 
following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:25:34] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [08:31:24] (03CR) 10Jgiannelos: [C: 03+1] enable migrate namespace function on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866518 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [08:35:49] !log dbmaint schema change on s3@eqiad T324797 [08:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:35:55] T324797: Add primary key and drop unique index on securepoll_msgs on wmf wikis - https://phabricator.wikimedia.org/T324797 [08:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [08:38:46] !log dbmaint schema change on s1@eqiad T324797 [08:38:48] !log dbmaint schema change on s2@eqiad T324797 [08:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:49] !log dbmaint schema change on s4@eqiad T324797 [08:38:51] !log dbmaint schema change on s5@eqiad T324797 [08:38:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:53] !log dbmaint schema change on s6@eqiad T324797 [08:38:54] !log dbmaint schema change on s7@eqiad T324797 [08:38:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:56] !log dbmaint schema change on 
s8@eqiad T324797 [08:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:40:45] (03CR) 10Hashar: Replace CI results table by Gerrit Check API (031 comment) [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/859083 (https://phabricator.wikimedia.org/T214068) (owner: 10Hashar) [08:58:42] (03PS1) 10David Caro: puppetdb: restart through systemd if service dies [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) [08:59:16] (03CR) 10CI reject: [V: 04-1] puppetdb: restart through systemd if service dies [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro) [09:00:32] (03PS2) 10David Caro: puppetdb: restart through systemd if service dies [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) [09:00:36] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:04:27] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38660/console" [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro) [09:07:08] 10SRE, 10Scap: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (10Tgr) >>! In T200690#8265133, @dancy wrote: > @Tgr Can you confirm that this is still a problem? Probably not because these days you'd use `scap backport` which runs git commants as the deploy user, there isn't... 
[09:10:15] (03PS1) 10Muehlenhoff: Make ganeti5006 a ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/866553 (https://phabricator.wikimedia.org/T324610) [09:10:24] (03PS3) 10David Caro: puppetdb: restart through systemd if service dies [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) [09:12:00] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [09:12:52] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [09:13:23] (03PS4) 10David Caro: puppetdb: restart through systemd if service dies [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) [09:16:49] (03PS5) 10David Caro: puppetdb: restart through systemd if service dies [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) [09:18:27] (03CR) 10Hashar: Boilerplate for QUnit testing (032 comments) [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/861486 (owner: 10Hashar) [09:19:03] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38663/console" [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro) [09:19:47] (03CR) 10David Caro: [V: 03+1] "Now it's ready :), pcc looks ok" [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro) [09:20:24] (03CR) 10Hashar: Replace CI results table by Gerrit Check API (031 comment) [software/gerrit] (deploy/wmf/stable-3.5) - 
10https://gerrit.wikimedia.org/r/859083 (https://phabricator.wikimedia.org/T214068) (owner: 10Hashar) [09:21:25] (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti5006 a ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/866553 (https://phabricator.wikimedia.org/T324610) (owner: 10Muehlenhoff) [09:24:07] (03PS17) 10Hashar: Replace CI results table by Gerrit Check API [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/859083 (https://phabricator.wikimedia.org/T214068) [09:24:09] (03PS7) 10Hashar: Add unit testing with QUnit [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/861486 [09:34:04] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2003.codfw.wmnet with OS bullseye [09:34:09] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2003.codfw.wmnet with OS bullseye [09:45:39] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [09:46:41] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [09:47:21] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38664/console" [puppet] - 10https://gerrit.wikimedia.org/r/861463 (https://phabricator.wikimedia.org/T256762) (owner: 10BBlack) [09:51:08] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2003.codfw.wmnet with reason: host reimage [09:53:54] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2003.codfw.wmnet with reason: host reimage [09:57:59] (03CR) 10David Caro: [C: 03+1] webservice cli: allow for deployment of custom harbor 
images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe) [09:59:23] (03CR) 10JMeybohm: [V: 03+1 C: 03+1] "Oops, did not see your PCC comment. LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/861463 (https://phabricator.wikimedia.org/T256762) (owner: 10BBlack) [10:00:03] I will look at moving the train forward at 13:00 UTC [10:05:30] hashar: Will you be moving it forward at 1300 or just looking at it? :p [10:05:58] (I'll be there, all jokes aside) [10:08:50] (03CR) 10Jbond: [C: 04-1] puppetdb: restart through systemd if service dies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro) [10:09:10] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2003.codfw.wmnet with OS bullseye [10:09:13] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2003.codfw.wmnet with OS bullseye completed: - thanos-be2003 (**PASS**) - Downtimed on Icinga/Alertmanager... 
[10:12:28] (03CR) 10Muehlenhoff: puppetdb: restart through systemd if service dies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro) [10:14:57] (03PS1) 10Ladsgroup: Followup to 5cb38845: Don't drop revid info [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866472 (https://phabricator.wikimedia.org/T324801) [10:15:06] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5006.eqsin.wmnet [10:15:07] (03CR) 10David Caro: [V: 03+1] puppetdb: restart through systemd if service dies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro) [10:16:33] (03CR) 10Effie Mouzeli: [C: 03+2] maps: Use new swift container for eqiad pregeneration [puppet] - 10https://gerrit.wikimedia.org/r/866442 (https://phabricator.wikimedia.org/T314472) (owner: 10Jgiannelos) [10:16:42] (03PS6) 10David Caro: puppetdb: restart through systemd if service dies [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) [10:17:08] (03PS7) 10David Caro: puppetdb: restart through systemd if service dies [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) [10:17:19] (03CR) 10David Caro: puppetdb: restart through systemd if service dies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro) [10:22:00] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:22:13] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38665/console" [puppet] - 
10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro) [10:25:15] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5006.eqsin.wmnet [10:27:10] (03CR) 10Ladsgroup: [C: 03+2] Followup to 5cb38845: Don't drop revid info [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866472 (https://phabricator.wikimedia.org/T324801) (owner: 10Ladsgroup) [10:30:01] (03CR) 10JMeybohm: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [10:34:38] !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5006.eqsin.wmnet to cluster eqsin and group 1 [10:36:23] !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti5006.eqsin.wmnet to cluster eqsin and group 1 [10:37:01] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10MatthewVernon) [10:41:23] (03Merged) 10jenkins-bot: Followup to 5cb38845: Don't drop revid info [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866472 (https://phabricator.wikimedia.org/T324801) (owner: 10Ladsgroup) [10:48:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866472 (https://phabricator.wikimedia.org/T324801) (owner: 10Ladsgroup) [10:49:22] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:866472|Followup to 5cb38845: Don't drop revid info (T324801)]] [10:49:28] T324801: REST API serving content of current revision for old revisions - https://phabricator.wikimedia.org/T324801 [10:51:16] !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:866472|Followup to 5cb38845: Don't drop revid info (T324801)]] synced to the testservers: 
mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet [11:00:48] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2004.codfw.wmnet with OS bullseye [11:00:52] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2004.codfw.wmnet with OS bullseye [11:01:25] !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox [11:02:21] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:866472|Followup to 5cb38845: Don't drop revid info (T324801)]] (duration: 12m 59s) [11:02:27] T324801: REST API serving content of current revision for old revisions - https://phabricator.wikimedia.org/T324801 [11:03:55] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@17b9319] (codfw): codfw: Enable mirroring for 25% of the traffic [11:06:27] (03CR) 10Jbond: [C: 03+1] "thanks; lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro) [11:06:27] !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Decomissioning netmon2001 - cgoubert@cumin1001" [11:09:04] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@17b9319] (codfw): codfw: Enable mirroring for 25% of the traffic (duration: 05m 08s) [11:10:33] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Decomissioning netmon2001 - cgoubert@cumin1001" [11:10:33] !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [11:16:45] RECOVERY - Check systemd state on mw1358 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:17:46] !log mvernon@cumin2002 
START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2004.codfw.wmnet with reason: host reimage [11:18:55] RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:20:09] (03PS1) 10AikoChou: ml-services: update revertrisk docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/866569 (https://phabricator.wikimedia.org/T323023) [11:20:30] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2004.codfw.wmnet with reason: host reimage [11:29:04] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (we'll also need to add python-django-rq to the Debian deps separately)" [software/bitu] - 10https://gerrit.wikimedia.org/r/853290 (owner: 10Slyngshede) [11:35:52] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2004.codfw.wmnet with OS bullseye [11:35:58] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2004.codfw.wmnet with OS bullseye completed: - thanos-be2004 (**PASS**) - Downtimed on Icinga/Alertmanager... 
[11:40:33] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@d1bd7dc] (codfw): Enable geopoints on production [11:41:33] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@d1bd7dc] (codfw): Enable geopoints on production (duration: 01m 00s) [11:44:31] (03CR) 10Muehlenhoff: "Looks good, some comments inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410) (owner: 10Slyngshede) [11:46:44] (03CR) 10Jbond: [C: 03+1] remote syslog: allow hiera config of rsyslog TLS CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [11:54:30] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, one nit inline" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/865069 (owner: 10Slyngshede) [11:55:48] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/864729 (owner: 10Volans) [11:57:35] (03PS1) 10Muehlenhoff: Make ganeti5007 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/866572 (https://phabricator.wikimedia.org/T324610) [11:58:27] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync [11:58:44] !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync [12:00:03] RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:02:48] 10SRE, 10ops-eqsin, 10Infrastructure-Foundations, 10Patch-For-Review: ganeti500[567] implementation tracking - https://phabricator.wikimedia.org/T324610 (10MoritzMuehlenhoff) [12:02:49] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/865728 (owner: 10Volans) [12:05:31] PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state 
[12:08:44] (03PS7) 10Slyngshede: Bitu IDM, initial checkin [software/bitu] - 10https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410) [12:09:04] (03CR) 10Slyngshede: Bitu IDM, initial checkin (036 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410) (owner: 10Slyngshede) [12:12:22] !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@6b70e03] (codfw): Reduce mirrored traffic to 5% [12:13:05] (03PS1) 10Reedy: CommonSettings.php: Mark REL1_39 as Default Snapshot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866574 (https://phabricator.wikimedia.org/T324808) [12:13:07] (03CR) 10Jbond: [C: 03+1] profile::cumin: use bool2str to simplify code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865728 (owner: 10Volans) [12:13:09] (03PS3) 10Slyngshede: Add RQ support to Django [software/bitu] - 10https://gerrit.wikimedia.org/r/853290 [12:14:01] !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@6b70e03] (codfw): Reduce mirrored traffic to 5% (duration: 01m 39s) [12:14:09] (03CR) 10Muehlenhoff: cumin: add an audit report for insetup servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/864729 (owner: 10Volans) [12:15:23] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [12:15:30] (03PS2) 10Slyngshede: Version bump. Go to version 0.0.2. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/865069 [12:15:46] (03CR) 10Slyngshede: Version bump. Go to version 0.0.2. 
(031 comment) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/865069 (owner: 10Slyngshede) [12:16:57] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/865069 (owner: 10Slyngshede) [12:17:11] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [12:19:50] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan) [12:20:49] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [12:21:00] (03PS8) 10Slyngshede: Bitu IDM, initial checkin [software/bitu] - 10https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410) [12:24:27] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [12:28:05] (03PS12) 10Cathal Mooney: Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) [12:28:39] (03CR) 10CI reject: [V: 04-1] Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [12:29:24] (03CR) 10Jbond: [C: 03+1] rsyslog: add support for openssl netstream driver [puppet] - 10https://gerrit.wikimedia.org/r/865731 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan) [12:31:20] (03CR) 10Slyngshede: [C: 03+2] Version bump. Go to version 0.0.2. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/865069 (owner: 10Slyngshede) [12:32:33] (03Merged) 10jenkins-bot: Version bump. Go to version 0.0.2. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/865069 (owner: 10Slyngshede) [12:32:56] (03PS13) 10Cathal Mooney: Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) [12:33:59] (03CR) 10Jbond: [C: 03+1] "LGTM, much simpler then expected 😊" [puppet] - 10https://gerrit.wikimedia.org/r/865075 (owner: 10JMeybohm) [12:35:46] (03CR) 10CI reject: [V: 04-1] Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [12:36:10] (03CR) 10Jbond: [C: 03+1] "LGTM, i think we still need to think about how we fix the monitoring problem but we can tackle that later" [puppet] - 10https://gerrit.wikimedia.org/r/865591 (owner: 10JMeybohm) [12:36:17] !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be1001.eqiad.wmnet with OS bullseye [12:36:21] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be1001.eqiad.wmnet with OS bullseye [12:36:43] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10MatthewVernon) [12:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [12:36:52] (03CR) 10Jbond: [C: 03+1] pki: Add intermediates for wikikube and wikikube staging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865591 (owner: 10JMeybohm) [12:38:26] (03PS14) 10Cathal Mooney: Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) [12:39:50] (03PS1) 10Slyngshede: deb: align 
package naming. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582 [12:47:39] (03CR) 10Cathal Mooney: [C: 04-1] "Marking -1, this is not intended to be merged in current version, just some examples." [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney) [12:50:05] (03CR) 10Muehlenhoff: deb: align package naming. (031 comment) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582 (owner: 10Slyngshede) [12:58:06] good afternoon [13:02:54] (03PS4) 10Matthias Mullie: Add mediawiki.searchpreview schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845518 (https://phabricator.wikimedia.org/T321069) [13:04:41] (03PS1) 10TrainBranchBot: all wikis to 1.40.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866586 (https://phabricator.wikimedia.org/T320518) [13:04:43] (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.40.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866586 (https://phabricator.wikimedia.org/T320518) (owner: 10TrainBranchBot) [13:04:46] running da train [13:04:50] Hey hashar [13:04:53] choo choo [13:05:24] (03Merged) 10jenkins-bot: all wikis to 1.40.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866586 (https://phabricator.wikimedia.org/T320518) (owner: 10TrainBranchBot) [13:05:41] (03PS1) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2053 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/866587 (https://phabricator.wikimedia.org/T293012) [13:05:47] the thing is the broken case reported on T324801 is already fixed [13:05:49] T324801: REST API serving content of current revision for old revisions - https://phabricator.wikimedia.org/T324801 [13:05:54] potentially by the backport Amir has done this morning to wmf.13 [13:06:02] yeah it was backported this morning iiuc [13:06:02] but wikipedia are still on wmf.12 [13:06:05] so well I don't know [13:06:59] PROBLEM - High average POST latency for mw requests 
on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [13:07:19] oh joy [13:07:49] the grafana link shows an empty graph [13:08:10] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [13:08:15] cause the `var-method` wasn't set [13:08:22] !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1001.eqiad.wmnet with reason: host reimage [13:08:28] most probably something is cutting the messages when it is too long [13:09:23] the graph looks all fine, no idea why the alert has triggered [13:09:43] It's been flapping the past few days but I haven't managed to figure out why [13:11:16] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::mcrouter_wancache: Add mc2053 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/866587 (https://phabricator.wikimedia.org/T293012) (owner: 10Effie Mouzeli) [13:11:48] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1001.eqiad.wmnet with reason: host reimage [13:13:16] !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.13 refs T320518 [13:13:22] T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518 [13:13:58] et voilà! 
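Note on the empty Grafana graph discussed at 13:07-13:08: the PROBLEM alert's link ends in a bare `var-method`, while the RECOVERY link ends in `var-method=POST`. A minimal sketch of how such a truncated link could be detected follows, assuming Python's standard `urllib.parse`; the helper name `param_has_value` is hypothetical and is not part of any alerting tool mentioned in this log.

```python
from urllib.parse import parse_qsl, urlsplit


def param_has_value(url: str, name: str) -> bool:
    """Return True if query parameter `name` is present with a non-empty value.

    Hypothetical helper for illustration only.
    """
    query = urlsplit(url).query
    # keep_blank_values=True keeps parameters that have no "=value" part,
    # so a truncated "...&var-method" is reported with an empty value.
    pairs = parse_qsl(query, keep_blank_values=True)
    return any(key == name and value != "" for key, value in pairs)


BASE = (
    "https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard"
    "?panelId=9&fullscreen&orgId=1&from=now-3h&to=now"
    "&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver"
)

truncated_link = BASE + "&var-method"      # as seen in the PROBLEM messages
complete_link = BASE + "&var-method=POST"  # as seen in the RECOVERY messages

print(param_has_value(truncated_link, "var-method"))
print(param_has_value(complete_link, "var-method"))
```

Checking one named parameter, rather than flagging every valueless one, avoids a false positive on `fullscreen`, which legitimately carries no value in these links.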
[13:14:32] Now I get to see if my delay before alerting for opcache health works as intended :p [13:15:50] (03PS1) 10Marostegui: Revert "db1206: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/866473 [13:16:00] (03PS4) 10Slyngshede: Add RQ support to Django [software/bitu] - 10https://gerrit.wikimedia.org/r/853290 [13:18:18] MediaWiki logs look quiet [13:18:53] (03PS2) 10Slyngshede: deb: align package naming. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582 [13:22:29] hashar: I reckon you are deploying to k8s as well? [13:23:22] effie: Nope, except mw-debug iirc [13:23:40] why is that? [13:24:33] I've caused a repl breakage in much of s3 [13:24:34] on it [13:24:35] No I lied [13:24:52] I thought we hadn't flipped the switch for it, but we did [13:25:11] I need to apply some changes and I don't want to be in hashar's way :) [13:25:30] It's all done now [13:25:53] cgoubert@deploy1002:/srv/deployment-charts/helmfile.d/services/mw-web$ helmfile -e eqiad status 2> /dev/null | grep LAST [13:25:54] effie: yeah it is all open please do ;) [13:25:55] LAST DEPLOYED: Fri Dec 9 13:09:22 2022 [13:25:55] claime: we might get a page right now [13:25:57] LAST DEPLOYED: Fri Dec 9 13:07:57 2022 [13:26:01] Amir1: ack [13:26:05] I hope it finishes asap [13:26:07] Do you need me for something [13:26:09] ? 
[13:26:15] I think the k8s deployment is automagically handled by scap now [13:26:21] hashar: yeah it is [13:26:21] emotional support [13:26:22] at least it gave me a bunch of lines about executing helm [13:26:27] Amir1: *hug* [13:26:33] <3 [13:26:39] which I am more happy to ignore / not understand as long as those lines are green / OKish [13:26:48] heh fair enough [13:26:53] I got lucky I think [13:27:26] it made a bit of s3 read-only for a bit due to excessive lag [13:27:35] but for a minute only [13:28:40] !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be1001.eqiad.wmnet with OS bullseye [13:28:43] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1001.eqiad.wmnet with OS bullseye completed: - thanos-be1001 (**PASS**) - Downtimed on Icinga/Alertmanager... [13:30:48] Amir1: The alarm is "MariaDB sustained replica lag on s3" right? [13:31:03] there should be multiple [13:31:08] Yeah [13:31:09] but yeah, that's one of them [13:32:52] Yeah they seem to be going away without getting to the point of alerting [13:33:52] what was the cause? [13:34:07] (keeping an eye in case backups are needed) [13:34:19] (thanks for looking out for us <3) [13:35:01] 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10MatthewVernon) [13:35:18] claime: please never doubt to ask for help, even if not sure yet if a recovery is needed [13:35:42] oh, you meant oncall, sorry [13:35:56] jynus: No no I meant you, I'm on call :P [13:36:12] And I won't hesitate, thanks :) [13:36:58] the thing is, it may take some time to proceed with a recovery (not everything is fully automated, and it may never will), so preparing the nuclear weapon can take some time, even before launch! [13:37:45] *nods* understood [13:37:48] Amir1: a schema or grant change, maybe? 
[13:38:22] jynus: nothing major, I was running a schema change on bewiki to fix flaggedrevs drift, it turned out it was drifting in different hosts [13:38:39] yeah, that happens on s3, that is why it was my guess [13:38:45] so the schema change took longer than it should in some hosts choking the replication [13:38:49] it shouldn't, but it does [13:38:53] it wasn't all thankfully [13:39:06] ah, so it didn't break? it was "just" lag [13:39:11] yeah [13:39:16] so much better [13:39:32] the schema change was idempotent [13:39:35] lag spikes happen all the time [13:39:37] noop [13:39:57] yes, even on all hosts [13:47:55] (03CR) 10Marostegui: [C: 03+2] Revert "db1206: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/866473 (owner: 10Marostegui) [13:48:06] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1206', diff saved to https://phabricator.wikimedia.org/P42662 and previous config saved to /var/cache/conftool/dbconfig/20221209-134806-marostegui.json [13:56:20] (03CR) 10JMeybohm: pki: Add intermediates for wikikube and wikikube staging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865591 (owner: 10JMeybohm) [14:00:00] (03CR) 10Slyngshede: deb: align package naming. (031 comment) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582 (owner: 10Slyngshede) [14:02:42] (03CR) 10David Caro: [V: 03+1 C: 03+2] puppetdb: restart through systemd if service dies [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro) [14:02:52] (03CR) 10David Caro: [V: 03+1 C: 03+2] "I'll merge on monday" [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro) [14:03:08] (03CR) 10Muehlenhoff: deb: align package naming. (031 comment) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582 (owner: 10Slyngshede) [14:04:57] (03PS3) 10Slyngshede: deb: align package naming. 
[software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582 [14:05:05] (03CR) 10Slyngshede: deb: align package naming. (031 comment) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582 (owner: 10Slyngshede) [14:06:35] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582 (owner: 10Slyngshede) [14:06:55] (03CR) 10Slyngshede: [C: 03+2] deb: align package naming. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582 (owner: 10Slyngshede) [14:08:25] (03Merged) 10jenkins-bot: deb: align package naming. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582 (owner: 10Slyngshede) [14:10:15] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/866569 (https://phabricator.wikimedia.org/T323023) (owner: 10AikoChou) [14:16:49] (03CR) 10Klausman: [V: 03+2 C: 03+2] ml-services: update revertrisk docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/866569 (https://phabricator.wikimedia.org/T323023) (owner: 10AikoChou) [14:17:16] (03PS1) 10Jbond: blackbox::check::http: change expiry check value from days to seconds [puppet] - 10https://gerrit.wikimedia.org/r/866594 [14:18:53] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (10Jclark-ctr) 05Open→03Resolved removed power supply and reseated error has removed [14:20:00] (03PS1) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2053 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/866595 (https://phabricator.wikimedia.org/T293012) [14:20:12] 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (10Clement_Goubert) Thanks so much! 
[14:21:16] (03PS2) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2052 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/866595 (https://phabricator.wikimedia.org/T293012) [14:21:57] (03Merged) 10jenkins-bot: ml-services: update revertrisk docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/866569 (https://phabricator.wikimedia.org/T323023) (owner: 10AikoChou) [14:22:00] (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::mcrouter_wancache: Add mc2052 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/866595 (https://phabricator.wikimedia.org/T293012) (owner: 10Effie Mouzeli) [14:22:00] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:24:16] (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/863006 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond) [14:28:06] (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/866597 [14:29:11] (03PS1) 10FNegri: Reinstate innodb_large_prefix on ToolsDB [puppet] - 10https://gerrit.wikimedia.org/r/866598 (https://phabricator.wikimedia.org/T324846) [14:32:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning: hw troubleshooting: power supply for an-worker1184.eqiad.wmnet - https://phabricator.wikimedia.org/T324559 (10Jclark-ctr) 05Open→03Resolved @BTullis Reseated power supply2 fault light cleared on rear of server [14:35:37] (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/866597 (owner: 10Muehlenhoff) [14:37:00] (03PS1) 10AikoChou: ml-services: fix typo for revertrisk docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/866600 (https://phabricator.wikimedia.org/T323023) [14:38:57] (03CR) 10Klausman: [V: 03+2 C: 03+2] ml-services: 
fix typo for revertrisk docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/866600 (https://phabricator.wikimedia.org/T323023) (owner: 10AikoChou) [14:39:41] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Jclark-ctr) [14:40:54] 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Jclark-ctr) @Jgreen these have been received is this urgent or could I wait till after fundraising to rack and cable these? [14:41:29] RECOVERY - IPMI Sensor Status on an-worker1148 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [14:41:52] (03CR) 10Muehlenhoff: "One further comment inline, looks good otherwise" [software/bitu] - 10https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410) (owner: 10Slyngshede) [14:44:06] (03Merged) 10jenkins-bot: ml-services: fix typo for revertrisk docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/866600 (https://phabricator.wikimedia.org/T323023) (owner: 10AikoChou) [14:47:36] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [14:48:58] 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10Jclark-ctr) @Andrew we will need to preform flee power drain on server [14:52:30] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Jclark-ctr) @Eevans this could possibly happen next week is there a day that works best for you? I am working on another project next week with Papaul I am not available... 
[14:52:51] (03PS1) 10Jbond: cfssl::cert: add documntation and fix linting [puppet] - 10https://gerrit.wikimedia.org/r/866601 [14:52:54] (03PS1) 10Jbond: cfssl::cert: add ability to renew based on a relative value [puppet] - 10https://gerrit.wikimedia.org/r/866602 [14:57:32] (03CR) 10David Caro: [C: 03+1] "Wait for @Marostegi's ack though" [puppet] - 10https://gerrit.wikimedia.org/r/866598 (https://phabricator.wikimedia.org/T324846) (owner: 10FNegri) [14:59:30] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Decommission eqiad cage WiFi - https://phabricator.wikimedia.org/T320962 (10Jclark-ctr) Removed wifi1 from rack and ran decom script [14:59:37] 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Decommission eqiad cage WiFi - https://phabricator.wikimedia.org/T320962 (10Jclark-ctr) 05Open→03Resolved [15:04:13] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.5645 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:06:02] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Hardware): decommission cloudmetrics100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T297444 (10Jclark-ctr) [15:07:03] !log jclark@cumin1001 START - Cookbook sre.dns.netbox [15:07:45] 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Hardware): decommission cloudmetrics100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T297444 (10Jclark-ctr) 05Open→03Resolved [15:08:13] !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:09:47] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09677 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:10:14] (03CR) 10Jbond: [C: 03+2] cfssl::cert: add documntation and 
fix linting [puppet] - 10https://gerrit.wikimedia.org/r/866601 (owner: 10Jbond) [15:10:30] 10SRE, 10ops-eqiad, 10Data-Persistence (work done), 10Phabricator, and 3 others: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (10Jclark-ctr) [15:10:53] 10SRE, 10ops-eqiad, 10Data-Persistence (work done), 10Phabricator, and 3 others: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (10Jclark-ctr) 05Open→03Resolved [15:13:08] 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T324752 (10Papaul) @ayounsi @cmooney this interface is disable and i keep getting this task everything i close the task can you please check thanks. [15:14:45] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Eevans) >>! In T307035#8456972, @Jclark-ctr wrote: > @Eevans this could possibly happen next week is there a day that works best for you? I am working on another project n... [15:16:09] (03CR) 10Marostegui: [C: 03+1] "but toolsdb needs to be upgraded asap" [puppet] - 10https://gerrit.wikimedia.org/r/866598 (https://phabricator.wikimedia.org/T324846) (owner: 10FNegri) [15:32:15] (03CR) 10Hashar: "I have restarted Gerrit twice and confirmed it has fixed the issue. 
The H2 database files have been compacted successfully T323754#8454316" [puppet] - 10https://gerrit.wikimedia.org/r/865023 (https://phabricator.wikimedia.org/T323754) (owner: 10Hashar) [15:35:03] PROBLEM - IPMI Sensor Status on db1186 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:37:08] ^ I will get a task for that [15:37:19] (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große) [15:38:57] 10ops-eqiad, 10DBA, 10DC-Ops: db1186 power supplies not redundant - https://phabricator.wikimedia.org/T324858 (10Marostegui) [15:39:07] 10ops-eqiad, 10DBA, 10DC-Ops: db1186 power supplies not redundant - https://phabricator.wikimedia.org/T324858 (10Marostegui) p:05Triage→03Medium [15:40:16] ACKNOWLEDGEMENT - IPMI Sensor Status on db1186 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] Marostegui https://phabricator.wikimedia.org/T324858 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [15:40:22] (03CR) 10Michael Große: Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große) [15:44:51] (03CR) 10Ahmon Dancy: [C: 03+1] P:gitlab::runner: Add environment variable for kokkuri [puppet] - 10https://gerrit.wikimedia.org/r/866520 (owner: 10Dduvall) [15:47:57] (03PS3) 10Michael Große: Wikidata: don't show Vector search thumbnails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) [15:50:04] (03CR) 10Lucas Werkmeister (WMDE): 
Wikidata: don't show Vector search thumbnails (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große) [15:50:24] (03CR) 10Lucas Werkmeister (WMDE): Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große) [15:50:31] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4194 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [15:51:55] (03CR) 10Lucas Werkmeister (WMDE): "Hm, something isn’t right here – why does the diffConfig build now say that there are no changes?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große) [15:56:58] (03CR) 10Lucas Werkmeister (WMDE): Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große) [15:58:19] (03CR) 10Lucas Werkmeister (WMDE): Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große) [15:58:37] (03CR) 10Michael Große: Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große) [15:59:47] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.03226 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:00:56] (03CR) 10Michael Große: Wikidata: don't show Vector 
search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große) [16:04:12] (03PS1) 10Muehlenhoff: deployment servers: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/866607 (https://phabricator.wikimedia.org/T135991) [16:06:14] (03PS2) 10Jbond: cfssl::cert: add ability to renew based on a relative value [puppet] - 10https://gerrit.wikimedia.org/r/866602 [16:08:14] 10ops-eqiad, 10DBA, 10DC-Ops: db1186 power supplies not redundant - https://phabricator.wikimedia.org/T324858 (10Jclark-ctr) a:03Jclark-ctr [16:08:27] (03CR) 10Lucas Werkmeister (WMDE): Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große) [16:11:13] (03PS1) 10Jbond: cli: handle a blank change number gracefully [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/866609 [16:12:21] (03PS1) 10Jbond: puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/866610 [16:16:15] (03PS2) 10Jbond: puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/866610 [16:16:22] (03PS2) 10Jbond: cli: handle a blank change number gracefully [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/866609 [16:16:31] (03PS2) 10Tsevener: Add event stream config for ios.talk_page_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866489 (https://phabricator.wikimedia.org/T324340) [16:17:49] (03CR) 10Lucas Werkmeister (WMDE): Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große) [16:18:01] (03PS4) 10Michael Große: Wikidata: don't show Vector search thumbnails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 
(https://phabricator.wikimedia.org/T316093) [16:18:21] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3871 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:19:15] (03CR) 10CI reject: [V: 04-1] cli: handle a blank change number gracefully [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/866609 (owner: 10Jbond) [16:21:15] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "diffConfig looks good now \o/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große) [16:21:41] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-stretch1001.eqiad.wmnet with OS bullseye [16:21:49] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye [16:21:49] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-stretch1001.eqiad.wmnet with OS bullseye [16:21:55] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye... 
[16:23:53] PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.6613 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:24:38] (03CR) 10David Caro: [C: 03+1] "LGTM once the tests pass :)" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/866609 (owner: 10Jbond) [16:27:50] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-stretch1002.eqiad.wmnet with OS bullseye [16:27:51] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Cmjohnson) @papaul When I try to image these servers, the process fails immediately. This is the error I receive. Any ideas on what is wrong?... [16:27:57] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-stretch1002.eqiad.wmnet with OS bullseye [16:30:15] (03PS1) 10Ottomata: eventgate-analytics: bump replicas from 20 to 30 [deployment-charts] - 10https://gerrit.wikimedia.org/r/866612 [16:31:00] (03CR) 10Clément Goubert: [C: 03+2] eventgate-analytics: bump replicas from 20 to 30 [deployment-charts] - 10https://gerrit.wikimedia.org/r/866612 (owner: 10Ottomata) [16:31:17] (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:31:56] (03CR) 10Jforrester: [C: 03+1] "Oops, knew I forgot something. Thanks!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/866574 (https://phabricator.wikimedia.org/T324808) (owner: 10Reedy) [16:32:13] Working on ^^ in #wikimedia-serviceops [16:32:21] (03CR) 10Ottomata: [V: 03+2] eventgate-analytics: bump replicas from 20 to 30 [deployment-charts] - 10https://gerrit.wikimedia.org/r/866612 (owner: 10Ottomata) [16:32:23] page acknowledged [16:33:54] !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply [16:34:35] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:34:58] !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply [16:35:16] !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply [16:36:17] (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy [16:36:28] !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply [16:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [16:36:55] RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. 
https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1 [16:38:25] !log cmjohnson@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-stretch1001 - cmjohnson@cumin1001" [16:39:28] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-stretch1001 - cmjohnson@cumin1001" [16:39:28] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:39:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-stretch1002.eqiad.wmnet with reason: host reimage [16:42:46] !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply [16:42:54] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-stretch1002.eqiad.wmnet with reason: host reimage [16:43:08] !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply [16:44:05] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Papaul) @Cmjohnson try to delete the kafka-stretch1001.conf on install1003 and try again an let me know [16:56:14] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:57:24] !log cmjohnson@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmjohnson@cumin1001" [16:58:20] !log cmjohnson@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-stretch1001 - cmjohnson@cumin1001" [16:58:47] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by 
cookbooks.sre.hosts.reimage: Host reimage - cmjohnson@cumin1001" [16:58:52] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-stretch1002.eqiad.wmnet with OS bullseye [16:58:58] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-stretch1002.eqiad.wmnet with OS bullseye... [16:59:19] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-stretch1001 - cmjohnson@cumin1001" [16:59:20] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [17:00:07] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host kafka-stretch1001.mgmt.eqiad.wmnet with reboot policy FORCED [17:02:50] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-stretch1001.mgmt.eqiad.wmnet with reboot policy FORCED [17:03:31] !log eventgate-analytics bumped to 30 replicas to absorb increased load - T320518 [17:03:32] 10SRE, 10LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (10jhathaway) @Wangombe apologies for not noticing this earlier but your developer account or wikitech account is linked to your personal email address. Would you kindly change that to your @wikim... 
[17:03:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:34] T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518 [17:09:20] (03PS1) 10JHathaway: Add Stephanie Delbecque to the wmf group [puppet] - 10https://gerrit.wikimedia.org/r/866620 (https://phabricator.wikimedia.org/T324753) [17:09:42] (03PS3) 10Jbond: cli: handle a blank change number gracefully [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/866609 [17:10:06] 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for SDelbecque - https://phabricator.wikimedia.org/T324753 (10jhathaway) 05Open→03Resolved a:03jhathaway done! [17:10:38] (03PS3) 10Jbond: puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/866610 [17:13:10] (03CR) 10Jbond: [C: 03+2] cli: handle a blank change number gracefully [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/866609 (owner: 10Jbond) [17:13:17] (03CR) 10Jbond: [C: 03+2] puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/866610 (owner: 10Jbond) [17:14:18] (03PS3) 10Hashar: gerrit: script to report on git gc durations [puppet] - 10https://gerrit.wikimedia.org/r/856601 (https://phabricator.wikimedia.org/T237807) [17:15:24] (03CR) 10Hashar: "I have updated the script shebang to point to /usr/bin/python3. I have tested the script on gerrt1001." 
[puppet] - 10https://gerrit.wikimedia.org/r/856601 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar) [17:19:16] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38673/console" [puppet] - 10https://gerrit.wikimedia.org/r/866602 (owner: 10Jbond) [17:19:58] (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38674/console" [puppet] - 10https://gerrit.wikimedia.org/r/866598 (https://phabricator.wikimedia.org/T324846) (owner: 10FNegri) [17:44:03] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx2001.wikimedia.org with reason: Moar Disk [17:44:16] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx2001.wikimedia.org with reason: Moar Disk [17:59:37] !log jnuche@deploy1002 Installing scap version "4.30.2" for 563 hosts [18:01:55] (03CR) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [18:02:11] (03CR) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [18:03:05] (03PS13) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) [18:03:07] (03PS7) 10Andrew Bogott: remote syslog: allow hiera config of rsyslog TLS CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) [18:04:16] !log jnuche@deploy1002 Installing scap version "4.30.2" for 562 hosts [18:04:40] !log jnuche@deploy1002 Installation of scap version "4.30.2" completed for 562 hosts [18:06:06] (03CR) 10Andrew Bogott: [C: 03+1] rsyslog: add 
support for openssl netstream driver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865731 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan) [18:14:55] 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh) [18:16:45] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx2001.wikimedia.org with reason: Moar Disk 2! [18:16:48] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx2001.wikimedia.org with reason: Moar Disk 2! [18:22:00] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:22:58] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99) [18:23:34] (03PS1) 10Andrew Bogott: Added some comments about where/how cloud hiera settings are applied [puppet] - 10https://gerrit.wikimedia.org/r/866625 [18:24:26] (03CR) 10Andrew Bogott: "Adding the two of you as reviewers to this extremely trivial patch because I never cease to be surprised at how hiera lookup works." 
[puppet] - 10https://gerrit.wikimedia.org/r/866625 (owner: 10Andrew Bogott) [18:27:56] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:32:44] (03PS1) 10Daniel Kinzler: hewiki: set VisualEditor to direct mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866627 (https://phabricator.wikimedia.org/T320529) [18:32:52] (03CR) 10CI reject: [V: 04-1] hewiki: set VisualEditor to direct mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866627 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler) [18:39:04] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:41:07] (03PS14) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) [18:41:09] (03PS8) 10Andrew Bogott: remote syslog: allow hiera config of rsyslog TLS CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) [18:41:11] (03PS1) 10Andrew Bogott: Turn on central auth logging for all eqiad1 VMs [puppet] - 10https://gerrit.wikimedia.org/r/866628 (https://phabricator.wikimedia.org/T127717) [18:46:39] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/output/866628/38675/tools-sgebastion-10.tools.eqiad1.wikimedia.cloud/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/866628 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [18:52:41] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (10KFrancis) @jhathaway The NDA has been signed. Please proceed with the access request. Thanks! 
[18:54:20] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Cmjohnson) KS1002 was installed without an issue, I started over with KS1001 but the mgmt IP address changed and the provision script didn't wor... [19:09:32] 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (10jhathaway) 05Open→03Resolved a:03jhathaway @Muhammad_Yasser_Jazirahly_WMDE groups added, enjoy! [19:13:24] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:13:38] (03PS11) 10Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) [19:14:14] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:15:12] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.756 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:15:34] 10SRE, 10Data-Engineering-Planning, 10WMF-Communications: LDAP access for Sondes to access Matomo - https://phabricator.wikimedia.org/T324696 (10EdErhart-WMF) [19:16:39] (03PS2) 10Ottomata: [WIP] - flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [19:17:27] (03CR) 10CI reject: [V: 04-1] [WIP] - flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [19:17:37] (03PS10) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [19:19:42] 
RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:20:07] (03PS3) 10Ottomata: [WIP] - flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [19:20:56] (03CR) 10CI reject: [V: 04-1] [WIP] - flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [19:23:03] 10SRE, 10Data-Engineering-Planning, 10WMF-Communications: LDAP access for Sondes to access Matomo - https://phabricator.wikimedia.org/T324696 (10EdErhart-WMF) Tagging @Marostegui and @jcrespo per their recent involvement with LDAP access requests [19:34:10] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: Moar Disk! [19:34:23] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: Moar Disk! [19:35:54] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [19:39:06] (03CR) 10Herron: [C: 03+1] netmon: Remove the netmon1002 instance as passive node [puppet] - 10https://gerrit.wikimedia.org/r/866526 (https://phabricator.wikimedia.org/T322321) (owner: 10Andrea Denisse) [19:39:38] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [19:42:19] (03PS1) 10Eevans: Configure new cassandra-dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113) [19:46:14] (03PS2) 10Eevans: Configure new cassandra-dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113) [19:47:06] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [19:47:53] (03PS3) 10Eevans: Configure new cassandra-dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113) [19:48:56] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [19:49:56] !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: Moar Disk! [19:49:58] !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: Moar Disk! 
[19:50:29] (03CR) 10Herron: librenms: Increase the TTL for LibreNMS (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse) [19:55:10] 10SRE, 10Infrastructure-Foundations, 10Mail: MX: increasing disk space - https://phabricator.wikimedia.org/T305567 (10jhathaway) 05Open→03Resolved bumped both MXes, `mx{1001,2001}.wikimedia.org` to 50G root partitions [20:05:22] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:06:26] (03PS4) 10Ottomata: flink-kubernetes-operator - Initial commit of upstream helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/865100 (https://phabricator.wikimedia.org/T324576) [20:06:28] (03PS11) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) [20:06:31] (03PS4) 10Ottomata: [WIP] - flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [20:07:23] (03CR) 10CI reject: [V: 04-1] [WIP] - flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [20:11:12] PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method [20:11:23] (ThanosQueryHttpRequestQueryRangeErrorRateHigh) firing: Thanos Query is failing to handle requests. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh [20:12:23] (ThanosQueryRangeLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh [20:12:48] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [20:13:02] RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST [20:15:11] (03PS1) 10Andrew Bogott: puppetmasters: cache cleanup [puppet] - 10https://gerrit.wikimedia.org/r/866644 [20:15:29] (03CR) 10CI reject: [V: 04-1] puppetmasters: cache cleanup [puppet] - 10https://gerrit.wikimedia.org/r/866644 (owner: 10Andrew Bogott) [20:16:23] (ThanosQueryHttpRequestQueryRangeErrorRateHigh) resolved: Thanos Query is failing to handle requests. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh [20:16:30] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:17:23] (ThanosQueryRangeLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh [20:17:48] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [20:19:23] (03PS2) 10Andrew Bogott: puppetmasters: cache cleanup [puppet] - 10https://gerrit.wikimedia.org/r/866644 [20:19:52] (03PS12) 10Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) [20:20:09] (03PS13) 10Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519) [20:21:08] 10SRE, 10Data-Engineering-Planning, 10WMF-Communications: LDAP access for Sondes to access Matomo - https://phabricator.wikimedia.org/T324696 (10jcrespo) Hi, @EdErhart-WMF . There is no need to tag anyone- SRE has a clinic duty procedure in which someone on rotation attends LDAP requests every week. I sugges... 
[20:26:21] (03PS3) 10Andrew Bogott: puppetmasters: cache cleanup [puppet] - 10https://gerrit.wikimedia.org/r/866644 [20:28:07] (03PS1) 10Eevans: keys & certs for (new) cassandra-dev cluster [labs/private] - 10https://gerrit.wikimedia.org/r/866646 (https://phabricator.wikimedia.org/T324113) [20:29:57] (03PS4) 10Andrew Bogott: puppetmasters: cache cleanup [puppet] - 10https://gerrit.wikimedia.org/r/866644 [20:32:58] (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/output/866644/38681/puppetmaster1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/866644 (owner: 10Andrew Bogott) [20:33:03] (03CR) 10Eevans: [V: 03+2 C: 03+2] keys & certs for (new) cassandra-dev cluster [labs/private] - 10https://gerrit.wikimedia.org/r/866646 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans) [20:33:11] (03PS2) 10Sbailey: enable migrate namespace function on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866518 (https://phabricator.wikimedia.org/T299612) [20:34:34] (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans) [20:36:48] (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed [20:57:14] (03PS4) 10Eevans: Configure new cassandra-dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113) [20:57:32] (03CR) 10CI reject: [V: 04-1] Configure new cassandra-dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans) [21:11:12] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:22:22] RECOVERY - Check systemd state on thanos-fe1001 
is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:31:42] (03PS1) 10JHathaway: Add Kwaku Addo Ofori to ops & wmf [puppet] - 10https://gerrit.wikimedia.org/r/866649 [21:55:03] (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:56:45] (03PS1) 10Aqu: HDFS FSImage is backed up to HDFS on monday [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) [21:57:02] (03CR) 10CI reject: [V: 04-1] HDFS FSImage is backed up to HDFS on monday [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu) [21:58:26] (03PS2) 10Aqu: HDFS FSImage is backed up to HDFS on monday [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) [22:00:03] (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [22:15:10] (03PS2) 10Arlolra: Disable wgParserEnableLegacyMediaDOM on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865214 (https://phabricator.wikimedia.org/T314318) [22:15:12] (03PS1) 10Arlolra: Disable wgParserEnableLegacyMediaDOM on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866653 (https://phabricator.wikimedia.org/T297984) [22:18:44] (03CR) 10Subramanya Sastry: [C: 03+1] Disable wgParserEnableLegacyMediaDOM on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866653 
(https://phabricator.wikimedia.org/T297984) (owner: 10Arlolra) [22:22:00] (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:31:28] (03CR) 10Subramanya Sastry: [C: 03+2] enable migrate namespace function on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866518 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [22:32:13] (03Merged) 10jenkins-bot: enable migrate namespace function on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866518 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [22:47:58] (03PS5) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) [22:48:44] (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [23:03:34] 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Eevans) >>! In T307035#8456972, @Jclark-ctr wrote: > @Eevans this could possibly happen next week is there a day that works best for you? I am working on another project n... 
[23:08:02] (03PS3) 10Herron: slo_dashboards: dynamic slo dashboard panels [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) [23:11:33] (03CR) 10Herron: "updated to include new wdqs slo and improve panel layout using rows per slo" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron) [23:17:03] (03PS2) 10Cwhite: hiera: map logstash.wm.o to kibana7.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/828112 (https://phabricator.wikimedia.org/T304440) [23:24:36] (03CR) 10Cwhite: [C: 03+1] netmon: Remove the netmon1002 instance as passive node [puppet] - 10https://gerrit.wikimedia.org/r/866526 (https://phabricator.wikimedia.org/T322321) (owner: 10Andrea Denisse) [23:24:39] (03PS1) 10Krinkle: Add Largest Contentful Paint (LCP) [extensions/NavigationTiming] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866480 (https://phabricator.wikimedia.org/T319329) [23:31:26] (03CR) 10Cwhite: rsyslog: allow specifying a hiera-defined certfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [23:32:03] (03CR) 10Cwhite: [C: 03+1] rsyslog: add support for openssl netstream driver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865731 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan) [23:37:42] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/866628 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [23:39:16] !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: search_eqiad elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776 [23:39:22] T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776 [23:47:45] (03CR) 10Southparkfan: [C: 03+1] "After 
https://gerrit.wikimedia.org/r/c/operations/puppet/+/865174/, https://gerrit.wikimedia.org/r/c/operations/puppet/+/865731/ and https" [puppet] - 10https://gerrit.wikimedia.org/r/866628 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott) [23:59:53] !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)