[00:00:20] <icinga-wm>	 RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:14] <icinga-wm>	 PROBLEM - Check systemd state on maps1009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:05:38] <icinga-wm>	 PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:08:40] <wikibugs>	 (CR) Cwhite: rsyslog: allow specifying a hiera-defined certfile (1 comment) [puppet] - https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: Andrew Bogott)
[00:13:12] <wikibugs>	 (PS2) Andrea Denisse: librenms: Increase the TTL for LibreNMS [dns] - https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695)
[00:14:36] <wikibugs>	 (CR) Andrea Denisse: librenms: Increase the TTL for LibreNMS (1 comment) [dns] - https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: Andrea Denisse)
[00:17:53] <wikibugs>	 (CR) Cwhite: [C: +1] "Change LGTM." [puppet] - https://gerrit.wikimedia.org/r/865731 (https://phabricator.wikimedia.org/T324623) (owner: Southparkfan)
[00:18:27] <wikibugs>	 (CR) Cwhite: [C: +1] "Thanks!" [dns] - https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: Andrea Denisse)
[00:30:17] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.decommission for hosts restbase-dev2002.codfw.wmnet
[00:30:56] <wikibugs>	 (CR) Cwhite: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) (owner: Andrew Bogott)
[00:32:45] <wikibugs>	 (CR) Zabe: librenms: Increase the TTL for LibreNMS (1 comment) [dns] - https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: Andrea Denisse)
[00:34:52] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.dns.netbox
[00:36:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[00:37:02] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase-dev2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001"
[00:38:29] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase-dev2002.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001"
[00:38:29] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:38:29] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase-dev2002.codfw.wmnet
[00:39:06] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.decommission for hosts restbase-dev2003.codfw.wmnet
[00:43:32] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.dns.netbox
[00:45:32] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase-dev2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001"
[00:46:54] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: restbase-dev2003.codfw.wmnet decommissioned, removing all IPs except the asset tag one - eevans@cumin1001"
[00:46:54] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[00:46:55] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts restbase-dev2003.codfw.wmnet
[00:54:18] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[00:56:10] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[01:01:50] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.dns.netbox
[01:04:13] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename restbase-dev200x hosts to cassandra-dev200x - eevans@cumin1001"
[01:05:17] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Rename restbase-dev200x hosts to cassandra-dev200x - eevans@cumin1001"
[01:05:17] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:05:47] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cassandra-dev2002
[01:06:22] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cassandra-dev2002
[01:06:27] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.network.configure-switch-interfaces for host cassandra-dev2003
[01:07:02] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.network.configure-switch-interfaces (exit_code=0) for host cassandra-dev2003
[01:11:30] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host cassandra-dev2002.codfw.wmnet with OS buster
[01:30:33] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage
[01:33:37] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2002.codfw.wmnet with reason: host reimage
[01:41:46] <jinxer-wm>	 (JobUnavailable) firing: (9) Reduced availability for job nginx in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:47:31] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001"
[01:48:46] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001"
[01:48:47] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2002.codfw.wmnet with OS buster
[01:49:37] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.reimage for host cassandra-dev2003.codfw.wmnet with OS buster
[01:51:46] <jinxer-wm>	 (JobUnavailable) firing: (10) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:56:46] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:04:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:05:40] <legoktm>	 I fired the klaxon for T324801
[02:05:42] <stashbot>	 T324801: REST API serving content of current revision for old revisions - https://phabricator.wikimedia.org/T324801
[02:06:46] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:08:20] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on cassandra-dev2003.codfw.wmnet with reason: host reimage
[02:08:56] <cwhite>	 legoktm: hello
[02:09:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:09:13] <legoktm>	 hi! long-time no outage ;)
[02:09:30] <legoktm>	 so, restbase is serving the current revision as old revisions
[02:09:57] <legoktm>	 I don't know if it's a MW change or restbase, TheresNoTime is poking at it, and I was talking to Arlo and Subbu out of band
[02:10:15] <legoktm>	 noting that there were some MW core rest API changes this week: https://github.com/wikimedia/mediawiki/commits/master/includes/Rest
[02:10:48] <TheresNoTime>	 my money is starting to be on https://gerrit.wikimedia.org/r/c/mediawiki/core/+/864138 fwiw
[02:10:56] <legoktm>	 I'm out of the loop on who should be responding to this / what the remediation should be
[02:11:17] <legoktm>	 TheresNoTime: that's not deployed AFAIS?
[02:11:23] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on cassandra-dev2003.codfw.wmnet with reason: host reimage
[02:11:25] <legoktm>	 or are you saying the lack of it is the issue?
[02:11:56] <TheresNoTime>	 nope, ignore me
[02:12:53] <legoktm>	 ok, cscott is also suggesting a train rollback: https://phabricator.wikimedia.org/T324801#8455865
[02:13:22] * cwhite looks for rollback instructions
[02:15:38] <legoktm>	 cwhite: https://wikitech.wikimedia.org/wiki/Heterogeneous_deployment/Train_deploys#Rollback
[02:17:24] <cwhite>	 ok, here we go
[02:20:43] <legoktm>	 I'm peeking at deploy1002, neat, had missed that the container build process is integrated with scap now
[02:21:29] <cwhite>	 hehe yeah that's the wait.
[02:21:46] <jinxer-wm>	 (JobUnavailable) firing: (11) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:22:53] <cwhite>	 I'm not sure "To rollback a wikiversion change, it should be pretty quick." is true any more.  This scap triggers my "that command is taking too long" sense.
[02:23:06] <subbu>	 o/
[02:23:35] <legoktm>	 I wonder if it has to rebuild the l10n cache since we're adding an old MW version back in
[02:23:43] <logmsgbot>	 !log eevans@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001"
[02:27:04] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - eevans@cumin1001"
[02:27:05] <logmsgbot>	 !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cassandra-dev2003.codfw.wmnet with OS buster
[02:28:29] <legoktm>	 well, it's at the image push step
[02:29:14] <cwhite>	 on docker_pull_k8s now
[02:29:45] <legoktm>	 my theory about rebuilding the l10n cache is probably wrong, this image is roughly the same size as the older ones
[02:30:03] <jinxer-wm>	 (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:30:06] <cwhite>	 note to self: ask someone if 10 minute `scap sync-wikiversions` is normal
[02:31:05] <legoktm>	 I will take notes as not-really-IC :p
[02:32:18] <cwhite>	 50% complete
[02:33:45] <TheresNoTime>	 cwhite: 10m is quite high for that stage.. 
[02:34:21] <subbu>	 Can you hold off for a bit before you actually roll back the train?
[02:34:31] <subbu>	 We are discussing if there is something on the train that cannot be rolled back.
[02:34:36] <cwhite>	 subbu: rollback is already in flight
[02:34:40] <subbu>	 i see.
[02:35:03] <legoktm>	 :v
[02:35:03] <jinxer-wm>	 (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[02:37:23] <TheresNoTime>	 well I've broken my local restbase so (:
[02:39:20] <logmsgbot>	 !log cwhite@deploy1002 rebuilt and synchronized wikiversions files: Revert "group2 wikis to 1.40.0-wmf.13"
[02:39:29] <subbu>	 if VE editing breaks with rollback, we'll have to roll forward the train again (and fix the REST API issue or roll back a specific patch causing the issue).  
[02:39:49] <cwhite>	 rollback complete
[02:40:37] <subbu>	 ok .. will test now.
[02:40:54] <urandom>	 looks like the right revision is being served now
[02:41:31] <TheresNoTime>	 (on group 2 wikis)
[02:41:55] <subbu>	 seems like editing is not broken.
[02:42:21] <subbu>	 but, will test a bit more.
[02:43:16] <cwhite>	 hmm, I can't git push origin from deploy1002
[02:46:08] <wikibugs>	 (PS1) Cwhite: Revert "group2 wikis to 1.40.0-wmf.13" [mediawiki-config] - https://gerrit.wikimedia.org/r/866525 (https://phabricator.wikimedia.org/T324801)
[02:46:10] <wikibugs>	 (CR) Cwhite: [C: +2] Revert "group2 wikis to 1.40.0-wmf.13" [mediawiki-config] - https://gerrit.wikimedia.org/r/866525 (https://phabricator.wikimedia.org/T324801) (owner: Cwhite)
[02:46:17] <cwhite>	 there we go
[02:46:46] <wikibugs>	 (Merged) jenkins-bot: Revert "group2 wikis to 1.40.0-wmf.13" [mediawiki-config] - https://gerrit.wikimedia.org/r/866525 (https://phabricator.wikimedia.org/T324801) (owner: Cwhite)
[02:47:19] <subbu>	 oh, so, editing seems to be working fine after the rollback on enwiki.
[02:48:11] * cwhite resolves klaxon page
[02:48:32] <subbu>	 TheresNoTime, editing the not-current revision in VE is not a common use case .. so, I think we can probably wait to fix this and roll forward again tomorrow.
[02:48:51] <subbu>	 Might need daniel to look tomorrow .. but I'll start poking around at the deployed patches in core.
[02:49:32] <cwhite>	 Ping me here or via klaxon if you need me.  I'm going to step away and grab a bite but will stay nearby.
[02:50:11] <subbu>	 thanks! Ya, I should go eat my dinner as well ... was at a restaurant and had just ordered food ... got it packed up and came home. :) 
[02:50:53] <subbu>	 but, i'll hang around in case anyone reports anything else. thanks legoktm for stepping in as well.
[02:58:31] <legoktm>	 thanks cwhite!
[03:00:17] <cwhite>	 Thank you, legoktm!  Good to see you and I hope you're doing well :)
[03:00:26] <legoktm>	 :D
[03:25:18] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:27:10] <wikibugs>	 (PS1) Andrea Denisse: netmon: Remove the netmon1002 instance as passive node [puppet] - https://gerrit.wikimedia.org/r/866526 (https://phabricator.wikimedia.org/T322321)
[03:28:31] <wikibugs>	 (CR) Andrea Denisse: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38658/console" [puppet] - https://gerrit.wikimedia.org/r/866526 (https://phabricator.wikimedia.org/T322321) (owner: Andrea Denisse)
[03:30:11] <wikibugs>	 (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/866526/" [puppet] - https://gerrit.wikimedia.org/r/866526 (https://phabricator.wikimedia.org/T322321) (owner: Andrea Denisse)
[03:46:14] <icinga-wm>	 PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 106 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:46:36] <icinga-wm>	 PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[03:48:06] <icinga-wm>	 RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[03:51:39] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: search_eqiad elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776
[03:51:46] <stashbot>	 T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776
[03:52:21] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.hosts.downtime for 3:00:00 on 50 hosts with reason: Rolling restart in progress
[03:52:54] <logmsgbot>	 !log ryankemper@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3:00:00 on 50 hosts with reason: Rolling restart in progress
[04:00:26] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:09:00] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: search_eqiad elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776
[04:09:06] <stashbot>	 T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776
[04:24:44] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[04:57:37] <subbu>	 I uploaded a fix for the UBN .. hopefully daniel can review early tomorrow and test and we can roll the train forward again.
[05:00:08] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:03:21] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: search_eqiad elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776
[05:03:28] <stashbot>	 T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776
[05:10:08] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:10:51] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: search_eqiad elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776
[05:10:58] <stashbot>	 T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776
[05:13:33] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: search_eqiad elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776
[05:21:18] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:24:18] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:28:29] <logmsgbot>	 !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: search_eqiad elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776
[05:28:36] <stashbot>	 T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776
[05:41:20] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[05:52:34] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:01:08] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:18:56] <wikibugs>	 (PS1) Marostegui: db1206: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/866529
[06:20:05] <wikibugs>	 (CR) Marostegui: [C: +2] db1206: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/866529 (owner: Marostegui)
[06:20:27] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 1%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42655 and previous config saved to /var/cache/conftool/dbconfig/20221209-062027-root.json
[06:22:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[06:25:24] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:35:32] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 5%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42656 and previous config saved to /var/cache/conftool/dbconfig/20221209-063532-root.json
[06:50:37] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 10%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42657 and previous config saved to /var/cache/conftool/dbconfig/20221209-065037-root.json
[06:55:24] <marostegui>	 !log Deploy schema change on s6 T324797
[06:55:28] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:29] <stashbot>	 T324797: Add primary key and drop unique index on securepoll_msgs on wmf wikis - https://phabricator.wikimedia.org/T324797
[06:57:16] <marostegui>	 !log Deploy schema change on s8 T324797
[06:57:22] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:58:42] <marostegui>	 !log Deploy schema change on s7 T324797
[06:58:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:00:15] <marostegui>	 !log Deploy schema change on s4 T324797
[07:00:19] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 25%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42658 and previous config saved to /var/cache/conftool/dbconfig/20221209-070542-root.json
[07:16:00] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:19:48] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[07:20:47] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 50%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42659 and previous config saved to /var/cache/conftool/dbconfig/20221209-072047-root.json
[07:21:20] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[07:21:42] <wikibugs>	 SRE, Infrastructure-Foundations, vm-requests: CODFW: 1 VM requested for test of reimaging cookbook - https://phabricator.wikimedia.org/T324744 (SLyngshede-WMF) Open→Resolved
[07:23:32] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:28:53] <marostegui>	 !log Deploy schema change on s2 T324797
[07:28:58] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:28:59] <stashbot>	 T324797: Add primary key and drop unique index on securepoll_msgs on wmf wikis - https://phabricator.wikimedia.org/T324797
[07:29:21] <marostegui>	 !log dbmaint schema change on s2 T324797
[07:29:22] <marostegui>	 !log dbmaint schema change on s4 T324797
[07:29:23] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:24] <marostegui>	 !log dbmaint schema change on s7 T324797
[07:29:26] <marostegui>	 !log dbmaint schema change on s8 T324797
[07:29:26] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:29] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:31] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:29:33] <marostegui>	 !log dbmaint schema change on s6 T324797
[07:29:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:31:26] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[07:35:02] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[07:35:52] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 75%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42660 and previous config saved to /var/cache/conftool/dbconfig/20221209-073552-root.json
[07:36:01] <marostegui>	 !log dbmaint schema change on s1 T324797
[07:36:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:36:05] <stashbot>	 T324797: Add primary key and drop unique index on securepoll_msgs on wmf wikis - https://phabricator.wikimedia.org/T324797
[07:36:51] <marostegui>	 !log dbmaint schema change on s5 T324797
[07:36:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:40:34] <wikibugs>	 (PS4) Slyngshede: WIP: Signup and LDAP flow. [software/bitu] - https://gerrit.wikimedia.org/r/860021
[07:45:58] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:50:57] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1206 (re)pooling @ 100%: Testing new RAID controller', diff saved to https://phabricator.wikimedia.org/P42661 and previous config saved to /var/cache/conftool/dbconfig/20221209-075057-root.json
[07:56:46] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:00:04] <jouncebot>	 Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20221209T0800)
[08:00:50] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:02:16] <marostegui>	 !log dbmaint schema change on s3 T324797
[08:02:21] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:02:22] <stashbot>	 T324797: Add primary key and drop unique index on securepoll_msgs on wmf wikis - https://phabricator.wikimedia.org/T324797
[08:05:40] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[08:11:04] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[08:16:32] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[08:24:24] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: eventlogging_to_druid_editattemptstep_hourly.service,eventlogging_to_druid_navigationtiming_hourly.service,eventlogging_to_druid_prefupdate_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[08:25:34] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[08:31:24] <wikibugs>	 (03CR) 10Jgiannelos: [C: 03+1] enable migrate namespace function on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866518 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey)
[08:35:49] <marostegui>	 !log dbmaint schema change on s3@eqiad T324797
[08:35:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:35:55] <stashbot>	 T324797: Add primary key and drop unique index on securepoll_msgs on wmf wikis - https://phabricator.wikimedia.org/T324797
[08:36:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[08:38:46] <marostegui>	 !log dbmaint schema change on s1@eqiad T324797
[08:38:48] <marostegui>	 !log dbmaint schema change on s2@eqiad T324797
[08:38:49] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:49] <marostegui>	 !log dbmaint schema change on s4@eqiad T324797
[08:38:51] <marostegui>	 !log dbmaint schema change on s5@eqiad T324797
[08:38:52] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:53] <marostegui>	 !log dbmaint schema change on s6@eqiad T324797
[08:38:54] <marostegui>	 !log dbmaint schema change on s7@eqiad T324797
[08:38:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:38:56] <marostegui>	 !log dbmaint schema change on s8@eqiad T324797
[08:38:57] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:00] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:03] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:39:06] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:40:45] <wikibugs>	 (03CR) 10Hashar: Replace CI results table by Gerrit Check API (031 comment) [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/859083 (https://phabricator.wikimedia.org/T214068) (owner: 10Hashar)
[08:58:42] <wikibugs>	 (03PS1) 10David Caro: puppetdb: restart through systemd if service dies [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812)
[08:59:16] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetdb: restart through systemd if service dies [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro)
[09:00:32] <wikibugs>	 (03PS2) 10David Caro: puppetdb: restart through systemd if service dies [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812)
[09:00:36] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:04:27] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38660/console" [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro)
[09:07:08] <wikibugs>	 10SRE, 10Scap: Wrong umask when deploying from screen - https://phabricator.wikimedia.org/T200690 (10Tgr) >>! In T200690#8265133, @dancy wrote: > @Tgr Can you confirm that this is still a problem?  Probably not because these days you'd use `scap backport` which runs git commands as the deploy user, there isn't...
[09:10:15] <wikibugs>	 (03PS1) 10Muehlenhoff: Make ganeti5006 a ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/866553 (https://phabricator.wikimedia.org/T324610)
[09:10:24] <wikibugs>	 (03PS3) 10David Caro: puppetdb: restart through systemd if service dies [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812)
[09:12:00] <icinga-wm>	 PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[09:12:52] <icinga-wm>	 PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[09:13:23] <wikibugs>	 (03PS4) 10David Caro: puppetdb: restart through systemd if service dies [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812)
[09:16:49] <wikibugs>	 (03PS5) 10David Caro: puppetdb: restart through systemd if service dies [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812)
[09:18:27] <wikibugs>	 (03CR) 10Hashar: Boilerplate for QUnit testing (032 comments) [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/861486 (owner: 10Hashar)
[09:19:03] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38663/console" [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro)
[09:19:47] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "Now it's ready :), pcc looks ok" [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro)
[09:20:24] <wikibugs>	 (03CR) 10Hashar: Replace CI results table by Gerrit Check API (031 comment) [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/859083 (https://phabricator.wikimedia.org/T214068) (owner: 10Hashar)
[09:21:25] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] Make ganeti5006 a ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/866553 (https://phabricator.wikimedia.org/T324610) (owner: 10Muehlenhoff)
[09:24:07] <wikibugs>	 (03PS17) 10Hashar: Replace CI results table by Gerrit Check API [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/859083 (https://phabricator.wikimedia.org/T214068)
[09:24:09] <wikibugs>	 (03PS7) 10Hashar: Add unit testing with QUnit [software/gerrit] (deploy/wmf/stable-3.5) - 10https://gerrit.wikimedia.org/r/861486
[09:34:04] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2003.codfw.wmnet with OS bullseye
[09:34:09] <wikibugs>	 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2003.codfw.wmnet with OS bullseye
[09:45:39] <icinga-wm>	 RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[09:46:41] <icinga-wm>	 RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[09:47:21] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38664/console" [puppet] - 10https://gerrit.wikimedia.org/r/861463 (https://phabricator.wikimedia.org/T256762) (owner: 10BBlack)
[09:51:08] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2003.codfw.wmnet with reason: host reimage
[09:53:54] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2003.codfw.wmnet with reason: host reimage
[09:57:59] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] webservice cli: allow for deployment of custom harbor images (031 comment) [software/tools-webservice] - 10https://gerrit.wikimedia.org/r/857588 (https://phabricator.wikimedia.org/T293645) (owner: 10Raymond Ndibe)
[09:59:23] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1 C: 03+1] "Oops, did not see your PCC comment. LGTM, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/861463 (https://phabricator.wikimedia.org/T256762) (owner: 10BBlack)
[10:00:03] <hashar>	 I will look at moving the train forward at 13:00 UTC
[10:05:30] <claime>	 hashar: Will you be moving it forward at 1300 or just looking at it? :p 
[10:05:58] <claime>	 (I'll be there, all jokes aside)
[10:08:50] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] puppetdb: restart through systemd if service dies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro)
[10:09:10] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2003.codfw.wmnet with OS bullseye
[10:09:13] <wikibugs>	 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2003.codfw.wmnet with OS bullseye completed: - thanos-be2003 (**PASS**)   - Downtimed on Icinga/Alertmanager...
[10:12:28] <wikibugs>	 (03CR) 10Muehlenhoff: puppetdb: restart through systemd if service dies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro)
[10:14:57] <wikibugs>	 (03PS1) 10Ladsgroup: Followup to 5cb38845: Don't drop revid info [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866472 (https://phabricator.wikimedia.org/T324801)
[10:15:06] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti5006.eqsin.wmnet
[10:15:07] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] puppetdb: restart through systemd if service dies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro)
[10:16:33] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] maps: Use new swift container for eqiad pregeneration [puppet] - 10https://gerrit.wikimedia.org/r/866442 (https://phabricator.wikimedia.org/T314472) (owner: 10Jgiannelos)
[10:16:42] <wikibugs>	 (03PS6) 10David Caro: puppetdb: restart through systemd if service dies [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812)
[10:17:08] <wikibugs>	 (03PS7) 10David Caro: puppetdb: restart through systemd if service dies [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812)
[10:17:19] <wikibugs>	 (03CR) 10David Caro: puppetdb: restart through systemd if service dies (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro)
[10:22:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[10:22:13] <wikibugs>	 (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38665/console" [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro)
[10:25:15] <logmsgbot>	 !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti5006.eqsin.wmnet
[10:27:10] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] Followup to 5cb38845: Don't drop revid info [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866472 (https://phabricator.wikimedia.org/T324801) (owner: 10Ladsgroup)
[10:30:01] <wikibugs>	 (03CR) 10JMeybohm: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[10:34:38] <logmsgbot>	 !log jmm@cumin2002 START - Cookbook sre.ganeti.addnode for new host ganeti5006.eqsin.wmnet to cluster eqsin and group 1
[10:36:23] <logmsgbot>	 !log jmm@cumin2002 END (FAIL) - Cookbook sre.ganeti.addnode (exit_code=99) for new host ganeti5006.eqsin.wmnet to cluster eqsin and group 1
[10:37:01] <wikibugs>	 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10MatthewVernon)
[10:41:23] <wikibugs>	 (03Merged) 10jenkins-bot: Followup to 5cb38845: Don't drop revid info [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866472 (https://phabricator.wikimedia.org/T324801) (owner: 10Ladsgroup)
[10:48:56] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [core] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866472 (https://phabricator.wikimedia.org/T324801) (owner: 10Ladsgroup)
[10:49:22] <logmsgbot>	 !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:866472|Followup to 5cb38845: Don't drop revid info (T324801)]]
[10:49:28] <stashbot>	 T324801: REST API serving content of current revision for old revisions - https://phabricator.wikimedia.org/T324801
[10:51:16] <logmsgbot>	 !log ladsgroup@deploy1002 ladsgroup and ladsgroup: Backport for [[gerrit:866472|Followup to 5cb38845: Don't drop revid info (T324801)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet
[11:00:48] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be2004.codfw.wmnet with OS bullseye
[11:00:52] <wikibugs>	 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be2004.codfw.wmnet with OS bullseye
[11:01:25] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.dns.netbox
[11:02:21] <logmsgbot>	 !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:866472|Followup to 5cb38845: Don't drop revid info (T324801)]] (duration: 12m 59s)
[11:02:27] <stashbot>	 T324801: REST API serving content of current revision for old revisions - https://phabricator.wikimedia.org/T324801
[11:03:55] <logmsgbot>	 !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@17b9319] (codfw): codfw: Enable mirroring for 25% of the traffic
[11:06:27] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "thanks; lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro)
[11:06:27] <logmsgbot>	 !log cgoubert@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Decommissioning netmon2001 - cgoubert@cumin1001"
[11:09:04] <logmsgbot>	 !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@17b9319] (codfw): codfw: Enable mirroring for 25% of the traffic (duration: 05m 08s)
[11:10:33] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Decommissioning netmon2001 - cgoubert@cumin1001"
[11:10:33] <logmsgbot>	 !log cgoubert@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[11:16:45] <icinga-wm>	 RECOVERY - Check systemd state on mw1358 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:17:46] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be2004.codfw.wmnet with reason: host reimage
[11:18:55] <icinga-wm>	 RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
[11:20:09] <wikibugs>	 (03PS1) 10AikoChou: ml-services: update revertrisk docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/866569 (https://phabricator.wikimedia.org/T323023)
[11:20:30] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be2004.codfw.wmnet with reason: host reimage
[11:29:04] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good (we'll also need to add python-django-rq to the Debian deps separately)" [software/bitu] - 10https://gerrit.wikimedia.org/r/853290 (owner: 10Slyngshede)
[11:35:52] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be2004.codfw.wmnet with OS bullseye
[11:35:58] <wikibugs>	 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be2004.codfw.wmnet with OS bullseye completed: - thanos-be2004 (**PASS**)   - Downtimed on Icinga/Alertmanager...
[11:40:33] <logmsgbot>	 !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@d1bd7dc] (codfw): Enable geopoints on production
[11:41:33] <logmsgbot>	 !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@d1bd7dc] (codfw): Enable geopoints on production (duration: 01m 00s)
[11:44:31] <wikibugs>	 (03CR) 10Muehlenhoff: "Looks good, some comments inline" [software/bitu] - 10https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410) (owner: 10Slyngshede)
[11:46:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] remote syslog: allow hiera config of rsyslog TLS CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott)
[11:54:30] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, one nit inline" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/865069 (owner: 10Slyngshede)
[11:55:48] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/864729 (owner: 10Volans)
[11:57:35] <wikibugs>	 (03PS1) 10Muehlenhoff: Make ganeti5007 a Ganeti node [puppet] - 10https://gerrit.wikimedia.org/r/866572 (https://phabricator.wikimedia.org/T324610)
[11:58:27] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/tegola-vector-tiles: sync
[11:58:44] <logmsgbot>	 !log jgiannelos@deploy1002 helmfile [eqiad] DONE helmfile.d/services/tegola-vector-tiles: sync
[12:00:03] <icinga-wm>	 RECOVERY - Check systemd state on maps2009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:02:48] <wikibugs>	 10SRE, 10ops-eqsin, 10Infrastructure-Foundations, 10Patch-For-Review: ganeti500[567] implementation tracking - https://phabricator.wikimedia.org/T324610 (10MoritzMuehlenhoff)
[12:02:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/865728 (owner: 10Volans)
[12:05:31] <icinga-wm>	 PROBLEM - Check systemd state on maps2009 is CRITICAL: CRITICAL - degraded: The following units failed: planet_sync_tile_generation-gis.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[12:08:44] <wikibugs>	 (03PS7) 10Slyngshede: Bitu IDM, initial checkin [software/bitu] - 10https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410)
[12:09:04] <wikibugs>	 (03CR) 10Slyngshede: Bitu IDM, initial checkin (036 comments) [software/bitu] - 10https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410) (owner: 10Slyngshede)
[12:12:22] <logmsgbot>	 !log jgiannelos@deploy1002 Started deploy [kartotherian/deploy@6b70e03] (codfw): Reduce mirrored traffic to 5%
[12:13:05] <wikibugs>	 (03PS1) 10Reedy: CommonSettings.php: Mark REL1_39 as Default Snapshot [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866574 (https://phabricator.wikimedia.org/T324808)
[12:13:07] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] profile::cumin: use bool2str to simplify code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865728 (owner: 10Volans)
[12:13:09] <wikibugs>	 (03PS3) 10Slyngshede: Add RQ support to Django [software/bitu] - 10https://gerrit.wikimedia.org/r/853290
[12:14:01] <logmsgbot>	 !log jgiannelos@deploy1002 Finished deploy [kartotherian/deploy@6b70e03] (codfw): Reduce mirrored traffic to 5% (duration: 01m 39s)
[12:14:09] <wikibugs>	 (03CR) 10Muehlenhoff: cumin: add an audit report for insetup servers (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/864729 (owner: 10Volans)
[12:15:23] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[12:15:30] <wikibugs>	 (03PS2) 10Slyngshede: Version bump. Go to version 0.0.2. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/865069
[12:15:46] <wikibugs>	 (03CR) 10Slyngshede: Version bump. Go to version 0.0.2. (031 comment) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/865069 (owner: 10Slyngshede)
[12:16:57] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Ship it" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/865069 (owner: 10Slyngshede)
[12:17:11] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[12:19:50] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [puppet] - 10https://gerrit.wikimedia.org/r/760619 (https://phabricator.wikimedia.org/T298246) (owner: 10Hnowlan)
[12:20:49] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[12:21:00] <wikibugs>	 (03PS8) 10Slyngshede: Bitu IDM, initial checkin [software/bitu] - 10https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410)
[12:24:27] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[12:28:05] <wikibugs>	 (03PS12) 10Cathal Mooney: Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358)
[12:28:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney)
[12:29:24] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] rsyslog: add support for openssl netstream driver [puppet] - 10https://gerrit.wikimedia.org/r/865731 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan)
[12:31:20] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] Version bump. Go to version 0.0.2. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/865069 (owner: 10Slyngshede)
[12:32:33] <wikibugs>	 (03Merged) 10jenkins-bot: Version bump. Go to version 0.0.2. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/865069 (owner: 10Slyngshede)
[12:32:56] <wikibugs>	 (03PS13) 10Cathal Mooney: Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358)
[12:33:59] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM, much simpler than expected 😊" [puppet] - 10https://gerrit.wikimedia.org/r/865075 (owner: 10JMeybohm)
[12:35:46] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney)
[12:36:10] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM, i think we still need to think about how we fix the monitoring problem but we can tackle that later" [puppet] - 10https://gerrit.wikimedia.org/r/865591 (owner: 10JMeybohm)
[12:36:17] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.reimage for host thanos-be1001.eqiad.wmnet with OS bullseye
[12:36:21] <wikibugs>	 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mvernon@cumin2002 for host thanos-be1001.eqiad.wmnet with OS bullseye
[12:36:43] <wikibugs>	 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10MatthewVernon)
[12:36:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[12:36:52] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] pki: Add intermediates for wikikube and wikikube staging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865591 (owner: 10JMeybohm)
[12:38:26] <wikibugs>	 (03PS14) 10Cathal Mooney: Example strategy for marking DSCP with ferm and puppet integration [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358)
[12:39:50] <wikibugs>	 (03PS1) 10Slyngshede: deb: align package naming. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582
[12:47:39] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 04-1] "Marking -1, this is not intended to be merged in current version, just some examples." [puppet] - 10https://gerrit.wikimedia.org/r/865108 (https://phabricator.wikimedia.org/T316358) (owner: 10Cathal Mooney)
[12:50:05] <wikibugs>	 (03CR) 10Muehlenhoff: deb: align package naming. (031 comment) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582 (owner: 10Slyngshede)
[12:58:06] <hashar>	 good afternoon
[13:02:54] <wikibugs>	 (03PS4) 10Matthias Mullie: Add mediawiki.searchpreview schema [mediawiki-config] - 10https://gerrit.wikimedia.org/r/845518 (https://phabricator.wikimedia.org/T321069)
[13:04:41] <wikibugs>	 (03PS1) 10TrainBranchBot: all wikis to 1.40.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866586 (https://phabricator.wikimedia.org/T320518)
[13:04:43] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] all wikis to 1.40.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866586 (https://phabricator.wikimedia.org/T320518) (owner: 10TrainBranchBot)
[13:04:46] <hashar>	 running da train
[13:04:50] <claime>	 Hey hashar 
[13:04:53] <claime>	 choo choo
[13:05:24] <wikibugs>	 (03Merged) 10jenkins-bot: all wikis to 1.40.0-wmf.13 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866586 (https://phabricator.wikimedia.org/T320518) (owner: 10TrainBranchBot)
[13:05:41] <wikibugs>	 (03PS1) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2053 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/866587 (https://phabricator.wikimedia.org/T293012)
[13:05:47] <hashar>	 the thing is the broken case reported on T324801 is already fixed
[13:05:49] <stashbot>	 T324801: REST API serving content of current revision for old revisions - https://phabricator.wikimedia.org/T324801
[13:05:54] <hashar>	  potentially by the backport Amir has done this morning to wmf.13
[13:06:02] <claime>	 yeah it was backported this morning iiuc
[13:06:02] <hashar>	 but wikipedia are still on wmf.12
[13:06:05] <hashar>	 so well I don't know
[13:06:59] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[13:07:19] <hashar>	 oh joy
[13:07:49] <hashar>	 the grafana link shows an empty graph
[13:08:10] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[13:08:15] <hashar>	 cause the `var-method` wasn't set
[13:08:22] <logmsgbot>	 !log mvernon@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on thanos-be1001.eqiad.wmnet with reason: host reimage
[13:08:28] <hashar>	 most probably something is cutting the messages when they are too long
[13:09:23] <hashar>	 the graph looks all fine, no idea why the alert has triggered
[13:09:43] <claime>	 It's been flapping the past few days but I haven't managed to figure out why
[13:11:16] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::mcrouter_wancache: Add mc2053 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/866587 (https://phabricator.wikimedia.org/T293012) (owner: 10Effie Mouzeli)
[13:11:48] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on thanos-be1001.eqiad.wmnet with reason: host reimage
[13:13:16] <logmsgbot>	 !log hashar@deploy1002 rebuilt and synchronized wikiversions files: all wikis to 1.40.0-wmf.13  refs T320518
[13:13:22] <stashbot>	 T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518
[13:13:58] <hashar>	 et voilà!
[13:14:32] <claime>	 Now I get to see if my delay before alerting for opcache health works as intended :p
[13:15:50] <wikibugs>	 (03PS1) 10Marostegui: Revert "db1206: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/866473
[13:16:00] <wikibugs>	 (03PS4) 10Slyngshede: Add RQ support to Django [software/bitu] - 10https://gerrit.wikimedia.org/r/853290
[13:18:18] <hashar>	 MediaWiki logs look quiet
[13:18:53] <wikibugs>	 (03PS2) 10Slyngshede: deb: align package naming. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582
[13:22:29] <effie>	 hashar: I reckon you are deploying to k8s as well?
[13:23:22] <claime>	 effie: Nope, except mw-debug iirc
[13:23:40] <effie>	 why is that ?
[13:24:33] <Amir1>	 I've caused a repl breakage in much of s3
[13:24:34] <Amir1>	 on it
[13:24:35] <claime>	 No I lied
[13:24:52] <claime>	 I thought we hadn't flipped the switch for it, but we did
[13:25:11] <effie>	 I need to apply some changes and I don't want to be in hashar's way :)
[13:25:30] <claime>	 It's all done now
[13:25:53] <claime>	 cgoubert@deploy1002:/srv/deployment-charts/helmfile.d/services/mw-web$ helmfile -e eqiad status 2> /dev/null | grep LAST
[13:25:54] <hashar>	 effie: yeah it is all open please do ;)
[13:25:55] <claime>	 LAST DEPLOYED: Fri Dec  9 13:09:22 2022
[13:25:55] <Amir1>	 claime: we might get a page right now
[13:25:57] <claime>	 LAST DEPLOYED: Fri Dec  9 13:07:57 2022
[13:26:01] <claime>	 Amir1: ack
[13:26:05] <Amir1>	 I hope it finishes asap
[13:26:07] <claime>	 Do you need me for something
[13:26:09] <claime>	 ?
[13:26:15] <hashar>	 I think the k8s deployment is automagically handled by scap now
[13:26:21] <claime>	 hashar: yeah it is
[13:26:21] <Amir1>	 emotional support
[13:26:22] <hashar>	 at least it gave me a bunch of lines about executing helm
[13:26:27] <claime>	 Amir1: *hug*
[13:26:33] <Amir1>	 <3 
[13:26:39] <hashar>	 which I am more than happy to ignore / not understand as long as those lines are green / OKish
[13:26:48] <claime>	 heh fair enough
[13:26:53] <Amir1>	 I got lucky I think
[13:27:26] <Amir1>	 it made a bit of s3 read-only for a bit due to excessive lag
[13:27:35] <Amir1>	 but for a minute only
[13:28:40] <logmsgbot>	 !log mvernon@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host thanos-be1001.eqiad.wmnet with OS bullseye
[13:28:43] <wikibugs>	 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mvernon@cumin2002 for host thanos-be1001.eqiad.wmnet with OS bullseye completed: - thanos-be1001 (**PASS**)   - Downtimed on Icinga/Alertmanager...
[13:30:48] <claime>	 Amir1: The alarm is "MariaDB sustained replica lag on s3" right?
[13:31:03] <Amir1>	 there should be multiple 
[13:31:08] <claime>	 Yeah
[13:31:09] <Amir1>	 but yeah, that's one of them
[13:32:52] <claime>	 Yeah they seem to be going away without getting to the point of alerting
[13:33:52] <jynus>	 what was the cause?
[13:34:07] <jynus>	 (keeping an eye in case backups are needed)
[13:34:19] <claime>	 (thanks for looking out for us <3)
[13:35:01] <wikibugs>	 10SRE-swift-storage: Run Thanos backend on Bullseye - https://phabricator.wikimedia.org/T288937 (10MatthewVernon)
[13:35:18] <jynus>	 claime: please never hesitate to ask for help, even if not sure yet if a recovery is needed
[13:35:42] <jynus>	 oh, you meant oncall, sorry
[13:35:56] <claime>	 jynus: No no I meant you, I'm on call :P
[13:36:12] <claime>	 And I won't hesitate, thanks :)
[13:36:58] <jynus>	 the thing is, it may take some time to proceed with a recovery (not everything is fully automated, and it may never be), so preparing the nuclear weapon can take some time, even before launch!
[13:37:45] <claime>	 *nods* understood
[13:37:48] <jynus>	 Amir1: a schema or grant change, maybe?
[13:38:22] <Amir1>	 jynus: nothing major, I was running a schema change on bewiki to fix flaggedrevs drift, it turned out it was drifting on different hosts
[13:38:39] <jynus>	 yeah, that happens on s3, that is why it was my guess
[13:38:45] <Amir1>	 so the schema change took longer than it should on some hosts, choking the replication
[13:38:49] <jynus>	 it shouldn't, but it does
[13:38:53] <Amir1>	 it wasn't all thankfully
[13:39:06] <jynus>	 ah, so it didn't break? it was "just" lag
[13:39:11] <Amir1>	 yeah
[13:39:16] <jynus>	 so much better
[13:39:32] <Amir1>	 the schema change was idempotent 
[13:39:35] <jynus>	 lag spikes happen all the time
[13:39:37] <Amir1>	 noop 
[13:39:57] <jynus>	 yes, even on all hosts
[13:47:55] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] Revert "db1206: Enable notifications" [puppet] - 10https://gerrit.wikimedia.org/r/866473 (owner: 10Marostegui)
[13:48:06] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1206', diff saved to https://phabricator.wikimedia.org/P42662 and previous config saved to /var/cache/conftool/dbconfig/20221209-134806-marostegui.json
[13:56:20] <wikibugs>	 (03CR) 10JMeybohm: pki: Add intermediates for wikikube and wikikube staging (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865591 (owner: 10JMeybohm)
[14:00:00] <wikibugs>	 (03CR) 10Slyngshede: deb: align package naming. (031 comment) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582 (owner: 10Slyngshede)
[14:02:42] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] puppetdb: restart through systemd if service dies [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro)
[14:02:52] <wikibugs>	 (03CR) 10David Caro: [V: 03+1 C: 03+2] "I'll merge on monday" [puppet] - 10https://gerrit.wikimedia.org/r/866552 (https://phabricator.wikimedia.org/T324812) (owner: 10David Caro)
[14:03:08] <wikibugs>	 (03CR) 10Muehlenhoff: deb: align package naming. (031 comment) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582 (owner: 10Slyngshede)
[14:04:57] <wikibugs>	 (03PS3) 10Slyngshede: deb: align package naming. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582
[14:05:05] <wikibugs>	 (03CR) 10Slyngshede: deb: align package naming. (031 comment) [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582 (owner: 10Slyngshede)
[14:06:35] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+1] "Looks good!" [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582 (owner: 10Slyngshede)
[14:06:55] <wikibugs>	 (03CR) 10Slyngshede: [C: 03+2] deb: align package naming. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582 (owner: 10Slyngshede)
[14:08:25] <wikibugs>	 (03Merged) 10jenkins-bot: deb: align package naming. [software/bitu-ldap] - 10https://gerrit.wikimedia.org/r/866582 (owner: 10Slyngshede)
[14:10:15] <wikibugs>	 (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM!" [deployment-charts] - 10https://gerrit.wikimedia.org/r/866569 (https://phabricator.wikimedia.org/T323023) (owner: 10AikoChou)
[14:16:49] <wikibugs>	 (03CR) 10Klausman: [V: 03+2 C: 03+2] ml-services: update revertrisk docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/866569 (https://phabricator.wikimedia.org/T323023) (owner: 10AikoChou)
[14:17:16] <wikibugs>	 (03PS1) 10Jbond: blackbox::check::http: change expiry check value from days to seconds [puppet] - 10https://gerrit.wikimedia.org/r/866594
[14:18:53] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (10Jclark-ctr) 05Open→03Resolved removed power supply and reseated   error has removed
[14:20:00] <wikibugs>	 (03PS1) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2053 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/866595 (https://phabricator.wikimedia.org/T293012)
[14:20:12] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops: hw troubleshooting: PSU failure for restbase1018.eqiad.wmnet - https://phabricator.wikimedia.org/T324572 (10Clement_Goubert) Thanks so much!
[14:21:16] <wikibugs>	 (03PS2) 10Effie Mouzeli: mediawiki::mcrouter_wancache: Add mc2052 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/866595 (https://phabricator.wikimedia.org/T293012)
[14:21:57] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: update revertrisk docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/866569 (https://phabricator.wikimedia.org/T323023) (owner: 10AikoChou)
[14:22:00] <wikibugs>	 (03CR) 10Effie Mouzeli: [C: 03+2] mediawiki::mcrouter_wancache: Add mc2052 to memcached cluster [puppet] - 10https://gerrit.wikimedia.org/r/866595 (https://phabricator.wikimedia.org/T293012) (owner: 10Effie Mouzeli)
[14:22:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:24:16] <wikibugs>	 (03CR) 10Jbond: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/863006 (https://phabricator.wikimedia.org/T311999) (owner: 10Jbond)
[14:28:06] <wikibugs>	 (03PS1) 10Muehlenhoff: buster updates [puppet] - 10https://gerrit.wikimedia.org/r/866597
[14:29:11] <wikibugs>	 (03PS1) 10FNegri: Reinstate innodb_large_prefix on ToolsDB [puppet] - 10https://gerrit.wikimedia.org/r/866598 (https://phabricator.wikimedia.org/T324846)
[14:32:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning: hw troubleshooting: power supply for an-worker1184.eqiad.wmnet - https://phabricator.wikimedia.org/T324559 (10Jclark-ctr) 05Open→03Resolved @BTullis  Reseated power supply2 fault light cleared on rear of server
[14:35:37] <wikibugs>	 (03CR) 10Muehlenhoff: [C: 03+2] buster updates [puppet] - 10https://gerrit.wikimedia.org/r/866597 (owner: 10Muehlenhoff)
[14:37:00] <wikibugs>	 (03PS1) 10AikoChou: ml-services: fix typo for revertrisk docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/866600 (https://phabricator.wikimedia.org/T323023)
[14:38:57] <wikibugs>	 (03CR) 10Klausman: [V: 03+2 C: 03+2] ml-services: fix typo for revertrisk docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/866600 (https://phabricator.wikimedia.org/T323023) (owner: 10AikoChou)
[14:39:41] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Jclark-ctr)
[14:40:54] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10fundraising-tech-ops: Q1:rack/setup/install frbast1002, frmon1002, frpig1002 - https://phabricator.wikimedia.org/T319460 (10Jclark-ctr) @Jgreen  these have been received   is this urgent or could I wait till after fundraising to rack and cable these?
[14:41:29] <icinga-wm>	 RECOVERY - IPMI Sensor Status on an-worker1148 is OK: Sensor Type(s) Temperature, Power_Supply Status: OK https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[14:41:52] <wikibugs>	 (03CR) 10Muehlenhoff: "One further comment inline, looks good otherwise" [software/bitu] - 10https://gerrit.wikimedia.org/r/850465 (https://phabricator.wikimedia.org/T319410) (owner: 10Slyngshede)
[14:44:06] <wikibugs>	 (03Merged) 10jenkins-bot: ml-services: fix typo for revertrisk docker image [deployment-charts] - 10https://gerrit.wikimedia.org/r/866600 (https://phabricator.wikimedia.org/T323023) (owner: 10AikoChou)
[14:47:36] <logmsgbot>	 !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' .
[14:48:58] <wikibugs>	 10ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T323961 (10Jclark-ctr) @Andrew  we will need to preform flee power drain on server
[14:52:30] <wikibugs>	 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Jclark-ctr) @Eevans  this could possibly happen next week is there a day that works best for you?  I am working on another project next week with Papaul  I am not available...
[14:52:51] <wikibugs>	 (03PS1) 10Jbond: cfssl::cert: add documntation and fix linting [puppet] - 10https://gerrit.wikimedia.org/r/866601
[14:52:54] <wikibugs>	 (03PS1) 10Jbond: cfssl::cert: add ability to renew based on a relative value [puppet] - 10https://gerrit.wikimedia.org/r/866602
[14:57:32] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "Wait for @Marostegi's ack though" [puppet] - 10https://gerrit.wikimedia.org/r/866598 (https://phabricator.wikimedia.org/T324846) (owner: 10FNegri)
[14:59:30] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Decommission eqiad cage WiFi - https://phabricator.wikimedia.org/T320962 (10Jclark-ctr) Removed wifi1 from rack and ran decom script
[14:59:37] <wikibugs>	 10SRE, 10ops-eqiad, 10Infrastructure-Foundations, 10netops: Decommission eqiad cage WiFi - https://phabricator.wikimedia.org/T320962 (10Jclark-ctr) 05Open→03Resolved
[15:04:13] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.5645 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[15:06:02] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Hardware): decommission cloudmetrics100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T297444 (10Jclark-ctr)
[15:07:03] <logmsgbot>	 !log jclark@cumin1001 START - Cookbook sre.dns.netbox
[15:07:45] <wikibugs>	 10SRE, 10ops-eqiad, 10decommission-hardware, 10cloud-services-team (Hardware): decommission cloudmetrics100[1-2].eqiad.wmnet - https://phabricator.wikimedia.org/T297444 (10Jclark-ctr) 05Open→03Resolved
[15:08:13] <logmsgbot>	 !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[15:09:47] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.09677 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[15:10:14] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cfssl::cert: add documntation and fix linting [puppet] - 10https://gerrit.wikimedia.org/r/866601 (owner: 10Jbond)
[15:10:30] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Persistence (work done), 10Phabricator, and 3 others: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (10Jclark-ctr)
[15:10:53] <wikibugs>	 10SRE, 10ops-eqiad, 10Data-Persistence (work done), 10Phabricator, and 3 others: decommission phab1001.eqiad.wmnet - https://phabricator.wikimedia.org/T323418 (10Jclark-ctr) 05Open→03Resolved
[15:13:08] <wikibugs>	 10ops-codfw: Port with no description on access switch - https://phabricator.wikimedia.org/T324752 (10Papaul) @ayounsi @cmooney this interface is disable and i keep getting this task everything  i close the task can you please check thanks.
[15:14:45] <wikibugs>	 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Eevans) >>! In T307035#8456972, @Jclark-ctr wrote: > @Eevans  this could possibly happen next week is there a day that works best for you?  I am working on another project n...
[15:16:09] <wikibugs>	 (03CR) 10Marostegui: [C: 03+1] "but toolsdb needs to be upgraded asap" [puppet] - 10https://gerrit.wikimedia.org/r/866598 (https://phabricator.wikimedia.org/T324846) (owner: 10FNegri)
[15:32:15] <wikibugs>	 (03CR) 10Hashar: "I have restarted Gerrit twice and confirmed it has fixed the issue. The H2 database files have been compacted successfully T323754#8454316" [puppet] - 10https://gerrit.wikimedia.org/r/865023 (https://phabricator.wikimedia.org/T323754) (owner: 10Hashar)
[15:35:03] <icinga-wm>	 PROBLEM - IPMI Sensor Status on db1186 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:37:08] <marostegui>	 ^ I will get a task for that
[15:37:19] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 04-1] Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große)
[15:38:57] <wikibugs>	 10ops-eqiad, 10DBA, 10DC-Ops: db1186 power supplies not redundant - https://phabricator.wikimedia.org/T324858 (10Marostegui)
[15:39:07] <wikibugs>	 10ops-eqiad, 10DBA, 10DC-Ops: db1186 power supplies not redundant - https://phabricator.wikimedia.org/T324858 (10Marostegui) p:05Triage→03Medium
[15:40:16] <icinga-wm>	 ACKNOWLEDGEMENT - IPMI Sensor Status on db1186 is CRITICAL: Sensor Type(s) Temperature, Power_Supply Status: Critical [Status = Critical, PS Redundancy = Critical] Marostegui https://phabricator.wikimedia.org/T324858 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures
[15:40:22] <wikibugs>	 (03CR) 10Michael Große: Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große)
[15:44:51] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] P:gitlab::runner: Add environment variable for kokkuri [puppet] - 10https://gerrit.wikimedia.org/r/866520 (owner: 10Dduvall)
[15:47:57] <wikibugs>	 (03PS3) 10Michael Große: Wikidata: don't show Vector search thumbnails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093)
[15:50:04] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Wikidata: don't show Vector search thumbnails (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große)
[15:50:24] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große)
[15:50:31] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.4194 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[15:51:55] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): "Hm, something isn’t right here – why does the diffConfig build now say that there are no changes?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große)
[15:56:58] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große)
[15:58:19] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große)
[15:58:37] <wikibugs>	 (03CR) 10Michael Große: Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große)
[15:59:47] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: (C)0.3 gt (W)0.1 gt 0.03226 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[16:00:56] <wikibugs>	 (03CR) 10Michael Große: Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große)
[16:04:12] <wikibugs>	 (03PS1) 10Muehlenhoff: deployment servers: Enable profile::auto_restarts::service for Envoy [puppet] - 10https://gerrit.wikimedia.org/r/866607 (https://phabricator.wikimedia.org/T135991)
[16:06:14] <wikibugs>	 (03PS2) 10Jbond: cfssl::cert: add ability to renew based on a relative value [puppet] - 10https://gerrit.wikimedia.org/r/866602
[16:08:14] <wikibugs>	 10ops-eqiad, 10DBA, 10DC-Ops: db1186 power supplies not redundant - https://phabricator.wikimedia.org/T324858 (10Jclark-ctr) a:03Jclark-ctr
[16:08:27] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große)
[16:11:13] <wikibugs>	 (03PS1) 10Jbond: cli: handle a blank change number gracefully [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/866609
[16:12:21] <wikibugs>	 (03PS1) 10Jbond: puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/866610
[16:16:15] <wikibugs>	 (03PS2) 10Jbond: puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/866610
[16:16:22] <wikibugs>	 (03PS2) 10Jbond: cli: handle a blank change number gracefully [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/866609
[16:16:31] <wikibugs>	 (03PS2) 10Tsevener: Add event stream config for ios.talk_page_interaction [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866489 (https://phabricator.wikimedia.org/T324340)
[16:17:49] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): Wikidata: don't show Vector search thumbnails (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große)
[16:18:01] <wikibugs>	 (03PS4) 10Michael Große: Wikidata: don't show Vector search thumbnails [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093)
[16:18:21] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.3871 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[16:19:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] cli: handle a blank change number gracefully [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/866609 (owner: 10Jbond)
[16:21:15] <wikibugs>	 (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "diffConfig looks good now \o/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/848421 (https://phabricator.wikimedia.org/T316093) (owner: 10Michael Große)
[16:21:41] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-stretch1001.eqiad.wmnet with OS bullseye
[16:21:49] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye
[16:21:49] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kafka-stretch1001.eqiad.wmnet with OS bullseye
[16:21:55] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-stretch1001.eqiad.wmnet with OS bullseye...
[16:23:53] <icinga-wm>	 PROBLEM - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is CRITICAL: 0.6613 gt 0.3 https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[16:24:38] <wikibugs>	 (03CR) 10David Caro: [C: 03+1] "LGTM once the tests pass :)" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/866609 (owner: 10Jbond)
[16:27:50] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host kafka-stretch1002.eqiad.wmnet with OS bullseye
[16:27:51] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Cmjohnson) @papaul When I try to image these servers, the process fails immediately. This is the error I receive. Any ideas on what is wrong?...
[16:27:57] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host kafka-stretch1002.eqiad.wmnet with OS bullseye
[16:30:15] <wikibugs>	 (03PS1) 10Ottomata: eventgate-analytics: bump replicas from 20 to 30 [deployment-charts] - 10https://gerrit.wikimedia.org/r/866612
[16:31:00] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] eventgate-analytics: bump replicas from 20 to 30 [deployment-charts] - 10https://gerrit.wikimedia.org/r/866612 (owner: 10Ottomata)
[16:31:17] <jinxer-wm>	 (PHPFPMTooBusy) firing: Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:31:56] <wikibugs>	 (03CR) 10Jforrester: [C: 03+1] "Oops, knew I forgot something. Thanks!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866574 (https://phabricator.wikimedia.org/T324808) (owner: 10Reedy)
[16:32:13] <ottomata>	 Working on ^^ in #wikimedia-serviceops
[16:32:21] <wikibugs>	 (03CR) 10Ottomata: [V: 03+2] eventgate-analytics: bump replicas from 20 to 30 [deployment-charts] - 10https://gerrit.wikimedia.org/r/866612 (owner: 10Ottomata)
[16:32:23] <claime>	 page acknowledged
[16:33:54] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] START helmfile.d/services/eventgate-analytics: apply
[16:34:35] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[16:34:58] <logmsgbot>	 !log otto@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventgate-analytics: apply
[16:35:16] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] START helmfile.d/services/eventgate-analytics: apply
[16:36:17] <jinxer-wm>	 (PHPFPMTooBusy) resolved: Not enough idle php7.4-fpm.service workers for Mediawiki api_appserver at eqiad #page - https://bit.ly/wmf-fpmsat - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=54&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DPHPFPMTooBusy
[16:36:28] <logmsgbot>	 !log otto@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventgate-analytics: apply
[16:36:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[16:36:55] <icinga-wm>	 RECOVERY - Some MediaWiki servers are running out of idle PHP-FPM workers in api_appserver at eqiad on alert1001 is OK: All metrics within thresholds. https://bit.ly/wmf-fpmsat https://grafana.wikimedia.org/d/fRn9VEPMz/application-servers-use-dashboard-wip?orgId=1
[16:38:25] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-stretch1001 - cmjohnson@cumin1001"
[16:39:28] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-stretch1001 - cmjohnson@cumin1001"
[16:39:28] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[16:39:54] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kafka-stretch1002.eqiad.wmnet with reason: host reimage
[16:42:46] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] START helmfile.d/services/eventgate-analytics: apply
[16:42:54] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kafka-stretch1002.eqiad.wmnet with reason: host reimage
[16:43:08] <logmsgbot>	 !log otto@deploy1002 helmfile [staging] DONE helmfile.d/services/eventgate-analytics: apply
[16:44:05] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Papaul) @Cmjohnson try to delete the kafka-stretch1001.conf on install1003 and try again an let me know
[16:56:14] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox
[16:57:24] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmjohnson@cumin1001"
[16:58:20] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-stretch1001 - cmjohnson@cumin1001"
[16:58:47] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - cmjohnson@cumin1001"
[16:58:52] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kafka-stretch1002.eqiad.wmnet with OS bullseye
[16:58:58] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host kafka-stretch1002.eqiad.wmnet with OS bullseye...
[16:59:19] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: kafka-stretch1001 - cmjohnson@cumin1001"
[16:59:20] <logmsgbot>	 !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[17:00:07] <logmsgbot>	 !log cmjohnson@cumin1001 START - Cookbook sre.hosts.provision for host kafka-stretch1001.mgmt.eqiad.wmnet with reboot policy FORCED
[17:02:50] <logmsgbot>	 !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kafka-stretch1001.mgmt.eqiad.wmnet with reboot policy FORCED
[17:03:31] <claime>	 !log eventgate-analytics bumped to 30 replicas to absorb increased load - T320518
[17:03:32] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Logstash access for contractor Wangombe - https://phabricator.wikimedia.org/T318209 (10jhathaway) @Wangombe apologies for not noticing this earlier but your developer account or wikitech account is linked to your personal email address. Would you kindly change that to your @wikim...
[17:03:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:03:34] <stashbot>	 T320518: 1.40.0-wmf.13 deployment blockers - https://phabricator.wikimedia.org/T320518
[17:09:20] <wikibugs>	 (03PS1) 10JHathaway: Add Stephanie Delbecque to the wmf group [puppet] - 10https://gerrit.wikimedia.org/r/866620 (https://phabricator.wikimedia.org/T324753)
[17:09:42] <wikibugs>	 (03PS3) 10Jbond: cli: handle a blank change number gracefully [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/866609
[17:10:06] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Access to wmf for SDelbecque - https://phabricator.wikimedia.org/T324753 (10jhathaway) 05Open→03Resolved a:03jhathaway done!
[17:10:38] <wikibugs>	 (03PS3) 10Jbond: puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/866610
[17:13:10] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] cli: handle a blank change number gracefully [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/866609 (owner: 10Jbond)
[17:13:17] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] puppet_compiler: bump version [puppet] - 10https://gerrit.wikimedia.org/r/866610 (owner: 10Jbond)
[17:14:18] <wikibugs>	 (03PS3) 10Hashar: gerrit: script to report on git gc durations [puppet] - 10https://gerrit.wikimedia.org/r/856601 (https://phabricator.wikimedia.org/T237807)
[17:15:24] <wikibugs>	 (03CR) 10Hashar: "I have updated the script shebang to point to /usr/bin/python3.  I have tested the script on gerrit1001." [puppet] - 10https://gerrit.wikimedia.org/r/856601 (https://phabricator.wikimedia.org/T237807) (owner: 10Hashar)
[17:19:16] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38673/console" [puppet] - 10https://gerrit.wikimedia.org/r/866602 (owner: 10Jbond)
[17:19:58] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/38674/console" [puppet] - 10https://gerrit.wikimedia.org/r/866598 (https://phabricator.wikimedia.org/T324846) (owner: 10FNegri)
[17:44:03] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx2001.wikimedia.org with reason: Moar Disk
[17:44:16] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx2001.wikimedia.org with reason: Moar Disk
[17:59:37] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.30.2" for 563 hosts
[18:01:55] <wikibugs>	 (03CR) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott)
[18:02:11] <wikibugs>	 (03CR) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott)
[18:03:05] <wikibugs>	 (03PS13) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717)
[18:03:07] <wikibugs>	 (03PS7) 10Andrew Bogott: remote syslog: allow hiera config of rsyslog TLS CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717)
[18:04:16] <logmsgbot>	 !log jnuche@deploy1002 Installing scap version "4.30.2" for 562 hosts
[18:04:40] <logmsgbot>	 !log jnuche@deploy1002 Installation of scap version "4.30.2" completed for 562 hosts
[18:06:06] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+1] rsyslog: add support for openssl netstream driver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865731 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan)
[18:14:55] <wikibugs>	 10SRE, 10Traffic, 10Patch-For-Review: Upgrade Traffic hosts to bullseye - https://phabricator.wikimedia.org/T321309 (10ssingh)
[18:16:45] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx2001.wikimedia.org with reason: Moar Disk 2!
[18:16:48] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx2001.wikimedia.org with reason: Moar Disk 2!
[18:22:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[18:22:58] <logmsgbot>	 !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)
[18:23:34] <wikibugs>	 (03PS1) 10Andrew Bogott: Added some comments about where/how cloud hiera settings are applied [puppet] - 10https://gerrit.wikimedia.org/r/866625
[18:24:26] <wikibugs>	 (03CR) 10Andrew Bogott: "Adding the two of you as reviewers to this extremely trivial patch because I never cease to be surprised at how hiera lookup works." [puppet] - 10https://gerrit.wikimedia.org/r/866625 (owner: 10Andrew Bogott)
[18:27:56] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:32:44] <wikibugs>	 (03PS1) 10Daniel Kinzler: hewiki: set VisualEditor to direct mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866627 (https://phabricator.wikimedia.org/T320529)
[18:32:52] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] hewiki: set VisualEditor to direct mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866627 (https://phabricator.wikimedia.org/T320529) (owner: 10Daniel Kinzler)
[18:39:04] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[18:41:07] <wikibugs>	 (03PS14) 10Andrew Bogott: rsyslog: allow specifying a hiera-defined certfile [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717)
[18:41:09] <wikibugs>	 (03PS8) 10Andrew Bogott: remote syslog: allow hiera config of rsyslog TLS CA [puppet] - 10https://gerrit.wikimedia.org/r/865184 (https://phabricator.wikimedia.org/T127717)
[18:41:11] <wikibugs>	 (03PS1) 10Andrew Bogott: Turn on central auth logging for all eqiad1 VMs [puppet] - 10https://gerrit.wikimedia.org/r/866628 (https://phabricator.wikimedia.org/T127717)
[18:46:39] <wikibugs>	 (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/output/866628/38675/tools-sgebastion-10.tools.eqiad1.wikimedia.cloud/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/866628 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott)
[18:52:41] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (10KFrancis) @jhathaway The NDA has been signed.  Please proceed with the access request.  Thanks!
[18:54:20] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, and 2 others: Q1:rack/setup/install kafka-stretch100[12] - https://phabricator.wikimedia.org/T314156 (10Cmjohnson) KS1002 was installed without an issue, I started over with KS1001 but the mgmt IP address changed and the provision script didn't wor...
[19:09:32] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Access to wmde for Muhammad Jaziraly - https://phabricator.wikimedia.org/T324477 (10jhathaway) 05Open→03Resolved a:03jhathaway @Muhammad_Yasser_Jazirahly_WMDE groups added, enjoy!
[19:13:24] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:13:38] <wikibugs>	 (03PS11) 10Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519)
[19:14:14] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:15:12] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 5.756 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:15:34] <wikibugs>	 10SRE, 10Data-Engineering-Planning, 10WMF-Communications: LDAP access for Sondes to access Matomo - https://phabricator.wikimedia.org/T324696 (10EdErhart-WMF)
[19:16:39] <wikibugs>	 (03PS2) 10Ottomata: [WIP] - flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576)
[19:17:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] - flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[19:17:37] <wikibugs>	 (03PS10) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576)
[19:19:42] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49121 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[19:20:07] <wikibugs>	 (03PS3) 10Ottomata: [WIP] - flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576)
[19:20:56] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] - flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[19:23:03] <wikibugs>	 10SRE, 10Data-Engineering-Planning, 10WMF-Communications: LDAP access for Sondes to access Matomo - https://phabricator.wikimedia.org/T324696 (10EdErhart-WMF) Tagging @Marostegui and @jcrespo per their recent involvement with LDAP access requests
[19:34:10] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: Moar Disk!
[19:34:23] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: Moar Disk!
[19:35:54] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[19:39:06] <wikibugs>	 (03CR) 10Herron: [C: 03+1] netmon: Remove the netmon1002 instance as passive node [puppet] - 10https://gerrit.wikimedia.org/r/866526 (https://phabricator.wikimedia.org/T322321) (owner: 10Andrea Denisse)
[19:39:38] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[19:42:19] <wikibugs>	 (03PS1) 10Eevans: Configure new cassandra-dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113)
[19:46:14] <wikibugs>	 (03PS2) 10Eevans: Configure new cassandra-dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113)
[19:47:06] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[19:47:53] <wikibugs>	 (03PS3) 10Eevans: Configure new cassandra-dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113)
[19:48:56] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[19:49:56] <logmsgbot>	 !log jhathaway@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on mx1001.wikimedia.org with reason: Moar Disk!
[19:49:58] <logmsgbot>	 !log jhathaway@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on mx1001.wikimedia.org with reason: Moar Disk!
[19:50:29] <wikibugs>	 (03CR) 10Herron: librenms: Increase the TTL for LibreNMS (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/866496 (https://phabricator.wikimedia.org/T322695) (owner: 10Andrea Denisse)
[19:55:10] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Mail: MX: increasing disk space - https://phabricator.wikimedia.org/T305567 (10jhathaway) 05Open→03Resolved bumped both MXes, `mx{1001,2001}.wikimedia.org` to 50G root partitions
[20:05:22] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:06:26] <wikibugs>	 (03PS4) 10Ottomata: flink-kubernetes-operator - Initial commit of upstream helm chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/865100 (https://phabricator.wikimedia.org/T324576)
[20:06:28] <wikibugs>	 (03PS11) 10Ottomata: flink-kubernetes-operator - modify for WMF and add an admin_ng helmfile [deployment-charts] - 10https://gerrit.wikimedia.org/r/865158 (https://phabricator.wikimedia.org/T324576)
[20:06:31] <wikibugs>	 (03PS4) 10Ottomata: [WIP] - flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576)
[20:07:23] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] [WIP] - flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[20:11:12] <icinga-wm>	 PROBLEM - High average POST latency for mw requests on api_appserver in codfw on alert1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www-7.4.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method
[20:11:23] <jinxer-wm>	 (ThanosQueryHttpRequestQueryRangeErrorRateHigh) firing: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh
[20:12:23] <jinxer-wm>	 (ThanosQueryRangeLatencyHigh) firing: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh
[20:12:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[20:13:02] <icinga-wm>	 RECOVERY - High average POST latency for mw requests on api_appserver in codfw on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=codfw+prometheus/ops&var-cluster=api_appserver&var-method=POST
[20:15:11] <wikibugs>	 (03PS1) 10Andrew Bogott: puppetmasters: cache cleanup [puppet] - 10https://gerrit.wikimedia.org/r/866644
[20:15:29] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] puppetmasters: cache cleanup [puppet] - 10https://gerrit.wikimedia.org/r/866644 (owner: 10Andrew Bogott)
[20:16:23] <jinxer-wm>	 (ThanosQueryHttpRequestQueryRangeErrorRateHigh) resolved: Thanos Query is failing to handle requests. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryHttpRequestQueryRangeErrorRateHigh
[20:16:30] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[20:17:23] <jinxer-wm>	 (ThanosQueryRangeLatencyHigh) resolved: Thanos Query has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/af36c91291a603f1d9fbdabdd127ac4a/thanos-query - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryRangeLatencyHigh
[20:17:48] <jinxer-wm>	 (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh
[20:19:23] <wikibugs>	 (03PS2) 10Andrew Bogott: puppetmasters: cache cleanup [puppet] - 10https://gerrit.wikimedia.org/r/866644
[20:19:52] <wikibugs>	 (03PS12) 10Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519)
[20:20:09] <wikibugs>	 (03PS13) 10Ottomata: flink and flink-kubernetes-operator image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/858356 (https://phabricator.wikimedia.org/T316519)
[20:21:08] <wikibugs>	 10SRE, 10Data-Engineering-Planning, 10WMF-Communications: LDAP access for Sondes to access Matomo - https://phabricator.wikimedia.org/T324696 (10jcrespo) Hi, @EdErhart-WMF . There is no need to tag anyone- SRE has a clinic duty procedure in which someone on rotation attends LDAP requests every week. I sugges...
[20:26:21] <wikibugs>	 (03PS3) 10Andrew Bogott: puppetmasters: cache cleanup [puppet] - 10https://gerrit.wikimedia.org/r/866644
[20:28:07] <wikibugs>	 (03PS1) 10Eevans: keys & certs for (new) cassandra-dev cluster [labs/private] - 10https://gerrit.wikimedia.org/r/866646 (https://phabricator.wikimedia.org/T324113)
[20:29:57] <wikibugs>	 (03PS4) 10Andrew Bogott: puppetmasters: cache cleanup [puppet] - 10https://gerrit.wikimedia.org/r/866644
[20:32:58] <wikibugs>	 (03CR) 10Andrew Bogott: "https://puppet-compiler.wmflabs.org/output/866644/38681/puppetmaster1001.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/866644 (owner: 10Andrew Bogott)
[20:33:03] <wikibugs>	 (03CR) 10Eevans: [V: 03+2 C: 03+2] keys & certs for (new) cassandra-dev cluster [labs/private] - 10https://gerrit.wikimedia.org/r/866646 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans)
[20:33:11] <wikibugs>	 (03PS2) 10Sbailey: enable migrate namespace function on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866518 (https://phabricator.wikimedia.org/T299612)
[20:34:34] <wikibugs>	 (03CR) 10Eevans: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans)
[20:36:48] <jinxer-wm>	 (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on cloudcumin2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
[20:57:14] <wikibugs>	 (03PS4) 10Eevans: Configure new cassandra-dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113)
[20:57:32] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Configure new cassandra-dev cluster [puppet] - 10https://gerrit.wikimedia.org/r/866640 (https://phabricator.wikimedia.org/T324113) (owner: 10Eevans)
[21:11:12] <icinga-wm>	 PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:22:22] <icinga-wm>	 RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:31:42] <wikibugs>	 (03PS1) 10JHathaway: Add Kwaku Addo Ofori to ops & wmf [puppet] - 10https://gerrit.wikimedia.org/r/866649
[21:55:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:56:45] <wikibugs>	 (03PS1) 10Aqu: HDFS FSImage is backed up to HDFS on monday [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850)
[21:57:02] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] HDFS FSImage is backed up to HDFS on monday [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850) (owner: 10Aqu)
[21:58:26] <wikibugs>	 (03PS2) 10Aqu: HDFS FSImage is backed up to HDFS on monday [puppet] - 10https://gerrit.wikimedia.org/r/866650 (https://phabricator.wikimedia.org/T324850)
[22:00:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[22:15:10] <wikibugs>	 (03PS2) 10Arlolra: Disable wgParserEnableLegacyMediaDOM on group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/865214 (https://phabricator.wikimedia.org/T314318)
[22:15:12] <wikibugs>	 (03PS1) 10Arlolra: Disable wgParserEnableLegacyMediaDOM on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866653 (https://phabricator.wikimedia.org/T297984)
[22:18:44] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 03+1] Disable wgParserEnableLegacyMediaDOM on cawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866653 (https://phabricator.wikimedia.org/T297984) (owner: 10Arlolra)
[22:22:00] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job trafficserver-text in ops@eqsin - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[22:31:28] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 03+2] enable migrate namespace function on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866518 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey)
[22:32:13] <wikibugs>	 (03Merged) 10jenkins-bot: enable migrate namespace function on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/866518 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey)
[22:47:58] <wikibugs>	 (03PS5) 10Ottomata: flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576)
[22:48:44] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] flink-app chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/866510 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata)
[23:03:34] <wikibugs>	 10SRE, 10ops-eqiad, 10Cassandra, 10DC-Ops, 10User-Eevans: Relocate hosts: aqs10[3-5] - https://phabricator.wikimedia.org/T307035 (10Eevans) >>! In T307035#8456972, @Jclark-ctr wrote: > @Eevans  this could possibly happen next week is there a day that works best for you?  I am working on another project n...
[23:08:02] <wikibugs>	 (03PS3) 10Herron: slo_dashboards: dynamic slo dashboard panels [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749)
[23:11:33] <wikibugs>	 (03CR) 10Herron: "updated to include new wdqs slo and improve panel layout using rows per slo" [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/861947 (https://phabricator.wikimedia.org/T320749) (owner: 10Herron)
[23:17:03] <wikibugs>	 (03PS2) 10Cwhite: hiera: map logstash.wm.o to kibana7.eqiad [puppet] - 10https://gerrit.wikimedia.org/r/828112 (https://phabricator.wikimedia.org/T304440)
[23:24:36] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] netmon: Remove the netmon1002 instance as passive node [puppet] - 10https://gerrit.wikimedia.org/r/866526 (https://phabricator.wikimedia.org/T322321) (owner: 10Andrea Denisse)
[23:24:39] <wikibugs>	 (03PS1) 10Krinkle: Add Largest Contentful Paint (LCP) [extensions/NavigationTiming] (wmf/1.40.0-wmf.13) - 10https://gerrit.wikimedia.org/r/866480 (https://phabricator.wikimedia.org/T319329)
[23:31:26] <wikibugs>	 (03CR) 10Cwhite: rsyslog: allow specifying a hiera-defined certfile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865174 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott)
[23:32:03] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] rsyslog: add support for openssl netstream driver (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/865731 (https://phabricator.wikimedia.org/T324623) (owner: 10Southparkfan)
[23:37:42] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/866628 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott)
[23:39:16] <logmsgbot>	 !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.UPGRADE (3 nodes at a time) for ElasticSearch cluster search_eqiad: search_eqiad elasticsearch and plugin upgrade - ryankemper@cumin1001 - T322776
[23:39:22] <stashbot>	 T322776: Deploy Ukrainian Analyzer Plugin - https://phabricator.wikimedia.org/T322776
[23:47:45] <wikibugs>	 (03CR) 10Southparkfan: [C: 03+1] "After https://gerrit.wikimedia.org/r/c/operations/puppet/+/865174/, https://gerrit.wikimedia.org/r/c/operations/puppet/+/865731/ and https" [puppet] - 10https://gerrit.wikimedia.org/r/866628 (https://phabricator.wikimedia.org/T127717) (owner: 10Andrew Bogott)
[23:59:53] <logmsgbot>	 !log bking@cumin2002 END (FAIL) - Cookbook sre.wdqs.data-reload (exit_code=99)