Wikimedia IRC logs browser - #wikimedia-operations

2023-08-23 00:34:21 <wikibugs> (PS1) Ssingh: wmf-config: remove public subnets from reverse-proxy.php [mediawiki-config] - https://gerrit.wikimedia.org/r/951591 (https://phabricator.wikimedia.org/T344704)
2023-08-23 00:35:58 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.cassandra.roll-restart (exit_code=0) for nodes matching A:aqs-eqiad: Disable legacy SSL port — T339299 - eevans@cumin1001
2023-08-23 00:36:02 <stashbot> T339299: Upgrade aqs cluster to Cassandra 4.1.1 - https://phabricator.wikimedia.org/T339299
2023-08-23 00:38:24 <wikibugs> (PS1) TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/951104
2023-08-23 00:38:30 <wikibugs> (CR) TrainBranchBot: [C: +2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/951104 (owner: TrainBranchBot)
2023-08-23 00:41:05 <jinxer-wm> (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
2023-08-23 00:46:59 <icinga-wm> PROBLEM - Check systemd state on logstash2026 is CRITICAL: CRITICAL - degraded: The following units failed: curator_actions_cluster_wide.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 00:54:04 <wikibugs> (Merged) jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - https://gerrit.wikimedia.org/r/951104 (owner: TrainBranchBot)
2023-08-23 01:01:05 <jinxer-wm> (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
2023-08-23 01:22:34 <wikibugs> (PS1) Andrea Denisse: icinga: Add notification when purging nagios resources [puppet] - https://gerrit.wikimedia.org/r/951592 (https://phabricator.wikimedia.org/T263027)
2023-08-23 01:41:05 <jinxer-wm> (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
2023-08-23 01:42:37 <wikibugs> (CR) Andrea Denisse: "PCC results: https://puppet-compiler.wmflabs.org/output/951592/42980/" [puppet] - https://gerrit.wikimedia.org/r/951592 (https://phabricator.wikimedia.org/T263027) (owner: Andrea Denisse)
2023-08-23 02:11:43 <jinxer-wm> (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2023-08-23 02:31:43 <jinxer-wm> (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2023-08-23 03:30:16 <jinxer-wm> (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
2023-08-23 03:40:16 <jinxer-wm> (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
2023-08-23 03:41:05 <jinxer-wm> (SwiftTooManyMediaUploads) firing: Too many codfw mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://grafana.wikimedia.org/d/OPgmB1Eiz/swift?panelId=26&fullscreen&orgId=1&var-DC=codfw - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
2023-08-23 03:45:00 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
2023-08-23 03:45:13 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2152.codfw.wmnet with reason: Maintenance
2023-08-23 03:45:19 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2152 (T344589)', diff saved to https://phabricator.wikimedia.org/P50966 and previous config saved to /var/cache/conftool/dbconfig/20230823-034519-ladsgroup.json
2023-08-23 03:45:23 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
2023-08-23 03:45:37 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1167.eqiad.wmnet with reason: Maintenance
2023-08-23 03:45:39 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
2023-08-23 03:45:43 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1016,1020-1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
2023-08-23 03:45:49 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1167 (T344589)', diff saved to https://phabricator.wikimedia.org/P50967 and previous config saved to /var/cache/conftool/dbconfig/20230823-034549-ladsgroup.json
2023-08-23 03:46:01 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
2023-08-23 03:46:13 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance
2023-08-23 03:46:14 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1140.eqiad.wmnet with reason: Maintenance
2023-08-23 03:46:37 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2109.codfw.wmnet with reason: Maintenance
2023-08-23 03:46:43 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50968 and previous config saved to /var/cache/conftool/dbconfig/20230823-034643-ladsgroup.json
2023-08-23 03:46:47 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 03:46:56 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P50969 and previous config saved to /var/cache/conftool/dbconfig/20230823-034656-ladsgroup.json
2023-08-23 03:50:42 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T344589)', diff saved to https://phabricator.wikimedia.org/P50970 and previous config saved to /var/cache/conftool/dbconfig/20230823-035042-ladsgroup.json
2023-08-23 03:51:05 <jinxer-wm> (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
2023-08-23 03:51:58 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T344589)', diff saved to https://phabricator.wikimedia.org/P50971 and previous config saved to /var/cache/conftool/dbconfig/20230823-035157-ladsgroup.json
2023-08-23 03:54:39 <icinga-wm> PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
2023-08-23 03:54:55 <icinga-wm> PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
2023-08-23 03:56:48 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1028.eqiad.wmnet with reason: Maintenance
2023-08-23 03:57:02 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1028.eqiad.wmnet with reason: Maintenance
2023-08-23 03:57:08 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1028 (T344589)', diff saved to https://phabricator.wikimedia.org/P50972 and previous config saved to /var/cache/conftool/dbconfig/20230823-035707-ladsgroup.json
2023-08-23 04:01:49 <icinga-wm> RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.276 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
2023-08-23 04:01:58 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1028 (T344589)', diff saved to https://phabricator.wikimedia.org/P50973 and previous config saved to /var/cache/conftool/dbconfig/20230823-040158-ladsgroup.json
2023-08-23 04:02:03 <icinga-wm> RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50420 bytes in 0.065 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
2023-08-23 04:02:07 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P50974 and previous config saved to /var/cache/conftool/dbconfig/20230823-040207-ladsgroup.json
2023-08-23 04:05:48 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P50975 and previous config saved to /var/cache/conftool/dbconfig/20230823-040548-ladsgroup.json
2023-08-23 04:07:04 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P50976 and previous config saved to /var/cache/conftool/dbconfig/20230823-040704-ladsgroup.json
2023-08-23 04:17:05 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1028', diff saved to https://phabricator.wikimedia.org/P50977 and previous config saved to /var/cache/conftool/dbconfig/20230823-041704-ladsgroup.json
2023-08-23 04:17:12 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P50978 and previous config saved to /var/cache/conftool/dbconfig/20230823-041712-ladsgroup.json
2023-08-23 04:19:42 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
2023-08-23 04:19:55 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1150.eqiad.wmnet with reason: Maintenance
2023-08-23 04:20:54 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152', diff saved to https://phabricator.wikimedia.org/P50979 and previous config saved to /var/cache/conftool/dbconfig/20230823-042054-ladsgroup.json
2023-08-23 04:22:10 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167', diff saved to https://phabricator.wikimedia.org/P50980 and previous config saved to /var/cache/conftool/dbconfig/20230823-042210-ladsgroup.json
2023-08-23 04:25:16 <jinxer-wm> (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
2023-08-23 04:27:32 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P50981 and previous config saved to /var/cache/conftool/dbconfig/20230823-042732-ladsgroup.json
2023-08-23 04:27:37 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 04:30:16 <jinxer-wm> (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
2023-08-23 04:32:11 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1028', diff saved to https://phabricator.wikimedia.org/P50982 and previous config saved to /var/cache/conftool/dbconfig/20230823-043210-ladsgroup.json
2023-08-23 04:32:17 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'es1022 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P50983 and previous config saved to /var/cache/conftool/dbconfig/20230823-043216-ladsgroup.json
2023-08-23 04:33:34 <wikibugs> ops-eqiad: Inbound interface errors - https://phabricator.wikimedia.org/T342502 (phaultfinder)
2023-08-23 04:34:03 <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-08-23 04:36:01 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2152 (T344589)', diff saved to https://phabricator.wikimedia.org/P50984 and previous config saved to /var/cache/conftool/dbconfig/20230823-043600-ladsgroup.json
2023-08-23 04:36:06 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
2023-08-23 04:36:19 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2154.codfw.wmnet with reason: Maintenance
2023-08-23 04:36:26 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2154 (T344589)', diff saved to https://phabricator.wikimedia.org/P50985 and previous config saved to /var/cache/conftool/dbconfig/20230823-043625-ladsgroup.json
2023-08-23 04:37:17 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1167 (T344589)', diff saved to https://phabricator.wikimedia.org/P50986 and previous config saved to /var/cache/conftool/dbconfig/20230823-043716-ladsgroup.json
2023-08-23 04:37:22 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
2023-08-23 04:37:35 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1172.eqiad.wmnet with reason: Maintenance
2023-08-23 04:37:41 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1172 (T344589)', diff saved to https://phabricator.wikimedia.org/P50987 and previous config saved to /var/cache/conftool/dbconfig/20230823-043741-ladsgroup.json
2023-08-23 04:39:03 <jinxer-wm> (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-08-23 04:42:39 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P50988 and previous config saved to /var/cache/conftool/dbconfig/20230823-044238-ladsgroup.json
2023-08-23 04:42:51 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T344589)', diff saved to https://phabricator.wikimedia.org/P50989 and previous config saved to /var/cache/conftool/dbconfig/20230823-044251-ladsgroup.json
2023-08-23 04:43:35 <logmsgbot> !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reboot - ryankemper@cumin1001 - T344587
2023-08-23 04:43:53 <logmsgbot> !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.elasticsearch.rolling-operation (exit_code=99) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reboot - ryankemper@cumin1001 - T344587
2023-08-23 04:43:57 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T344589)', diff saved to https://phabricator.wikimedia.org/P50990 and previous config saved to /var/cache/conftool/dbconfig/20230823-044356-ladsgroup.json
2023-08-23 04:44:33 <logmsgbot> !log ryankemper@cumin1001 START - Cookbook sre.elasticsearch.rolling-operation Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reboot - ryankemper@cumin1001 - T344587
2023-08-23 04:47:17 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1028 (T344589)', diff saved to https://phabricator.wikimedia.org/P50991 and previous config saved to /var/cache/conftool/dbconfig/20230823-044717-ladsgroup.json
2023-08-23 04:47:21 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on es1031.eqiad.wmnet with reason: Maintenance
2023-08-23 04:47:35 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on es1031.eqiad.wmnet with reason: Maintenance
2023-08-23 04:47:41 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling es1031 (T344589)', diff saved to https://phabricator.wikimedia.org/P50992 and previous config saved to /var/cache/conftool/dbconfig/20230823-044741-ladsgroup.json
2023-08-23 04:48:03 <icinga-wm> PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 99 threshold =0.15 breach: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 191, active_shards: 283, relocating_shards: 0, initializing_shards: 2, unassigned_shards: 97, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, numbe
2023-08-23 04:48:03 <icinga-wm> flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 74.08376963350786 https://wikitech.wikimedia.org/wiki/Search%23Administration
2023-08-23 04:49:29 <icinga-wm> RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 191, active_shards: 382, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max
2023-08-23 04:49:29 <icinga-wm> _in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
2023-08-23 04:50:19 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
2023-08-23 04:50:32 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1166.eqiad.wmnet with reason: Maintenance
2023-08-23 04:50:39 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1166 (T343718)', diff saved to https://phabricator.wikimedia.org/P50993 and previous config saved to /var/cache/conftool/dbconfig/20230823-045038-ladsgroup.json
2023-08-23 04:50:43 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 04:53:37 <logmsgbot> !log ryankemper@cumin1001 END (PASS) - Cookbook sre.elasticsearch.rolling-operation (exit_code=0) Operation.REBOOT (1 nodes at a time) for ElasticSearch cluster relforge: relforge cluster reboot - ryankemper@cumin1001 - T344587
2023-08-23 04:56:06 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T343718)', diff saved to https://phabricator.wikimedia.org/P50994 and previous config saved to /var/cache/conftool/dbconfig/20230823-045606-ladsgroup.json
2023-08-23 04:56:09 <logmsgbot> !log ryankemper@cumin1001 START - Cookbook sre.wdqs.reboot
2023-08-23 04:56:10 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 04:56:25 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1031 (T344589)', diff saved to https://phabricator.wikimedia.org/P50995 and previous config saved to /var/cache/conftool/dbconfig/20230823-045625-ladsgroup.json
2023-08-23 04:57:43 <jinxer-wm> (SystemdUnitFailed) resolved: (2) prometheus-wmf-elasticsearch-exporter-9200.service Failed on relforge1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-08-23 04:57:45 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109', diff saved to https://phabricator.wikimedia.org/P50996 and previous config saved to /var/cache/conftool/dbconfig/20230823-045744-ladsgroup.json
2023-08-23 04:57:57 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P50997 and previous config saved to /var/cache/conftool/dbconfig/20230823-045757-ladsgroup.json
2023-08-23 04:59:03 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P50998 and previous config saved to /var/cache/conftool/dbconfig/20230823-045902-ladsgroup.json
2023-08-23 05:03:35 <icinga-wm> PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
2023-08-23 05:11:12 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P50999 and previous config saved to /var/cache/conftool/dbconfig/20230823-051112-ladsgroup.json
2023-08-23 05:11:32 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1031', diff saved to https://phabricator.wikimedia.org/P51000 and previous config saved to /var/cache/conftool/dbconfig/20230823-051131-ladsgroup.json
2023-08-23 05:12:51 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2109 (T343718)', diff saved to https://phabricator.wikimedia.org/P51001 and previous config saved to /var/cache/conftool/dbconfig/20230823-051251-ladsgroup.json
2023-08-23 05:12:53 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
2023-08-23 05:12:55 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 05:13:04 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154', diff saved to https://phabricator.wikimedia.org/P51002 and previous config saved to /var/cache/conftool/dbconfig/20230823-051303-ladsgroup.json
2023-08-23 05:13:06 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2127.codfw.wmnet with reason: Maintenance
2023-08-23 05:13:12 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2127 (T343718)', diff saved to https://phabricator.wikimedia.org/P51003 and previous config saved to /var/cache/conftool/dbconfig/20230823-051312-ladsgroup.json
2023-08-23 05:14:09 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172', diff saved to https://phabricator.wikimedia.org/P51004 and previous config saved to /var/cache/conftool/dbconfig/20230823-051409-ladsgroup.json
2023-08-23 05:26:18 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166', diff saved to https://phabricator.wikimedia.org/P51005 and previous config saved to /var/cache/conftool/dbconfig/20230823-052618-ladsgroup.json
2023-08-23 05:26:38 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1031', diff saved to https://phabricator.wikimedia.org/P51006 and previous config saved to /var/cache/conftool/dbconfig/20230823-052637-ladsgroup.json
2023-08-23 05:28:10 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2154 (T344589)', diff saved to https://phabricator.wikimedia.org/P51007 and previous config saved to /var/cache/conftool/dbconfig/20230823-052809-ladsgroup.json
2023-08-23 05:28:15 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance
2023-08-23 05:28:29 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2162.codfw.wmnet with reason: Maintenance
2023-08-23 05:28:35 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2162 (T344589)', diff saved to https://phabricator.wikimedia.org/P51008 and previous config saved to /var/cache/conftool/dbconfig/20230823-052834-ladsgroup.json
2023-08-23 05:29:15 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1172 (T344589)', diff saved to https://phabricator.wikimedia.org/P51009 and previous config saved to /var/cache/conftool/dbconfig/20230823-052915-ladsgroup.json
2023-08-23 05:29:20 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance
2023-08-23 05:29:33 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1177.eqiad.wmnet with reason: Maintenance
2023-08-23 05:29:39 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1177 (T344589)', diff saved to https://phabricator.wikimedia.org/P51010 and previous config saved to /var/cache/conftool/dbconfig/20230823-052939-ladsgroup.json
2023-08-23 05:29:45 <icinga-wm> PROBLEM - MediaWiki EtcdConfig up-to-date on mw1463 is CRITICAL: etcd last index (2367197) is outdated compared to the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:29:45 <icinga-wm> PROBLEM - MediaWiki EtcdConfig up-to-date on mw1452 is CRITICAL: etcd last index (2367197) is outdated compared to the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:29:45 <icinga-wm> PROBLEM - MediaWiki EtcdConfig up-to-date on mw2451 is CRITICAL: etcd last index (3964209) is outdated compared to the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:29:47 <icinga-wm> PROBLEM - MediaWiki EtcdConfig up-to-date on mw1396 is CRITICAL: etcd last index (2367197) is outdated compared to the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:29:47 <icinga-wm> PROBLEM - MediaWiki EtcdConfig up-to-date on mw1394 is CRITICAL: etcd last index (2367197) is outdated compared to the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:29:47 <icinga-wm> PROBLEM - MediaWiki EtcdConfig up-to-date on mw2436 is CRITICAL: etcd last index (3964209) is outdated compared to the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:29:47 <icinga-wm> PROBLEM - MediaWiki EtcdConfig up-to-date on mw2447 is CRITICAL: etcd last index (3964209) is outdated compared to the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:29:47 <icinga-wm> PROBLEM - MediaWiki EtcdConfig up-to-date on mw2439 is CRITICAL: etcd last index (3964209) is outdated compared to the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:29:47 <icinga-wm> PROBLEM - MediaWiki EtcdConfig up-to-date on mw2302 is CRITICAL: etcd last index (3964209) is outdated compared to the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:29:49 <icinga-wm> PROBLEM - MediaWiki EtcdConfig up-to-date on mw1378 is CRITICAL: etcd last index (2367197) is outdated compared to the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:29:49 <icinga-wm> PROBLEM - MediaWiki EtcdConfig up-to-date on mw2437 is CRITICAL: etcd last index (3964209) is outdated compared to the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:29:49 <icinga-wm> PROBLEM - MediaWiki EtcdConfig up-to-date on mw2432 is CRITICAL: etcd last index (3964209) is outdated compared to the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:29:49 <icinga-wm> PROBLEM - MediaWiki EtcdConfig up-to-date on mw1413 is CRITICAL: etcd last index (2367197) is outdated compared to the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:29:49 <icinga-wm> PROBLEM - MediaWiki EtcdConfig up-to-date on mw2418 is CRITICAL: etcd last index (3964209) is outdated compared to the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:31:09 <icinga-wm> RECOVERY - MediaWiki EtcdConfig up-to-date on mw1463 is OK: etcd last index (2367200) matches the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:31:09 <icinga-wm> RECOVERY - MediaWiki EtcdConfig up-to-date on mw1452 is OK: etcd last index (2367200) matches the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:31:11 <icinga-wm> RECOVERY - MediaWiki EtcdConfig up-to-date on mw2451 is OK: etcd last index (3964215) matches the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:31:11 <icinga-wm> RECOVERY - MediaWiki EtcdConfig up-to-date on mw1396 is OK: etcd last index (2367200) matches the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:31:11 <icinga-wm> RECOVERY - MediaWiki EtcdConfig up-to-date on mw1394 is OK: etcd last index (2367200) matches the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:31:13 <icinga-wm> RECOVERY - MediaWiki EtcdConfig up-to-date on mw2447 is OK: etcd last index (3964215) matches the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:31:13 <icinga-wm> RECOVERY - MediaWiki EtcdConfig up-to-date on mw2436 is OK: etcd last index (3964215) matches the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:31:13 <icinga-wm> RECOVERY - MediaWiki EtcdConfig up-to-date on mw2439 is OK: etcd last index (3964215) matches the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:31:13 <icinga-wm> RECOVERY - MediaWiki EtcdConfig up-to-date on mw2302 is OK: etcd last index (3964215) matches the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:31:13 <icinga-wm> RECOVERY - MediaWiki EtcdConfig up-to-date on mw1378 is OK: etcd last index (2367200) matches the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:31:13 <icinga-wm> RECOVERY - MediaWiki EtcdConfig up-to-date on mw2432 is OK: etcd last index (3964215) matches the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:31:13 <icinga-wm> RECOVERY - MediaWiki EtcdConfig up-to-date on mw2437 is OK: etcd last index (3964215) matches the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:31:14 <icinga-wm> RECOVERY - MediaWiki EtcdConfig up-to-date on mw1413 is OK: etcd last index (2367200) matches the master one (2367200) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:31:14 <icinga-wm> RECOVERY - MediaWiki EtcdConfig up-to-date on mw2418 is OK: etcd last index (3964215) matches the master one (3964215) https://wikitech.wikimedia.org/wiki/Etcd
2023-08-23 05:34:55 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T344589)', diff saved to https://phabricator.wikimedia.org/P51011 and previous config saved to /var/cache/conftool/dbconfig/20230823-053454-ladsgroup.json
2023-08-23 05:35:54 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T344589)', diff saved to https://phabricator.wikimedia.org/P51012 and previous config saved to /var/cache/conftool/dbconfig/20230823-053553-ladsgroup.json
2023-08-23 05:41:25 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1166 (T343718)', diff saved to https://phabricator.wikimedia.org/P51013 and previous config saved to /var/cache/conftool/dbconfig/20230823-054124-ladsgroup.json
2023-08-23 05:41:26 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
2023-08-23 05:41:29 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 05:41:39 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1175.eqiad.wmnet with reason: Maintenance
2023-08-23 05:41:44 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance es1031 (T344589)', diff saved to https://phabricator.wikimedia.org/P51014 and previous config saved to /var/cache/conftool/dbconfig/20230823-054144-ladsgroup.json
2023-08-23 05:50:01 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P51015 and previous config saved to /var/cache/conftool/dbconfig/20230823-055000-ladsgroup.json
2023-08-23 05:50:41 <wikibugs> (CR) Zabe: [C: +2] wmf-config: remove public subnets from reverse-proxy.php [mediawiki-config] - https://gerrit.wikimedia.org/r/951591 (https://phabricator.wikimedia.org/T344704) (owner: Ssingh)
2023-08-23 05:50:52 <logmsgbot> !log zabe@deploy1002 Backport cancelled.
2023-08-23 05:51:00 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P51016 and previous config saved to /var/cache/conftool/dbconfig/20230823-055059-ladsgroup.json
2023-08-23 05:51:03 <taavi> zabe: please don't deploy that reverse-proxy change
2023-08-23 05:51:06 <jinxer-wm> (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
2023-08-23 05:51:09 <wikibugs> (CR) Majavah: [C: -2] wmf-config: remove public subnets from reverse-proxy.php [mediawiki-config] - https://gerrit.wikimedia.org/r/951591 (https://phabricator.wikimedia.org/T344704) (owner: Ssingh)
2023-08-23 05:51:38 <wikibugs> (PS1) Marostegui: dbproxy1012: Host decommissioned [puppet] - https://gerrit.wikimedia.org/r/951709
2023-08-23 05:52:15 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T343718)', diff saved to https://phabricator.wikimedia.org/P51017 and previous config saved to /var/cache/conftool/dbconfig/20230823-055215-ladsgroup.json
2023-08-23 05:52:19 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 05:53:16 <wikibugs> (CR) Marostegui: [C: +2] dbproxy1012: Host decommissioned [puppet] - https://gerrit.wikimedia.org/r/951709 (owner: Marostegui)
2023-08-23 05:53:43 <wikibugs> (CR) Majavah: [C: -1] "So there's one specific edge case here, which is cloudweb* (wikitech app servers). Removing the edges is ok, but for eqiad and codfw we ne" [mediawiki-config] - https://gerrit.wikimedia.org/r/951591 (https://phabricator.wikimedia.org/T344704) (owner: Ssingh)
2023-08-23 05:54:04 <zabe> taavi: is there a reason for that?
2023-08-23 05:55:01 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T343718)', diff saved to https://phabricator.wikimedia.org/P51018 and previous config saved to /var/cache/conftool/dbconfig/20230823-055500-ladsgroup.json
2023-08-23 05:55:05 <wikibugs> (PS1) Marostegui: dbproxy1013: Host decommissioned [puppet] - https://gerrit.wikimedia.org/r/951710
2023-08-23 05:55:53 <wikibugs> (CR) Marostegui: [C: +2] dbproxy1013: Host decommissioned [puppet] - https://gerrit.wikimedia.org/r/951710 (owner: Marostegui)
2023-08-23 05:56:12 <taavi> zabe: I left a comment on the patch, but basically as is it would currently break wikitech
2023-08-23 05:56:48 <zabe> ok
2023-08-23 06:00:05 <jouncebot> Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T0600)
2023-08-23 06:02:12 <wikibugs> (PS1) Marostegui: dbproxy1014: Host decommissioned [puppet] - https://gerrit.wikimedia.org/r/951711
2023-08-23 06:02:58 <wikibugs> (CR) Marostegui: [C: +2] dbproxy1014: Host decommissioned [puppet] - https://gerrit.wikimedia.org/r/951711 (owner: Marostegui)
2023-08-23 06:05:07 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162', diff saved to https://phabricator.wikimedia.org/P51019 and previous config saved to /var/cache/conftool/dbconfig/20230823-060506-ladsgroup.json
2023-08-23 06:05:43 <wikibugs> (CR) Gmodena: [C: +2] Expose mediawiki.page_change.v1 publicly. [deployment-charts] - https://gerrit.wikimedia.org/r/951426 (https://phabricator.wikimedia.org/T336817) (owner: Gmodena)
2023-08-23 06:06:06 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177', diff saved to https://phabricator.wikimedia.org/P51020 and previous config saved to /var/cache/conftool/dbconfig/20230823-060606-ladsgroup.json
2023-08-23 06:06:34 <wikibugs> (Merged) jenkins-bot: Expose mediawiki.page_change.v1 publicly. [deployment-charts] - https://gerrit.wikimedia.org/r/951426 (https://phabricator.wikimedia.org/T336817) (owner: Gmodena)
2023-08-23 06:07:21 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P51021 and previous config saved to /var/cache/conftool/dbconfig/20230823-060721-ladsgroup.json
2023-08-23 06:10:07 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P51022 and previous config saved to /var/cache/conftool/dbconfig/20230823-061007-ladsgroup.json
2023-08-23 06:18:51 <jelto> Short Gerrit maintenance starts in 15 minutes. It will take around 10 minutes.
2023-08-23 06:20:13 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2162 (T344589)', diff saved to https://phabricator.wikimedia.org/P51023 and previous config saved to /var/cache/conftool/dbconfig/20230823-062013-ladsgroup.json
2023-08-23 06:20:19 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
2023-08-23 06:20:32 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2163.codfw.wmnet with reason: Maintenance
2023-08-23 06:20:39 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2163 (T344589)', diff saved to https://phabricator.wikimedia.org/P51024 and previous config saved to /var/cache/conftool/dbconfig/20230823-062038-ladsgroup.json
2023-08-23 06:21:12 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1177 (T344589)', diff saved to https://phabricator.wikimedia.org/P51025 and previous config saved to /var/cache/conftool/dbconfig/20230823-062112-ladsgroup.json
2023-08-23 06:21:17 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance
2023-08-23 06:21:30 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1178.eqiad.wmnet with reason: Maintenance
2023-08-23 06:21:36 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1178 (T344589)', diff saved to https://phabricator.wikimedia.org/P51026 and previous config saved to /var/cache/conftool/dbconfig/20230823-062136-ladsgroup.json
2023-08-23 06:22:13 <wikibugs> (CR) Jelto: [C: +2] gerrit: raise maxConnectionsPerUser to 8 [puppet] - https://gerrit.wikimedia.org/r/949026 (https://phabricator.wikimedia.org/T344238) (owner: Jelto)
2023-08-23 06:22:28 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175', diff saved to https://phabricator.wikimedia.org/P51027 and previous config saved to /var/cache/conftool/dbconfig/20230823-062227-ladsgroup.json
2023-08-23 06:25:13 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127', diff saved to https://phabricator.wikimedia.org/P51028 and previous config saved to /var/cache/conftool/dbconfig/20230823-062513-ladsgroup.json
2023-08-23 06:27:02 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T344589)', diff saved to https://phabricator.wikimedia.org/P51029 and previous config saved to /var/cache/conftool/dbconfig/20230823-062701-ladsgroup.json
2023-08-23 06:30:04 <jouncebot> Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T0600)
2023-08-23 06:30:04 <jouncebot> Deploy window [https://wikitech.wikimedia.org/wiki/Gerrit (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T0630)
2023-08-23 06:31:44 <jinxer-wm> (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2023-08-23 06:32:03 <jelto> Gerrit will restart now, should take less than 10 minutes
2023-08-23 06:33:08 <logmsgbot> !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host gerrit1003.wikimedia.org
2023-08-23 06:37:34 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1175 (T343718)', diff saved to https://phabricator.wikimedia.org/P51030 and previous config saved to /var/cache/conftool/dbconfig/20230823-063733-ladsgroup.json
2023-08-23 06:37:36 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
2023-08-23 06:37:38 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 06:37:49 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1189.eqiad.wmnet with reason: Maintenance
2023-08-23 06:37:55 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1189 (T343718)', diff saved to https://phabricator.wikimedia.org/P51031 and previous config saved to /var/cache/conftool/dbconfig/20230823-063754-ladsgroup.json
2023-08-23 06:40:20 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2127 (T343718)', diff saved to https://phabricator.wikimedia.org/P51032 and previous config saved to /var/cache/conftool/dbconfig/20230823-064019-ladsgroup.json
2023-08-23 06:40:21 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
2023-08-23 06:40:35 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2139.codfw.wmnet with reason: Maintenance
2023-08-23 06:40:37 <logmsgbot> !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host gerrit1003.wikimedia.org
2023-08-23 06:41:00 <jelto> Gerrit maintenance done
2023-08-23 06:42:08 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P51033 and previous config saved to /var/cache/conftool/dbconfig/20230823-064207-ladsgroup.json
2023-08-23 06:44:22 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T343718)', diff saved to https://phabricator.wikimedia.org/P51034 and previous config saved to /var/cache/conftool/dbconfig/20230823-064421-ladsgroup.json
2023-08-23 06:44:26 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 06:53:31 <wikibugs> (PS1) Marostegui: dbproxy1015: Host decommissioned [puppet] - https://gerrit.wikimedia.org/r/951821
2023-08-23 06:53:56 <wikibugs> (CR) Marostegui: [C: +2] dbproxy1015: Host decommissioned [puppet] - https://gerrit.wikimedia.org/r/951821 (owner: Marostegui)
2023-08-23 06:54:35 <wikibugs> SRE, ops-knams, DC-Ops, Patch-For-Review: Replacement of esams VMs in knams Ganeti clusters - https://phabricator.wikimedia.org/T344355 (MoritzMuehlenhoff) Open→Resolved This is complete
2023-08-23 06:54:44 <wikibugs> SRE, ops-knams, DC-Ops, Patch-For-Review: Main Tracking Task for ESAMS Migration to KNAMS - https://phabricator.wikimedia.org/T329219 (MoritzMuehlenhoff)
2023-08-23 06:57:14 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163', diff saved to https://phabricator.wikimedia.org/P51035 and previous config saved to /var/cache/conftool/dbconfig/20230823-065714-ladsgroup.json
2023-08-23 06:59:28 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P51036 and previous config saved to /var/cache/conftool/dbconfig/20230823-065927-ladsgroup.json
2023-08-23 07:00:04 <jouncebot> Amir1 and Urbanecm: Time to snap out of that daydream and deploy UTC morning backport window. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T0700).
2023-08-23 07:00:04 <jouncebot> No Gerrit patches in the queue for this window AFAICS.
2023-08-23 07:09:42 <wikibugs> (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/951534 (owner: Jbond)
2023-08-23 07:12:21 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2163 (T344589)', diff saved to https://phabricator.wikimedia.org/P51037 and previous config saved to /var/cache/conftool/dbconfig/20230823-071220-ladsgroup.json
2023-08-23 07:12:27 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
2023-08-23 07:12:40 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2164.codfw.wmnet with reason: Maintenance
2023-08-23 07:12:41 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
2023-08-23 07:12:44 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
2023-08-23 07:12:50 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2164 (T344589)', diff saved to https://phabricator.wikimedia.org/P51038 and previous config saved to /var/cache/conftool/dbconfig/20230823-071249-ladsgroup.json
2023-08-23 07:13:57 <logmsgbot> !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host idp2002.wikimedia.org
2023-08-23 07:14:34 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189', diff saved to https://phabricator.wikimedia.org/P51039 and previous config saved to /var/cache/conftool/dbconfig/20230823-071433-ladsgroup.json
2023-08-23 07:17:54 <logmsgbot> !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host idp2002.wikimedia.org
2023-08-23 07:19:17 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T344589)', diff saved to https://phabricator.wikimedia.org/P51040 and previous config saved to /var/cache/conftool/dbconfig/20230823-071916-ladsgroup.json
2023-08-23 07:19:34 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
2023-08-23 07:19:47 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2149.codfw.wmnet with reason: Maintenance
2023-08-23 07:19:53 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2149 (T343718)', diff saved to https://phabricator.wikimedia.org/P51041 and previous config saved to /var/cache/conftool/dbconfig/20230823-071953-ladsgroup.json
2023-08-23 07:19:57 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 07:25:24 <wikibugs> (CR) Muehlenhoff: Adapt monitoring/metrics rules for nft and ferm providers (2 comments) [puppet] - https://gerrit.wikimedia.org/r/951512 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
2023-08-23 07:29:40 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1189 (T343718)', diff saved to https://phabricator.wikimedia.org/P51042 and previous config saved to /var/cache/conftool/dbconfig/20230823-072940-ladsgroup.json
2023-08-23 07:29:42 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance
2023-08-23 07:29:45 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 07:29:55 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1198.eqiad.wmnet with reason: Maintenance
2023-08-23 07:30:01 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1198 (T343718)', diff saved to https://phabricator.wikimedia.org/P51043 and previous config saved to /var/cache/conftool/dbconfig/20230823-073001-ladsgroup.json
2023-08-23 07:34:23 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P51044 and previous config saved to /var/cache/conftool/dbconfig/20230823-073422-ladsgroup.json
2023-08-23 07:34:57 <wikibugs> (CR) JMeybohm: "I would argue that this is a prometheus alert, not a k8s one (similar to JobUnavailable but more explicit and with higher severity). Shoul" [alerts] - https://gerrit.wikimedia.org/r/951526 (https://phabricator.wikimedia.org/T343529) (owner: Filippo Giunchedi)
2023-08-23 07:35:21 <icinga-wm> PROBLEM - restbase endpoints health on restbase1028 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:35:21 <icinga-wm> PROBLEM - restbase endpoints health on restbase1021 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:35:21 <icinga-wm> PROBLEM - restbase endpoints health on restbase1026 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:35:21 <icinga-wm> PROBLEM - restbase endpoints health on restbase1017 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:35:21 <icinga-wm> PROBLEM - restbase endpoints health on restbase1024 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:35:29 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T343718)', diff saved to https://phabricator.wikimedia.org/P51045 and previous config saved to /var/cache/conftool/dbconfig/20230823-073529-ladsgroup.json
2023-08-23 07:35:33 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 07:35:35 <icinga-wm> PROBLEM - restbase endpoints health on restbase1031 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:35:35 <icinga-wm> PROBLEM - restbase endpoints health on restbase1022 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:35:35 <icinga-wm> PROBLEM - restbase endpoints health on restbase1025 is CRITICAL: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:35:41 <icinga-wm> PROBLEM - cassandra-b CQL 10.64.48.185:9042 on restbase1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
2023-08-23 07:35:53 <icinga-wm> PROBLEM - restbase endpoints health on restbase1020 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:36:02 <logmsgbot> !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM idm2001.wikimedia.org
2023-08-23 07:36:15 <icinga-wm> PROBLEM - cassandra-a CQL 10.64.48.184:9042 on restbase1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
2023-08-23 07:36:17 <icinga-wm> PROBLEM - restbase endpoints health on restbase1030 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:36:17 <icinga-wm> PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:36:17 <icinga-wm> PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:36:21 <icinga-wm> PROBLEM - cassandra-c CQL 10.64.48.186:9042 on restbase1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886
2023-08-23 07:36:43 <icinga-wm> RECOVERY - restbase endpoints health on restbase1017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:36:43 <icinga-wm> RECOVERY - restbase endpoints health on restbase1021 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:36:43 <icinga-wm> RECOVERY - restbase endpoints health on restbase1026 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:36:45 <icinga-wm> RECOVERY - restbase endpoints health on restbase1024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:36:59 <icinga-wm> RECOVERY - restbase endpoints health on restbase1022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:37:15 <icinga-wm> RECOVERY - restbase endpoints health on restbase1020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:37:35 <icinga-wm> RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:37:35 <icinga-wm> RECOVERY - restbase endpoints health on restbase1030 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:37:39 <icinga-wm> RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:38:09 <icinga-wm> RECOVERY - restbase endpoints health on restbase1028 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:38:19 <icinga-wm> RECOVERY - restbase endpoints health on restbase1031 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:38:21 <icinga-wm> RECOVERY - restbase endpoints health on restbase1025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 07:39:55 <logmsgbot> !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm2001.wikimedia.org
2023-08-23 07:40:19 <wikibugs> (PS1) Muehlenhoff: Make firewall logging conditional on ferm and rename the profile [puppet] - https://gerrit.wikimedia.org/r/951828 (https://phabricator.wikimedia.org/T336497)
2023-08-23 07:40:42 <wikibugs> (CR) CI reject: [V: -1] Make firewall logging conditional on ferm and rename the profile [puppet] - https://gerrit.wikimedia.org/r/951828 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
2023-08-23 07:42:46 <wikibugs> (PS2) Muehlenhoff: Make firewall logging conditional on ferm and rename the profile [puppet] - https://gerrit.wikimedia.org/r/951828 (https://phabricator.wikimedia.org/T336497)
2023-08-23 07:48:00 <jelto> etherpad needs a short maintenance in 15 minutes
2023-08-23 07:49:29 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164', diff saved to https://phabricator.wikimedia.org/P51046 and previous config saved to /var/cache/conftool/dbconfig/20230823-074928-ladsgroup.json
2023-08-23 07:49:47 <icinga-wm> PROBLEM - Check systemd state on mirror1001 is CRITICAL: CRITICAL - degraded: The following units failed: update-ubuntu-mirror.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 07:50:35 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P51047 and previous config saved to /var/cache/conftool/dbconfig/20230823-075035-ladsgroup.json
2023-08-23 07:53:16 <jinxer-wm> (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
2023-08-23 07:53:36 <wikibugs> (CR) Muehlenhoff: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/951828 (https://phabricator.wikimedia.org/T336497) (owner: Muehlenhoff)
2023-08-23 07:53:37 <icinga-wm> PROBLEM - Host wdqs1005 is DOWN: PING CRITICAL - Packet loss = 100%
2023-08-23 07:56:19 <logmsgbot> !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_ulsfo and A:cp
2023-08-23 07:58:16 <jinxer-wm> (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
2023-08-23 08:00:16 <logmsgbot> !log jelto@cumin1001 START - Cookbook sre.hosts.reboot-single for host etherpad1003.eqiad.wmnet
2023-08-23 08:00:17 <fabfur> !log running puppet agent on lvs5006
2023-08-23 08:00:30 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 08:01:28 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T343718)', diff saved to https://phabricator.wikimedia.org/P51048 and previous config saved to /var/cache/conftool/dbconfig/20230823-080127-ladsgroup.json
2023-08-23 08:01:34 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 08:02:47 <wikibugs> (PS1) Muehlenhoff: Convert the monitoring/prometheus ferm rules to a firewall::service [puppet] - https://gerrit.wikimedia.org/r/951830 (https://phabricator.wikimedia.org/T336497)
2023-08-23 08:03:11 <icinga-wm> PROBLEM - SSH on restbase1027 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
2023-08-23 08:03:52 <logmsgbot> !log slyngshede@cumin1001 START - Cookbook sre.ganeti.reboot-vm for VM idm-test1001.wikimedia.org
2023-08-23 08:04:15 <logmsgbot> !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host etherpad1003.eqiad.wmnet
2023-08-23 08:04:35 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2164 (T344589)', diff saved to https://phabricator.wikimedia.org/P51049 and previous config saved to /var/cache/conftool/dbconfig/20230823-080435-ladsgroup.json
2023-08-23 08:04:37 <icinga-wm> RECOVERY - SSH on restbase1027 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
2023-08-23 08:04:41 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
2023-08-23 08:04:54 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2165.codfw.wmnet with reason: Maintenance
2023-08-23 08:05:00 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2165 (T344589)', diff saved to https://phabricator.wikimedia.org/P51050 and previous config saved to /var/cache/conftool/dbconfig/20230823-080500-ladsgroup.json
2023-08-23 08:05:42 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198', diff saved to https://phabricator.wikimedia.org/P51051 and previous config saved to /var/cache/conftool/dbconfig/20230823-080541-ladsgroup.json
2023-08-23 08:07:43 <logmsgbot> !log slyngshede@cumin1001 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM idm-test1001.wikimedia.org
2023-08-23 08:08:50 <wikibugs> ('PS1) ''Fabfur: haproxy: sanitize eventual duplicate content-length header [puppet] - ''https://gerrit.wikimedia.org/r/951832 (https://phabricator.wikimedia.org/T344047)'
2023-08-23 08:09:00 <wikibugs> ('PS1) ''Dreamy Jazz: clienthints: Remove duplicate entries when converting to DB rows [extensions/CheckUser] (wmf/1.41.0-wmf.22) - ''https://gerrit.wikimedia.org/r/951846 (https://phabricator.wikimedia.org/T344787)'
2023-08-23 08:09:12 <wikibugs> ('PS1) ''Dreamy Jazz: clienthints: Remove duplicate entries when converting to DB rows [extensions/CheckUser] (wmf/1.41.0-wmf.23) - ''https://gerrit.wikimedia.org/r/951847 (https://phabricator.wikimedia.org/T344787)'
2023-08-23 08:11:22 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T344589)', diff saved to https://phabricator.wikimedia.org/P51052 and previous config saved to /var/cache/conftool/dbconfig/20230823-081122-ladsgroup.json
2023-08-23 08:12:05 <wikibugs> ('CR) ''Btullis: [C: ''+1] switch an-worker[17-48] to reuse-analytics-hadoop recipe [puppet] - ''https://gerrit.wikimedia.org/r/951458 (https://phabricator.wikimedia.org/T332570) (owner: ''Stevemunene)'
2023-08-23 08:14:18 <wikibugs> ('CR) ''Btullis: [C: ''+1] "Definitely worth a try." [deployment-charts] - ''https://gerrit.wikimedia.org/r/951518 (https://phabricator.wikimedia.org/T305874) (owner: ''Stevemunene)'
2023-08-23 08:15:42 <logmsgbot> !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
2023-08-23 08:16:34 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P51053 and previous config saved to /var/cache/conftool/dbconfig/20230823-081633-ladsgroup.json
2023-08-23 08:20:48 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1198 (T343718)', diff saved to https://phabricator.wikimedia.org/P51054 and previous config saved to /var/cache/conftool/dbconfig/20230823-082047-ladsgroup.json
2023-08-23 08:20:51 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance
2023-08-23 08:20:52 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 08:21:00 <wikibugs> ('CR) ''Jbond: [C: ''+2] ferm::service: make port optional so we can use port_range [puppet] - ''https://gerrit.wikimedia.org/r/951534 (owner: ''Jbond)'
2023-08-23 08:21:04 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1212.eqiad.wmnet with reason: Maintenance
2023-08-23 08:21:05 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
2023-08-23 08:21:10 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on clouddb[1013,1017,1021].eqiad.wmnet,db1154.eqiad.wmnet with reason: Maintenance
2023-08-23 08:21:16 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1212 (T343718)', diff saved to https://phabricator.wikimedia.org/P51055 and previous config saved to /var/cache/conftool/dbconfig/20230823-082116-ladsgroup.json
2023-08-23 08:21:20 <wikibugs> ('PS1) ''Dreamy Jazz: clienthints: Lower API max lag time to 5 minutes on group0 and 1 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/951833 (https://phabricator.wikimedia.org/T344797)'
2023-08-23 08:21:28 <logmsgbot> !log btullis@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka jumbo-eqiad cluster: Reboot kafka nodes
2023-08-23 08:24:02 <logmsgbot> !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host druid1011.eqiad.wmnet
2023-08-23 08:26:20 <wikibugs> ('CR) ''Stevemunene: [C: ''+2] datahub: set the oidc client authentication method [deployment-charts] - ''https://gerrit.wikimedia.org/r/951518 (https://phabricator.wikimedia.org/T305874) (owner: ''Stevemunene)'
2023-08-23 08:26:29 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P51056 and previous config saved to /var/cache/conftool/dbconfig/20230823-082628-ladsgroup.json
2023-08-23 08:26:57 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T343718)', diff saved to https://phabricator.wikimedia.org/P51057 and previous config saved to /var/cache/conftool/dbconfig/20230823-082657-ladsgroup.json
2023-08-23 08:27:01 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 08:27:11 <wikibugs> ('Merged) ''jenkins-bot: datahub: set the oidc client authentication method [deployment-charts] - ''https://gerrit.wikimedia.org/r/951518 (https://phabricator.wikimedia.org/T305874) (owner: ''Stevemunene)'
2023-08-23 08:27:36 <wikibugs> ('PS1) ''JMeybohm: deployment_server::helmfile: Iterate over clusters, not services [puppet] - ''https://gerrit.wikimedia.org/r/951835 (https://phabricator.wikimedia.org/T297417)'
2023-08-23 08:28:09 <icinga-wm> PROBLEM - Restbase root url on restbase1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/RESTBase
2023-08-23 08:28:27 <wikibugs> ('CR) ''Jbond: [C: ''+1] "i would have probably gone with profile::ferm::log but lgtm" [puppet] - ''https://gerrit.wikimedia.org/r/951828 (https://phabricator.wikimedia.org/T336497) (owner: ''Muehlenhoff)'
2023-08-23 08:28:44 <wikibugs> ('CR) ''Jbond: [C: ''+1] "lgtm" [puppet] - ''https://gerrit.wikimedia.org/r/951830 (https://phabricator.wikimedia.org/T336497) (owner: ''Muehlenhoff)'
2023-08-23 08:28:53 <logmsgbot> !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
2023-08-23 08:29:26 <logmsgbot> !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
2023-08-23 08:29:29 <logmsgbot> !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host druid1011.eqiad.wmnet
2023-08-23 08:29:52 <wikibugs> ('CR) ''JMeybohm: [V: ''+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42981/console" [puppet] - ''https://gerrit.wikimedia.org/r/951835 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 08:31:40 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149', diff saved to https://phabricator.wikimedia.org/P51058 and previous config saved to /var/cache/conftool/dbconfig/20230823-083140-ladsgroup.json
2023-08-23 08:32:12 <wikibugs> ('PS2) ''JMeybohm: deployment_server::helmfile: Iterate over clusters groups first [puppet] - ''https://gerrit.wikimedia.org/r/951835 (https://phabricator.wikimedia.org/T297417)'
2023-08-23 08:34:14 <wikibugs> ('CR) ''JMeybohm: [V: ''+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42982/console" [puppet] - ''https://gerrit.wikimedia.org/r/951835 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 08:34:35 <wikibugs> ('CR) ''Muehlenhoff: Make firewall logging conditional on ferm and rename the profile (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/951828 (https://phabricator.wikimedia.org/T336497) (owner: ''Muehlenhoff)'
2023-08-23 08:34:45 <icinga-wm> PROBLEM - SSH on restbase1027 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
2023-08-23 08:35:49 <vgutierrez> !log fetch HAProxy 2.6.15 on thirdparty/haproxy26 for bullseye (apt.wm.o) - T344047
2023-08-23 08:35:52 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 08:35:57 <wikibugs> 'ops-codfw, ''serviceops-radar, ''Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (''Jgiannelos) On the OSM sync side of things, it might be worth checking if the system catches up with the diffs (~1 week worth of diffs could be manageable). The idea is that we avoid havin...'
2023-08-23 08:36:09 <icinga-wm> RECOVERY - SSH on restbase1027 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
2023-08-23 08:39:07 <wikibugs> ('PS1) ''JMeybohm: deployment_server/kubernetes: Readd admin_services secrets [labs/private] - ''https://gerrit.wikimedia.org/r/951836 (https://phabricator.wikimedia.org/T297417)'
2023-08-23 08:41:35 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165', diff saved to https://phabricator.wikimedia.org/P51059 and previous config saved to /var/cache/conftool/dbconfig/20230823-084134-ladsgroup.json
2023-08-23 08:42:03 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P51060 and previous config saved to /var/cache/conftool/dbconfig/20230823-084203-ladsgroup.json
2023-08-23 08:43:15 <icinga-wm> PROBLEM - SSH on restbase1027 is CRITICAL: Server answer: https://wikitech.wikimedia.org/wiki/SSH/monitoring
2023-08-23 08:44:40 <wikibugs> ('PS1) ''Stevemunene: datahub:main chart version bump [deployment-charts] - ''https://gerrit.wikimedia.org/r/951839 (https://phabricator.wikimedia.org/T305874)'
2023-08-23 08:46:46 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2149 (T343718)', diff saved to https://phabricator.wikimedia.org/P51061 and previous config saved to /var/cache/conftool/dbconfig/20230823-084646-ladsgroup.json
2023-08-23 08:46:49 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance
2023-08-23 08:46:52 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 08:47:02 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2156.codfw.wmnet with reason: Maintenance
2023-08-23 08:47:03 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
2023-08-23 08:47:06 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on db2186.codfw.wmnet with reason: Maintenance
2023-08-23 08:47:12 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2156 (T343718)', diff saved to https://phabricator.wikimedia.org/P51062 and previous config saved to /var/cache/conftool/dbconfig/20230823-084711-ladsgroup.json
2023-08-23 08:47:37 <fabfur> !log run puppet agent on lvs5004 to clear alert
2023-08-23 08:47:39 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 08:49:58 <wikibugs> 'sre-alert-triage, ''Data-Platform-SRE: Alert: Low percentage of enriched events produced by mw_page_content_change_enrich in codfw - https://phabricator.wikimedia.org/T343318 (''BTullis) ''Open''Resolved a:''BTullis Thanks @gmodena'
2023-08-23 08:52:25 <icinga-wm> RECOVERY - Check systemd state on kubernetes2024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 08:53:35 <icinga-wm> RECOVERY - Check systemd state on kubernetes1026 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 08:54:33 <icinga-wm> RECOVERY - SSH on restbase1027 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring
2023-08-23 08:55:55 <wikibugs> ('PS6) ''Clément Goubert: k8s::proxy: Start kube-proxy after ferm [puppet] - ''https://gerrit.wikimedia.org/r/915461'
2023-08-23 08:56:41 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2165 (T344589)', diff saved to https://phabricator.wikimedia.org/P51063 and previous config saved to /var/cache/conftool/dbconfig/20230823-085640-ladsgroup.json
2023-08-23 08:56:47 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance
2023-08-23 08:57:00 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2166.codfw.wmnet with reason: Maintenance
2023-08-23 08:57:06 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2166 (T344589)', diff saved to https://phabricator.wikimedia.org/P51064 and previous config saved to /var/cache/conftool/dbconfig/20230823-085706-ladsgroup.json
2023-08-23 08:57:16 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212', diff saved to https://phabricator.wikimedia.org/P51065 and previous config saved to /var/cache/conftool/dbconfig/20230823-085715-ladsgroup.json
2023-08-23 08:57:29 <wikibugs> 'SRE, ''Data-Platform-SRE, ''User-MoritzMuehlenhoff: Configure the Hadoop MapReduce ports to use a fixed range - https://phabricator.wikimedia.org/T111433 (''BTullis)'
2023-08-23 08:58:22 <logmsgbot> !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp2042.codfw.wmnet} and A:cp
2023-08-23 08:59:13 <wikibugs> ('PS7) ''Clément Goubert: k8s::proxy: Start kube-proxy after ferm [puppet] - ''https://gerrit.wikimedia.org/r/915461'
2023-08-23 08:59:19 <vgutierrez> !log update to HAProxy 2.6.15 in cp2042 (upload) - T344047
2023-08-23 08:59:21 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 08:59:38 <logmsgbot> !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp2042.codfw.wmnet} and A:cp
2023-08-23 09:01:48 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T343718)', diff saved to https://phabricator.wikimedia.org/P51066 and previous config saved to /var/cache/conftool/dbconfig/20230823-090147-ladsgroup.json
2023-08-23 09:01:51 <icinga-wm> PROBLEM - SSH on restbase1027 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring
2023-08-23 09:01:55 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 09:02:27 <wikibugs> ('CR) ''Clément Goubert: [V: ''+1] "PCC SUCCESS (CORE_DIFF 12): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42984/console" [puppet] - ''https://gerrit.wikimedia.org/r/915461 (owner: ''Clément Goubert)'
2023-08-23 09:03:33 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T344589)', diff saved to https://phabricator.wikimedia.org/P51067 and previous config saved to /var/cache/conftool/dbconfig/20230823-090332-ladsgroup.json
2023-08-23 09:05:09 <logmsgbot> !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp2041.codfw.wmnet} and A:cp
2023-08-23 09:05:23 <vgutierrez> !log update to HAProxy 2.6.15 in cp2041 (text) - T344047
2023-08-23 09:05:26 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 09:06:03 <wikibugs> ('PS1) ''JMeybohm: deployment_server/helmfile: Write admin_services_secrets to files [puppet] - ''https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417)'
2023-08-23 09:06:57 <logmsgbot> !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp2041.codfw.wmnet} and A:cp
2023-08-23 09:07:51 <icinga-wm> RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes1026 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
2023-08-23 09:09:46 <wikibugs> ('CR) ''Btullis: [C: ''+1] datahub:main chart version bump [deployment-charts] - ''https://gerrit.wikimedia.org/r/951839 (https://phabricator.wikimedia.org/T305874) (owner: ''Stevemunene)'
2023-08-23 09:10:03 <icinga-wm> RECOVERY - Check whether ferm is active by checking the default input chain on kubernetes2024 is OK: OK ferm input default policy is set https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm
2023-08-23 09:11:17 <wikibugs> ('CR) ''Btullis: [C: ''+1] "Looks good to me." [deployment-charts] - ''https://gerrit.wikimedia.org/r/951551 (https://phabricator.wikimedia.org/T344614) (owner: ''Bking)'
2023-08-23 09:12:16 <wikibugs> ('PS2) ''JMeybohm: deployment_server/helmfile: Write admin_services_secrets to files [puppet] - ''https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417)'
2023-08-23 09:12:22 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1212 (T343718)', diff saved to https://phabricator.wikimedia.org/P51068 and previous config saved to /var/cache/conftool/dbconfig/20230823-091221-ladsgroup.json
2023-08-23 09:12:23 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance
2023-08-23 09:12:27 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 09:12:37 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1223.eqiad.wmnet with reason: Maintenance
2023-08-23 09:12:43 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db1223 (T343718)', diff saved to https://phabricator.wikimedia.org/P51069 and previous config saved to /var/cache/conftool/dbconfig/20230823-091242-ladsgroup.json
2023-08-23 09:13:05 <wikibugs> ('CR) ''Clément Goubert: [V: ''+1] "Bumping this because it happened again after the last round of kubelet restarts." [puppet] - ''https://gerrit.wikimedia.org/r/915461 (owner: ''Clément Goubert)'
2023-08-23 09:16:39 <wikibugs> ('CR) ''Stevemunene: [C: ''+2] datahub:main chart version bump [deployment-charts] - ''https://gerrit.wikimedia.org/r/951839 (https://phabricator.wikimedia.org/T305874) (owner: ''Stevemunene)'
2023-08-23 09:16:54 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P51070 and previous config saved to /var/cache/conftool/dbconfig/20230823-091653-ladsgroup.json
2023-08-23 09:17:24 <wikibugs> ('Merged) ''jenkins-bot: datahub:main chart version bump [deployment-charts] - ''https://gerrit.wikimedia.org/r/951839 (https://phabricator.wikimedia.org/T305874) (owner: ''Stevemunene)'
2023-08-23 09:18:22 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T343718)', diff saved to https://phabricator.wikimedia.org/P51071 and previous config saved to /var/cache/conftool/dbconfig/20230823-091821-ladsgroup.json
2023-08-23 09:18:26 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 09:18:37 <logmsgbot> !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
2023-08-23 09:18:39 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P51072 and previous config saved to /var/cache/conftool/dbconfig/20230823-091838-ladsgroup.json
2023-08-23 09:19:14 <wikibugs> ('CR) ''Btullis: C:bigtop::hadoop move net-topology.py to files. (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: ''Slyngshede)'
2023-08-23 09:20:06 <wikibugs> ('PS3) ''JMeybohm: deployment_server/helmfile: Write admin_services_secrets to files [puppet] - ''https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417)'
2023-08-23 09:20:35 <wikibugs> ('CR) ''Btullis: [C: ''+1] "Looks good, thanks." [puppet] - ''https://gerrit.wikimedia.org/r/945785 (owner: ''Muehlenhoff)'
2023-08-23 09:21:02 <wikibugs> ('CR) ''JMeybohm: [C: ''+1] k8s::proxy: Start kube-proxy after ferm [puppet] - ''https://gerrit.wikimedia.org/r/915461 (owner: ''Clément Goubert)'
2023-08-23 09:21:12 <logmsgbot> !log stevemunene@deploy1002 helmfile [staging] START helmfile.d/services/datahub: apply on main
2023-08-23 09:22:07 <wikibugs> ('CR) ''JMeybohm: [V: ''+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42987/console" [puppet] - ''https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 09:22:18 <wikibugs> ('CR) ''Jbond: [C: ''-1] "see inline still some issues with the exec" [puppet] - ''https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570) (owner: ''Dduvall)'
2023-08-23 09:24:40 <logmsgbot> !log stevemunene@deploy1002 helmfile [staging] DONE helmfile.d/services/datahub: sync on main
2023-08-23 09:24:43 <logmsgbot> !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host druid1009.eqiad.wmnet
2023-08-23 09:26:52 <wikibugs> ('PS21) ''Slyngshede: C:bigtop::hadoop move net-topology.py to files. [puppet] - ''https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480)'
2023-08-23 09:26:57 <wikibugs> ('CR) ''Slyngshede: C:bigtop::hadoop move net-topology.py to files. (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: ''Slyngshede)'
2023-08-23 09:27:06 <wikibugs> ('CR) ''CI reject: [V: ''-1] C:bigtop::hadoop move net-topology.py to files. [puppet] - ''https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: ''Slyngshede)'
2023-08-23 09:29:07 <wikibugs> ('CR) ''Btullis: C:bigtop::hadoop move net-topology.py to files. (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: ''Slyngshede)'
2023-08-23 09:31:32 <logmsgbot> !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host druid1009.eqiad.wmnet
2023-08-23 09:31:34 <logmsgbot> !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host druid1010.eqiad.wmnet
2023-08-23 09:32:00 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156', diff saved to https://phabricator.wikimedia.org/P51073 and previous config saved to /var/cache/conftool/dbconfig/20230823-093200-ladsgroup.json
2023-08-23 09:33:04 <wikibugs> ('CR) ''Clément Goubert: [V: ''+1 C: ''+2] k8s::proxy: Start kube-proxy after ferm [puppet] - ''https://gerrit.wikimedia.org/r/915461 (owner: ''Clément Goubert)'
2023-08-23 09:33:28 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P51074 and previous config saved to /var/cache/conftool/dbconfig/20230823-093327-ladsgroup.json
2023-08-23 09:33:45 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166', diff saved to https://phabricator.wikimedia.org/P51075 and previous config saved to /var/cache/conftool/dbconfig/20230823-093345-ladsgroup.json
2023-08-23 09:35:51 <wikibugs> ('CR) ''Jbond: [C: ''+1] Make firewall logging conditional on ferm and rename the profile (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/951828 (https://phabricator.wikimedia.org/T336497) (owner: ''Muehlenhoff)'
2023-08-23 09:37:25 <wikibugs> ('CR) ''Jbond: C:bigtop::hadoop move net-topology.py to files. (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: ''Slyngshede)'
2023-08-23 09:37:41 <wikibugs> ('CR) ''Jcrespo: "Let me run puppet compiler on bacula dir and on the hosts to make sure it is a noop (or mostly a noop)." [puppet] - ''https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: ''Slyngshede)'
2023-08-23 09:38:40 <logmsgbot> !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host druid1010.eqiad.wmnet
2023-08-23 09:40:43 <wikibugs> ('CR) ''Jbond: [C: ''+1] "lgtm" [puppet] - ''https://gerrit.wikimedia.org/r/942641 (https://phabricator.wikimedia.org/T271196) (owner: ''Slyngshede)'
2023-08-23 09:41:02 <wikibugs> ('PS1) ''Muehlenhoff: Make nftables::service types more compatible [puppet] - ''https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497)'
2023-08-23 09:41:22 <wikibugs> ('CR) ''CI reject: [V: ''-1] Make nftables::service types more compatible [puppet] - ''https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: ''Muehlenhoff)'
2023-08-23 09:42:16 <jinxer-wm> (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
2023-08-23 09:42:24 <wikibugs> 'ops-codfw, ''Content-Transform-Team, ''serviceops-radar, ''Maps (Maps-data): maps2009 is unreachable - https://phabricator.wikimedia.org/T344110 (''MSantos)'
2023-08-23 09:44:39 <wikibugs> ('PS1) ''Jbond: admin: add lwatson to ldap config [puppet] - ''https://gerrit.wikimedia.org/r/951890 (https://phabricator.wikimedia.org/T344772)'
2023-08-23 09:45:15 <wikibugs> ('CR) ''Jbond: [C: ''+2] admin: add lwatson to ldap config [puppet] - ''https://gerrit.wikimedia.org/r/951890 (https://phabricator.wikimedia.org/T344772) (owner: ''Jbond)'
2023-08-23 09:45:25 <wikibugs> ('CR) ''Clément Goubert: [C: ''+1] deployment_server::helmfile: Iterate over clusters groups first [puppet] - ''https://gerrit.wikimedia.org/r/951835 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 09:46:12 <wikibugs> ('CR) ''Clément Goubert: [C: ''+1] deployment_server/helmfile: Write admin_services_secrets to files [puppet] - ''https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 09:46:20 <wikibugs> ('PS2) ''Jbond: admin: add lwatson to ldap config [puppet] - ''https://gerrit.wikimedia.org/r/951890 (https://phabricator.wikimedia.org/T344772)'
2023-08-23 09:46:46 <wikibugs> ('CR) ''Jbond: [C: ''+2] admin: add lwatson to ldap config [puppet] - ''https://gerrit.wikimedia.org/r/951890 (https://phabricator.wikimedia.org/T344772) (owner: ''Jbond)'
2023-08-23 09:47:06 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2156 (T343718)', diff saved to https://phabricator.wikimedia.org/P51078 and previous config saved to /var/cache/conftool/dbconfig/20230823-094706-ladsgroup.json
2023-08-23 09:47:08 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance
2023-08-23 09:47:11 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 09:47:16 <jinxer-wm> (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
2023-08-23 09:47:21 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2177.codfw.wmnet with reason: Maintenance
2023-08-23 09:47:27 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2177 (T343718)', diff saved to https://phabricator.wikimedia.org/P51079 and previous config saved to /var/cache/conftool/dbconfig/20230823-094727-ladsgroup.json
2023-08-23 09:48:34 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223', diff saved to https://phabricator.wikimedia.org/P51081 and previous config saved to /var/cache/conftool/dbconfig/20230823-094834-ladsgroup.json
2023-08-23 09:48:51 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2166 (T344589)', diff saved to https://phabricator.wikimedia.org/P51082 and previous config saved to /var/cache/conftool/dbconfig/20230823-094851-ladsgroup.json
2023-08-23 09:48:57 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
2023-08-23 09:49:10 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2167.codfw.wmnet with reason: Maintenance
2023-08-23 09:49:17 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3311 (T344589)', diff saved to https://phabricator.wikimedia.org/P51083 and previous config saved to /var/cache/conftool/dbconfig/20230823-094916-ladsgroup.json
2023-08-23 09:49:54 <wikibugs> ('PS1) ''Jbond: admin: add vriley to ldap only [puppet] - ''https://gerrit.wikimedia.org/r/951891 (https://phabricator.wikimedia.org/T344770)'
2023-08-23 09:50:41 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2167:3318 (T344589)', diff saved to https://phabricator.wikimedia.org/P51084 and previous config saved to /var/cache/conftool/dbconfig/20230823-095040-ladsgroup.json
2023-08-23 09:52:53 <wikibugs> ('CR) ''Jbond: [C: ''+2] admin: add vriley to ldap only [puppet] - ''https://gerrit.wikimedia.org/r/951891 (https://phabricator.wikimedia.org/T344770) (owner: ''Jbond)'
2023-08-23 09:55:25 <wikibugs> 'SRE, ''LDAP-Access-Requests, ''Patch-For-Review: Grant Access to ldap/wmf for lwatson - https://phabricator.wikimedia.org/T344772 (''jbond) ''Open''Resolved @lwatson you are now part of the wmf group so should be able to access all the listed sites.'
2023-08-23 09:57:41 <logmsgbot> !log klausman@cumin1001 START - Cookbook sre.k8s.reboot-nodes rolling reboot on A:ml-serve-worker-codfw
2023-08-23 09:57:50 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T344589)', diff saved to https://phabricator.wikimedia.org/P51085 and previous config saved to /var/cache/conftool/dbconfig/20230823-095749-ladsgroup.json
2023-08-23 10:00:04 <jouncebot> Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1000)
2023-08-23 10:00:08 <wikibugs> ('CR) ''Alexandros Kosiaris: [C: ''-1] deployment_server/helmfile: Write admin_services_secrets to files (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 10:02:39 <wikibugs> ('PS11) ''Ilias Sarantopoulos: ores-extension: replace thresholds with numeric values [mediawiki-config] - ''https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308)'
2023-08-23 10:03:07 <wikibugs> ('PS4) ''JMeybohm: deployment_server/helmfile: Write admin_services_secrets to files [puppet] - ''https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417)'
2023-08-23 10:03:22 <wikibugs> 'SRE, ''LDAP-Access-Requests, ''Patch-For-Review: Grant Access to wmf for Valerie Riley - https://phabricator.wikimedia.org/T344770 (''jbond) ''Open''Resolved a:''jbond @VRiley-WMF you were already part of the WMF group so should have read-only access to netbox. For shell access please [[ https://w...'
2023-08-23 10:03:27 <wikibugs> ('CR) ''Gmodena: rdf-streaming-updater-dse-k8s: Add Zookeeper HA (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/951551 (https://phabricator.wikimedia.org/T344614) (owner: ''Bking)'
2023-08-23 10:03:40 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db1223 (T343718)', diff saved to https://phabricator.wikimedia.org/P51086 and previous config saved to /var/cache/conftool/dbconfig/20230823-100340-ladsgroup.json
2023-08-23 10:03:42 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
2023-08-23 10:03:45 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 10:03:55 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on dbstore1007.eqiad.wmnet with reason: Maintenance
2023-08-23 10:03:58 <wikibugs> ('PS12) ''Ilias Sarantopoulos: ores-extension: replace thresholds with numeric values [mediawiki-config] - ''https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308)'
2023-08-23 10:04:34 <wikibugs> 'SRE, ''AQS2.0, ''Cassandra, ''serviceops, ''Service-deployment-requests: AQS 2.0 differentially private pageviews deploy API - https://phabricator.wikimedia.org/T343855 (''JAllemandou) >>! In T343855#9111286, @Htriedman wrote: > 1. Ensure that (a) historical data is loaded into cassandra (currently thi...'
2023-08-23 10:04:42 <wikibugs> ('PS13) ''Ilias Sarantopoulos: ores-extension: replace thresholds with numeric values [mediawiki-config] - ''https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308)'
2023-08-23 10:05:08 <wikibugs> ('CR) ''JMeybohm: [V: ''+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42990/console" [puppet] - ''https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 10:05:43 <icinga-wm> PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 10:05:56 <wikibugs> ('PS2) ''Muehlenhoff: Make nftables::service types more compatible [puppet] - ''https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497)'
2023-08-23 10:06:25 <wikibugs> ('CR) ''Jelto: [C: ''+1] "lgtm, nice addition!" [puppet] - ''https://gerrit.wikimedia.org/r/951429 (https://phabricator.wikimedia.org/T344620) (owner: ''EoghanGaffney)'
2023-08-23 10:06:44 <wikibugs> ('CR) ''CI reject: [V: ''-1] Make nftables::service types more compatible [puppet] - ''https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: ''Muehlenhoff)'
2023-08-23 10:07:49 <wikibugs> ('PS1) ''Jbond: admin: offboard oleksandrtsyba-wmde and nosc [puppet] - ''https://gerrit.wikimedia.org/r/951893 (https://phabricator.wikimedia.org/T344766)'
2023-08-23 10:08:04 <wikibugs> ('CR) ''EoghanGaffney: [C: ''+2] gitlab: Add warning banner to replica instances [puppet] - ''https://gerrit.wikimedia.org/r/951429 (https://phabricator.wikimedia.org/T344620) (owner: ''EoghanGaffney)'
2023-08-23 10:09:01 <icinga-wm> PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 10:09:54 <fabfur> !log temporary depool/repool cp4040 for haproxy service restart
2023-08-23 10:09:57 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 10:10:59 <wikibugs> ('PS8) ''EoghanGaffney: gitlab: Add locking to backups [puppet] - ''https://gerrit.wikimedia.org/r/930182'
2023-08-23 10:12:56 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P51087 and previous config saved to /var/cache/conftool/dbconfig/20230823-101255-ladsgroup.json
2023-08-23 10:14:29 <vgutierrez> !log depool cp2039 to run some HAProxy experiments
2023-08-23 10:14:32 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 10:17:20 <wikibugs> ('CR) ''EoghanGaffney: [C: ''+2] gitlab: Add locking to backups [puppet] - ''https://gerrit.wikimedia.org/r/930182 (owner: ''EoghanGaffney)'
2023-08-23 10:22:34 <wikibugs> 'SRE, ''LDAP-Access-Requests, ''Patch-For-Review: Offboard Norman Schwirz, Oleksandr Tsyba from WMF systems - https://phabricator.wikimedia.org/T344766 (''jbond) @WMDE-leszek i have dropped the ldap permissions. Are you able to confirm the Phabricator accounts so i can also offboard them from here. thanks'
2023-08-23 10:23:59 <wikibugs> ('CR) ''Giuseppe Lavagetto: [C: ''+1] "You convinced me this is mostly transitory, so LGTM 😊" [deployment-charts] - ''https://gerrit.wikimedia.org/r/950187 (https://phabricator.wikimedia.org/T344177) (owner: ''JMeybohm)'
2023-08-23 10:24:09 <icinga-wm> PROBLEM - Host ml-serve2001 is DOWN: PING CRITICAL - Packet loss = 100%
2023-08-23 10:25:38 <jinxer-wm> (KubernetesCalicoDown) firing: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
2023-08-23 10:28:02 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311', diff saved to https://phabricator.wikimedia.org/P51088 and previous config saved to /var/cache/conftool/dbconfig/20230823-102801-ladsgroup.json
2023-08-23 10:29:13 <icinga-wm> RECOVERY - Host ml-serve2001 is UP: PING WARNING - Packet loss = 77%, RTA = 31.72 ms
2023-08-23 10:29:39 <icinga-wm> RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 10:29:39 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T343718)', diff saved to https://phabricator.wikimedia.org/P51089 and previous config saved to /var/cache/conftool/dbconfig/20230823-102939-ladsgroup.json
2023-08-23 10:29:52 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 10:30:23 <wikibugs> ('CR) ''Muehlenhoff: [C: ''+1] "Looks good" [puppet] - ''https://gerrit.wikimedia.org/r/951893 (https://phabricator.wikimedia.org/T344766) (owner: ''Jbond)'
2023-08-23 10:30:38 <jinxer-wm> (KubernetesCalicoDown) resolved: ml-serve2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
2023-08-23 10:31:44 <jinxer-wm> (JobUnavailable) firing: Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2023-08-23 10:34:34 <wikibugs> 'SRE, ''LDAP-Access-Requests, ''Patch-For-Review: Offboard Norman Schwirz, Oleksandr Tsyba from WMF systems - https://phabricator.wikimedia.org/T344766 (''WMDE-leszek) thanks @jbond . Phabricator accounts have been @WMDE_Norman and @oleksandr_tsyba_WMDE - both disabled already'
2023-08-23 10:37:50 <vgutierrez> !log repool cp2039
2023-08-23 10:37:53 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 10:39:19 <wikibugs> ('PS1) ''EoghanGaffney: gitlab: Fix paths for backup common functions [puppet] - ''https://gerrit.wikimedia.org/r/951896 (https://phabricator.wikimedia.org/T338332)'
2023-08-23 10:40:18 <vgutierrez> !log rolling upgrade to HAProxy 2.6.15 - T344047
2023-08-23 10:40:20 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 10:41:42 <wikibugs> ('CR) ''Muehlenhoff: [C: ''+2] Make firewall logging conditional on ferm and rename the profile [puppet] - ''https://gerrit.wikimedia.org/r/951828 (https://phabricator.wikimedia.org/T336497) (owner: ''Muehlenhoff)'
2023-08-23 10:43:08 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3311 (T344589)', diff saved to https://phabricator.wikimedia.org/P51090 and previous config saved to /var/cache/conftool/dbconfig/20230823-104308-ladsgroup.json
2023-08-23 10:44:48 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P51091 and previous config saved to /var/cache/conftool/dbconfig/20230823-104445-ladsgroup.json
2023-08-23 10:45:15 <icinga-wm> PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 10:45:53 <wikibugs> ('CR) ''Muehlenhoff: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/951830 (https://phabricator.wikimedia.org/T336497) (owner: ''Muehlenhoff)'
2023-08-23 10:46:21 <logmsgbot> !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_codfw and not P{cp2041.*} and not P{cp2039.*} and A:cp
2023-08-23 10:46:46 <logmsgbot> !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_codfw and not P{cp2042.*} and A:cp
2023-08-23 10:47:25 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T344589)', diff saved to https://phabricator.wikimedia.org/P51092 and previous config saved to /var/cache/conftool/dbconfig/20230823-104725-ladsgroup.json
2023-08-23 10:48:27 <icinga-wm> PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 10:48:33 <wikibugs> ('PS3) ''Giuseppe Lavagetto: termbox-test: call mw-api-int [deployment-charts] - ''https://gerrit.wikimedia.org/r/951043 (https://phabricator.wikimedia.org/T334064)'
2023-08-23 10:49:22 <wikibugs> ('PS1) ''Sergio Gimeno: GrowthExperiments: enable AddLink frontend 13th round of wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/951897 (https://phabricator.wikimedia.org/T308138)'
2023-08-23 10:49:51 <icinga-wm> RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 10:49:58 <wikibugs> ('CR) ''Giuseppe Lavagetto: [C: ''+2] termbox-test: call mw-api-int (''2 comments) [deployment-charts] - ''https://gerrit.wikimedia.org/r/951043 (https://phabricator.wikimedia.org/T334064) (owner: ''Giuseppe Lavagetto)'
2023-08-23 10:50:44 <wikibugs> ('Merged) ''jenkins-bot: termbox-test: call mw-api-int [deployment-charts] - ''https://gerrit.wikimedia.org/r/951043 (https://phabricator.wikimedia.org/T334064) (owner: ''Giuseppe Lavagetto)'
2023-08-23 10:53:03 <wikibugs> ('CR) ''Jcrespo: "One second, in the last moment I thought of an option that may be much easier for both of us, but I want to do some tests first!" [puppet] - ''https://gerrit.wikimedia.org/r/935408 (https://phabricator.wikimedia.org/T221083) (owner: ''Slyngshede)'
2023-08-23 10:54:18 <logmsgbot> !log oblivian@deploy1002 helmfile [staging] START helmfile.d/services/termbox: apply
2023-08-23 10:54:29 <logmsgbot> !log oblivian@deploy1002 helmfile [staging] DONE helmfile.d/services/termbox: apply
2023-08-23 10:58:07 <icinga-wm> PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 10:59:16 <wikibugs> ('CR) ''Sergio Gimeno: [C: ''-1] "Awaiting to inform communities, T308138#9112945" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/951897 (https://phabricator.wikimedia.org/T308138) (owner: ''Sergio Gimeno)'
2023-08-23 10:59:53 <icinga-wm> PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 10:59:55 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177', diff saved to https://phabricator.wikimedia.org/P51093 and previous config saved to /var/cache/conftool/dbconfig/20230823-105954-ladsgroup.json
2023-08-23 11:00:07 <logmsgbot> !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_codfw and not P{cp2041.*} and not P{cp2039.*} and A:cp
2023-08-23 11:00:47 <logmsgbot> !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqsin and A:cp
2023-08-23 11:01:59 <logmsgbot> !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_codfw and not P{cp2042.*} and A:cp
2023-08-23 11:02:32 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P51094 and previous config saved to /var/cache/conftool/dbconfig/20230823-110231-ladsgroup.json
2023-08-23 11:02:37 <logmsgbot> !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqsin and A:cp
2023-08-23 11:02:41 <icinga-wm> RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 11:03:37 <wikibugs> ('PS3) ''Hnowlan: service, conftool: add base configuration for geo-analytics [puppet] - ''https://gerrit.wikimedia.org/r/947864 (https://phabricator.wikimedia.org/T336400)'
2023-08-23 11:03:39 <wikibugs> ('PS2) ''Hnowlan: kubernetes: add users for media_analytics service, cassandra config [puppet] - ''https://gerrit.wikimedia.org/r/951547 (https://phabricator.wikimedia.org/T336380)'
2023-08-23 11:10:15 <wikibugs> ('CR) ''Jbond: [C: ''+2] admin: offboard oleksandrtsyba-wmde and nosc [puppet] - ''https://gerrit.wikimedia.org/r/951893 (https://phabricator.wikimedia.org/T344766) (owner: ''Jbond)'
2023-08-23 11:11:03 <icinga-wm> PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 11:11:47 <icinga-wm> PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 11:12:01 <wikibugs> ('PS3) ''Giuseppe Lavagetto: termbox: Switch to mw-api-int-async on k8s [deployment-charts] - ''https://gerrit.wikimedia.org/r/935397 (https://phabricator.wikimedia.org/T334064) (owner: ''Clément Goubert)'
2023-08-23 11:13:59 <wikibugs> ('CR) ''Alexandros Kosiaris: [C: ''+1] deployment_server/helmfile: Write admin_services_secrets to files [puppet] - ''https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 11:14:11 <icinga-wm> PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 11:14:26 <wikibugs> 'SRE, ''LDAP-Access-Requests, ''Patch-For-Review: Offboard Norman Schwirz, Oleksandr Tsyba from WMF systems - https://phabricator.wikimedia.org/T344766 (''jbond) ''Open''Resolved a:''jbond @WMDE-leszek Thanks looks like we are all done then. but please reopen if you see anything else that needs clea...'
2023-08-23 11:15:01 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2177 (T343718)', diff saved to https://phabricator.wikimedia.org/P51095 and previous config saved to /var/cache/conftool/dbconfig/20230823-111500-ladsgroup.json
2023-08-23 11:15:07 <stashbot> T343718: Drop old columns of externallinks - https://phabricator.wikimedia.org/T343718
2023-08-23 11:15:57 <logmsgbot> !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-codfw
2023-08-23 11:17:03 <logmsgbot> !log ayounsi@cumin1001 START - Cookbook sre.ganeti.makevm for new host atlas2001.wikimedia.org
2023-08-23 11:17:04 <logmsgbot> !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
2023-08-23 11:17:38 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318', diff saved to https://phabricator.wikimedia.org/P51096 and previous config saved to /var/cache/conftool/dbconfig/20230823-111737-ladsgroup.json
2023-08-23 11:18:12 <wikibugs> ('CR) ''Clément Goubert: [C: ''+1] termbox: Switch to mw-api-int-async on k8s [deployment-charts] - ''https://gerrit.wikimedia.org/r/935397 (https://phabricator.wikimedia.org/T334064) (owner: ''Clément Goubert)'
2023-08-23 11:18:35 <icinga-wm> RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 11:19:03 <logmsgbot> !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas2001.wikimedia.org - ayounsi@cumin1001"
2023-08-23 11:20:31 <icinga-wm> RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 11:21:02 <logmsgbot> !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add records for VM atlas2001.wikimedia.org - ayounsi@cumin1001"
2023-08-23 11:21:02 <logmsgbot> !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2023-08-23 11:21:02 <logmsgbot> !log ayounsi@cumin1001 START - Cookbook sre.dns.wipe-cache atlas2001.wikimedia.org on all recursors
2023-08-23 11:21:05 <logmsgbot> !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.wipe-cache (exit_code=0) atlas2001.wikimedia.org on all recursors
2023-08-23 11:23:53 <logmsgbot> !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-codfw
2023-08-23 11:24:44 <logmsgbot> !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor2002.codfw.wmnet
2023-08-23 11:24:47 <logmsgbot> !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas2001.wikimedia.org - ayounsi@cumin1001"
2023-08-23 11:24:49 <icinga-wm> PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 11:25:34 <logmsgbot> !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.ganeti.makevm: created new VM atlas2001.wikimedia.org - ayounsi@cumin1001"
2023-08-23 11:25:34 <logmsgbot> !log ayounsi@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host atlas2001.wikimedia.org
2023-08-23 11:25:42 <logmsgbot> !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-presto1002.eqiad.wmnet
2023-08-23 11:27:05 <icinga-wm> PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 11:27:10 <wikibugs> ('CR) ''Jelto: [C: ''+1] "lgtm" [puppet] - ''https://gerrit.wikimedia.org/r/951896 (https://phabricator.wikimedia.org/T338332) (owner: ''EoghanGaffney)'
2023-08-23 11:28:06 <logmsgbot> !log jmm@cumin2002 START - Cookbook sre.ldap.roll-restart-reboot-replica rolling reboot on A:ldap-replicas-eqiad
2023-08-23 11:28:28 <logmsgbot> !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor2002.codfw.wmnet
2023-08-23 11:29:15 <icinga-wm> RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 11:30:11 <wikibugs> ('CR) ''EoghanGaffney: [C: ''+2] gitlab: Fix paths for backup common functions [puppet] - ''https://gerrit.wikimedia.org/r/951896 (https://phabricator.wikimedia.org/T338332) (owner: ''EoghanGaffney)'
2023-08-23 11:30:15 <wikibugs> ('PS1) ''Hnowlan: service: add media-analytics service entry [puppet] - ''https://gerrit.wikimedia.org/r/951901 (https://phabricator.wikimedia.org/T336380)'
2023-08-23 11:30:16 <logmsgbot> !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqsin and A:cp
2023-08-23 11:31:03 <logmsgbot> !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqsin and A:cp
2023-08-23 11:31:33 <wikibugs> ('PS1) ''Jbond: netbox: add datacenter-ops group as a super user [puppet] - ''https://gerrit.wikimedia.org/r/951902 (https://phabricator.wikimedia.org/T341581)'
2023-08-23 11:31:35 <wikibugs> ('PS1) ''Jbond: idp: add datacenter-ops group to other services they should have access to [puppet] - ''https://gerrit.wikimedia.org/r/951903 (https://phabricator.wikimedia.org/T341581)'
2023-08-23 11:32:00 <logmsgbot> !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-presto1002.eqiad.wmnet
2023-08-23 11:32:44 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2167:3318 (T344589)', diff saved to https://phabricator.wikimedia.org/P51097 and previous config saved to /var/cache/conftool/dbconfig/20230823-113244-ladsgroup.json
2023-08-23 11:32:51 <logmsgbot> !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance
2023-08-23 11:32:59 <wikibugs> 'SRE-tools, ''Ganeti, ''Spicerack: cookbook sre.ganeti.makevm calls wrong netbox_ganeti_codfw_sync.service - https://phabricator.wikimedia.org/T344812 (''ayounsi)'
2023-08-23 11:33:05 <logmsgbot> !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db2181.codfw.wmnet with reason: Maintenance
2023-08-23 11:33:11 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depooling db2181 (T344589)', diff saved to https://phabricator.wikimedia.org/P51098 and previous config saved to /var/cache/conftool/dbconfig/20230823-113310-ladsgroup.json
2023-08-23 11:34:53 <wikibugs> 'SRE-tools, ''Ganeti, ''Infrastructure-Foundations: cookbook sre.ganeti.makevm fails when no group is set - https://phabricator.wikimedia.org/T344813 (''ayounsi)'
2023-08-23 11:35:03 <wikibugs> ('PS10) ''Jbond: wmcs: add wmcs-roots to roles where it is missing [puppet] - ''https://gerrit.wikimedia.org/r/923681 (https://phabricator.wikimedia.org/T337848)'
2023-08-23 11:35:10 <wikibugs> ('CR) ''JMeybohm: [V: ''+1 C: ''+2] deployment_server::helmfile: Iterate over clusters groups first [puppet] - ''https://gerrit.wikimedia.org/r/951835 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 11:35:13 <wikibugs> ('CR) ''JMeybohm: [V: ''+1 C: ''+2] deployment_server/helmfile: Write admin_services_secrets to files [puppet] - ''https://gerrit.wikimedia.org/r/951843 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 11:35:35 <logmsgbot> !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp
2023-08-23 11:35:38 <wikibugs> ('CR) ''JMeybohm: [V: ''+2 C: ''+2] deployment_server/kubernetes: Readd admin_services secrets [labs/private] - ''https://gerrit.wikimedia.org/r/951836 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 11:35:43 <logmsgbot> !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp
2023-08-23 11:36:16 <logmsgbot> !log jmm@cumin2002 END (PASS) - Cookbook sre.ldap.roll-restart-reboot-replica (exit_code=0) rolling reboot on A:ldap-replicas-eqiad
2023-08-23 11:36:25 <wikibugs> ('PS3) ''Muehlenhoff: Make nftables::service types more compatible [puppet] - ''https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497)'
2023-08-23 11:36:43 <wikibugs> ('PS11) ''Jbond: wmcs: add wmcs-roots to roles where it is missing [puppet] - ''https://gerrit.wikimedia.org/r/923681 (https://phabricator.wikimedia.org/T337848)'
2023-08-23 11:37:14 <wikibugs> ('CR) ''CI reject: [V: ''-1] Make nftables::service types more compatible [puppet] - ''https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: ''Muehlenhoff)'
2023-08-23 11:37:58 <logmsgbot> !log btullis@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid public cluster: Reboot Druid nodes
2023-08-23 11:38:52 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''Puppet-Core, ''Patch-For-Review, ''User-jbond: puppet fact: migrate away from the uniqueid fact - https://phabricator.wikimedia.org/T221083 (''jcrespo) Hey, I just reached this ticket by accident. Could you refer to me the documentation where there was consensus and...'
2023-08-23 11:38:55 <icinga-wm> PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 11:39:22 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T344589)', diff saved to https://phabricator.wikimedia.org/P51099 and previous config saved to /var/cache/conftool/dbconfig/20230823-113921-ladsgroup.json
2023-08-23 11:39:23 <wikibugs> ('CR) ''Jbond: "@Amir, could i get a +1 from you specifically in relation to the comments starting from https://phabricator.wikimedia.org/T344599#9106167" [puppet] - ''https://gerrit.wikimedia.org/r/923681 (https://phabricator.wikimedia.org/T337848) (owner: ''Jbond)'
2023-08-23 11:39:44 <wikibugs> 'SRE, ''MW-on-K8s, ''Traffic, ''serviceops, ''Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (''Clement_Goubert)'
2023-08-23 11:40:59 <icinga-wm> PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 11:41:23 <wikibugs> ('CR) ''Jbond: [C: ''+2] httpyaml: replace URI.escape [puppet] - ''https://gerrit.wikimedia.org/r/919291 (https://phabricator.wikimedia.org/T330490) (owner: ''Jbond)'
2023-08-23 11:41:58 <logmsgbot> !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_ulsfo and A:cp
2023-08-23 11:41:59 <icinga-wm> PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv6: Connect - kubernetes-ml-codfw, AS64607/IPv4: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 11:42:46 <wikibugs> ('PS4) ''Muehlenhoff: Make nftables::service types more compatible [puppet] - ''https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497)'
2023-08-23 11:43:27 <icinga-wm> RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 11:44:00 <wikibugs> 'SRE-tools, ''Ganeti, ''Infrastructure-Foundations, ''User-MoritzMuehlenhoff: cookbook sre.ganeti.makevm fails when no group is set - https://phabricator.wikimedia.org/T344813 (''MoritzMuehlenhoff)'
2023-08-23 11:44:19 <wikibugs> 'SRE-tools, ''Ganeti, ''Infrastructure-Foundations, ''Spicerack, ''User-MoritzMuehlenhoff: cookbook sre.ganeti.makevm calls wrong netbox_ganeti_codfw_sync.service - https://phabricator.wikimedia.org/T344812 (''MoritzMuehlenhoff)'
2023-08-23 11:46:18 <wikibugs> ('PS1) ''JMeybohm: deployment_server/helmfile: Don't define admin_service_dir twice [puppet] - ''https://gerrit.wikimedia.org/r/951907 (https://phabricator.wikimedia.org/T297417)'
2023-08-23 11:46:55 <wikibugs> 'SRE, ''MW-on-K8s, ''serviceops: mw-on-k8s tls-proxy container CPU throttling at low average load - https://phabricator.wikimedia.org/T344814 (''Clement_Goubert)'
2023-08-23 11:47:11 <wikibugs> ('PS2) ''Muehlenhoff: Convert the monitoring/prometheus ferm rules to a firewall::service [puppet] - ''https://gerrit.wikimedia.org/r/951830 (https://phabricator.wikimedia.org/T336497)'
2023-08-23 11:48:31 <wikibugs> 'SRE, ''MW-on-K8s, ''Traffic, ''serviceops, ''Release-Engineering-Team (Seen): Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (''Clement_Goubert)'
2023-08-23 11:48:49 <wikibugs> ('PS1) ''JMeybohm: Add cfssl-issuer admin secrets to ml-serve [labs/private] - ''https://gerrit.wikimedia.org/r/951908 (https://phabricator.wikimedia.org/T297417)'
2023-08-23 11:49:00 <wikibugs> ('CR) ''Muehlenhoff: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: ''Muehlenhoff)'
2023-08-23 11:49:47 <icinga-wm> RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 11:51:01 <logmsgbot> !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host debmonitor1002.eqiad.wmnet
2023-08-23 11:51:04 <logmsgbot> !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka jumbo-eqiad cluster: Reboot kafka nodes
2023-08-23 11:51:30 <wikibugs> ('PS2) ''JMeybohm: deployment_server/helmfile: Don't define admin_service_dir twice [puppet] - ''https://gerrit.wikimedia.org/r/951907 (https://phabricator.wikimedia.org/T297417)'
2023-08-23 11:53:03 <wikibugs> ('CR) ''JMeybohm: [V: ''+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42993/console" [puppet] - ''https://gerrit.wikimedia.org/r/951907 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 11:54:16 <wikibugs> ('CR) ''JMeybohm: [V: ''+1 C: ''+2] deployment_server/helmfile: Don't define admin_service_dir twice [puppet] - ''https://gerrit.wikimedia.org/r/951907 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 11:54:28 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P51100 and previous config saved to /var/cache/conftool/dbconfig/20230823-115427-ladsgroup.json
2023-08-23 11:54:48 <logmsgbot> !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host debmonitor1002.eqiad.wmnet
2023-08-23 11:54:59 <wikibugs> ('CR) ''JMeybohm: [V: ''+2 C: ''+2] Add cfssl-issuer admin secrets to ml-serve [labs/private] - ''https://gerrit.wikimedia.org/r/951908 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 11:58:07 <icinga-wm> PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 11:59:58 <wikibugs> ('CR) ''Muehlenhoff: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/951830 (https://phabricator.wikimedia.org/T336497) (owner: ''Muehlenhoff)'
2023-08-23 11:59:59 <logmsgbot> !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_drmrs and A:cp
2023-08-23 12:00:59 <logmsgbot> !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_drmrs and A:cp
2023-08-23 12:01:05 <icinga-wm> PROBLEM - BGP status on cr1-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Active - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 12:02:02 <wikibugs> ('PS1) ''PipelineBot: citoid: pipeline bot promote [deployment-charts] - ''https://gerrit.wikimedia.org/r/951867'
2023-08-23 12:02:31 <icinga-wm> RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 115, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 12:03:34 <wikibugs> ('CR) ''Gmodena: [C: ''+2] mw-page-content-change-enrich: stream version bump [deployment-charts] - ''https://gerrit.wikimedia.org/r/951446 (https://phabricator.wikimedia.org/T307959) (owner: ''Gmodena)'
2023-08-23 12:03:56 <logmsgbot> !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling reboot on A:schema-codfw
2023-08-23 12:04:39 <wikibugs> ('Merged) ''jenkins-bot: mw-page-content-change-enrich: stream version bump [deployment-charts] - ''https://gerrit.wikimedia.org/r/951446 (https://phabricator.wikimedia.org/T307959) (owner: ''Gmodena)'
2023-08-23 12:05:43 <icinga-wm> PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 12:09:34 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181', diff saved to https://phabricator.wikimedia.org/P51101 and previous config saved to /var/cache/conftool/dbconfig/20230823-120933-ladsgroup.json
2023-08-23 12:11:47 <logmsgbot> !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/mw-page-content-change-enrich: apply
2023-08-23 12:11:51 <logmsgbot> !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/mw-page-content-change-enrich: apply
2023-08-23 12:12:01 <logmsgbot> !log btullis@cumin1001 START - Cookbook sre.kafka.reboot-workers for Kafka test-eqiad cluster: Reboot kafka nodes
2023-08-23 12:12:33 <icinga-wm> PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64607/IPv4: Connect - kubernetes-ml-codfw, AS64607/IPv6: Connect - kubernetes-ml-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
2023-08-23 12:14:17 <logmsgbot> !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling reboot on A:schema-codfw
2023-08-23 12:17:51 <logmsgbot> !log klausman@cumin1001 END (PASS) - Cookbook sre.k8s.reboot-nodes (exit_code=0) rolling reboot on A:ml-serve-worker-codfw
2023-08-23 12:19:01 <wikibugs> ('CR) ''Jelto: [V: ''+1] "PCC SUCCESS (CORE_DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42995/console" [puppet] - ''https://gerrit.wikimedia.org/r/951484 (https://phabricator.wikimedia.org/T337474) (owner: ''Jaime Nuche)'
2023-08-23 12:19:06 <logmsgbot> !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_eqiad and A:cp
2023-08-23 12:19:21 <logmsgbot> !log fabfur@cumin1001 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_eqiad and A:cp
2023-08-23 12:20:17 <icinga-wm> RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 12:21:13 <wikibugs> ('CR) ''Jelto: [V: ''+1 C: ''+1] "lgtm" [puppet] - ''https://gerrit.wikimedia.org/r/951484 (https://phabricator.wikimedia.org/T337474) (owner: ''Jaime Nuche)'
2023-08-23 12:22:57 <wikibugs> 'SRE, ''Infrastructure-Foundations, ''Puppet-Core, ''Patch-For-Review, ''User-jbond: puppet fact: migrate away from the uniqueid fact - https://phabricator.wikimedia.org/T221083 (''jbond) p:''Medium''Low >>! In T221083#9113102, @jcrespo wrote: > Hey, I just reached this ticket by accident. you ha...'
2023-08-23 12:23:43 <wikibugs> ('CR) ''Muehlenhoff: idp: add datacenter-ops group to other services they should have access to (''2 comments) [puppet] - ''https://gerrit.wikimedia.org/r/951903 (https://phabricator.wikimedia.org/T341581) (owner: ''Jbond)'
2023-08-23 12:24:40 <logmsgbot> !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Repooling after maintenance db2181 (T344589)', diff saved to https://phabricator.wikimedia.org/P51102 and previous config saved to /var/cache/conftool/dbconfig/20230823-122440-ladsgroup.json
2023-08-23 12:25:43 <wikibugs> ('PS1) ''Zoranzoki21: [pawiki] Enable the SandboxLink extension [mediawiki-config] - ''https://gerrit.wikimedia.org/r/951913 (https://phabricator.wikimedia.org/T344815)'
2023-08-23 12:25:45 <wikibugs> ('CR) ''Jbond: [C: ''+1] "lgtm" [puppet] - ''https://gerrit.wikimedia.org/r/951830 (https://phabricator.wikimedia.org/T336497) (owner: ''Muehlenhoff)'
2023-08-23 12:26:33 <logmsgbot> !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/mw-page-content-change-enrich: apply
2023-08-23 12:26:39 <logmsgbot> !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/mw-page-content-change-enrich: apply
2023-08-23 12:26:56 <wikibugs> ('CR) ''Muehlenhoff: Convert the monitoring/prometheus ferm rules to a firewall::service (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/951830 (https://phabricator.wikimedia.org/T336497) (owner: ''Muehlenhoff)'
2023-08-23 12:27:52 <wikibugs> ('PS1) ''Zoranzoki21: [enwiktionary] Remove the Index and Index_talk namespaces [mediawiki-config] - ''https://gerrit.wikimedia.org/r/951914 (https://phabricator.wikimedia.org/T344816)'
2023-08-23 12:27:55 <wikibugs> ('CR) ''Btullis: [C: ''+1] "Thanks for this. I'm happy with this, once the CI issue is fixed." [puppet] - ''https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: ''Slyngshede)'
2023-08-23 12:28:11 <wikibugs> ('PS2) ''Zoranzoki21: [enwiktionary] Remove the Index and Index_talk namespaces [mediawiki-config] - ''https://gerrit.wikimedia.org/r/951914 (https://phabricator.wikimedia.org/T344816)'
2023-08-23 12:29:25 <logmsgbot> !log jmm@cumin2002 START - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas rolling reboot on A:schema-eqiad
2023-08-23 12:31:09 <wikibugs> ('CR) ''Jbond: "cheers will update" [puppet] - ''https://gerrit.wikimedia.org/r/951903 (https://phabricator.wikimedia.org/T341581) (owner: ''Jbond)'
2023-08-23 12:31:39 <wikibugs> ('PS1) ''JMeybohm: admin_ng: Include admin service secrets [deployment-charts] - ''https://gerrit.wikimedia.org/r/951915 (https://phabricator.wikimedia.org/T297417)'
2023-08-23 12:31:58 <wikibugs> ('PS1) ''Muehlenhoff: Update cookbook header to reflect the fact that we also support VMs these days [cookbooks] - ''https://gerrit.wikimedia.org/r/951916'
2023-08-23 12:32:39 <logmsgbot> !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host testvm2004.codfw.wmnet with OS bookworm
2023-08-23 12:32:48 <wikibugs> ('PS2) ''JMeybohm: admin_ng: Include admin service secrets [deployment-charts] - ''https://gerrit.wikimedia.org/r/951915 (https://phabricator.wikimedia.org/T297417)'
2023-08-23 12:32:48 <logmsgbot> !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host stat1010.eqiad.wmnet
2023-08-23 12:34:02 <logmsgbot> !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
2023-08-23 12:34:06 <logmsgbot> !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
2023-08-23 12:34:19 <logmsgbot> !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/mw-page-content-change-enrich: apply
2023-08-23 12:34:21 <logmsgbot> !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mw-page-content-change-enrich: apply
2023-08-23 12:35:14 <wikibugs> ('CR) ''Muehlenhoff: idp: add datacenter-ops group to other services they should have access to (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/951903 (https://phabricator.wikimedia.org/T341581) (owner: ''Jbond)'
2023-08-23 12:37:46 <wikibugs> ('PS2) ''Jbond: netbox: add datacenter-ops group as a super user [puppet] - ''https://gerrit.wikimedia.org/r/951902 (https://phabricator.wikimedia.org/T341581)'
2023-08-23 12:37:48 <wikibugs> ('PS2) ''Jbond: idp: add datacenter-ops to puppetboard [puppet] - ''https://gerrit.wikimedia.org/r/951903 (https://phabricator.wikimedia.org/T341581)'
2023-08-23 12:37:55 <wikibugs> ('PS1) ''Jbond: idp: drop superfluous permissions [puppet] - ''https://gerrit.wikimedia.org/r/951917 (https://phabricator.wikimedia.org/T341581)'
2023-08-23 12:38:54 <logmsgbot> !log jmm@cumin2002 END (PASS) - Cookbook sre.misc-clusters.roll-restart-reboot-eventschemas (exit_code=0) rolling reboot on A:schema-eqiad
2023-08-23 12:40:30 <logmsgbot> !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host stat1010.eqiad.wmnet
2023-08-23 12:42:55 <logmsgbot> !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1001.eqiad.wmnet
2023-08-23 12:47:06 <logmsgbot> !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid public cluster: Reboot Druid nodes
2023-08-23 12:48:06 <jelto> !log update jwt-authorizer package to v1.2.0
2023-08-23 12:48:08 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 12:48:15 <wikibugs> ('CR) ''Jbond: "thanks see inline, perhaps its best to move this discussion back to the task?" [puppet] - ''https://gerrit.wikimedia.org/r/951903 (https://phabricator.wikimedia.org/T341581) (owner: ''Jbond)'
2023-08-23 12:48:39 <jelto> !log update jwt-authorizer package to v1.2.0 - T337474
2023-08-23 12:48:42 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 12:48:43 <stashbot> T337474: Replace deprecated `CI_JOB_JWT` CI variable in Kokkuri - https://phabricator.wikimedia.org/T337474
2023-08-23 12:48:57 <wikibugs> ('CR) ''Jbond: [C: ''+1] "lgtm" [cookbooks] - ''https://gerrit.wikimedia.org/r/951916 (owner: ''Muehlenhoff)'
2023-08-23 12:49:08 <logmsgbot> !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1001.eqiad.wmnet
2023-08-23 12:49:25 <logmsgbot> !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_ulsfo and A:cp
2023-08-23 12:49:30 <logmsgbot> !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_ulsfo and A:cp
2023-08-23 12:54:50 <wikibugs> ('CR) ''Jelto: [V: ''+1 C: ''+2] jwt_authorizer: reflect changes to accept multiple issuers [puppet] - ''https://gerrit.wikimedia.org/r/951484 (https://phabricator.wikimedia.org/T337474) (owner: ''Jaime Nuche)'
2023-08-23 12:55:55 <wikibugs> ('PS1) ''Effie Mouzeli: Revert "tegola-vector-tiles: use tegola image with debug enabled on codfw" [deployment-charts] - ''https://gerrit.wikimedia.org/r/951850'
2023-08-23 12:56:09 <wikibugs> ('PS2) ''Effie Mouzeli: Revert "tegola-vector-tiles: use tegola image with debug enabled on codfw" [deployment-charts] - ''https://gerrit.wikimedia.org/r/951850'
2023-08-23 12:56:35 <jelto> !log registry* - upgrade jwt-authorizer package on all 4 hosts to version 1.2.0-1 - T337474
2023-08-23 12:56:39 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 12:56:41 <stashbot> T337474: Replace deprecated `CI_JOB_JWT` CI variable in Kokkuri - https://phabricator.wikimedia.org/T337474
2023-08-23 12:58:05 <wikibugs> ('CR) ''Muehlenhoff: [C: ''+2] Update cookbook header to reflect the fact that we also support VMs these days [cookbooks] - ''https://gerrit.wikimedia.org/r/951916 (owner: ''Muehlenhoff)'
2023-08-23 12:58:07 <logmsgbot> !log gmodena@deploy1002 helmfile [staging] START helmfile.d/services/eventstreams: apply
2023-08-23 12:58:17 <logmsgbot> !log gmodena@deploy1002 helmfile [staging] DONE helmfile.d/services/eventstreams: apply
2023-08-23 13:00:05 <jouncebot> RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: How many deployers does it take to do UTC afternoon backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1300).
2023-08-23 13:00:05 <jouncebot> Dreamy_Jazz and kizule: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2023-08-23 13:00:11 <Dreamy_Jazz> \o
2023-08-23 13:00:13 <Lucas_WMDE> o/
2023-08-23 13:00:18 <Kizule> \o
2023-08-23 13:00:24 <Lucas_WMDE> I can deploy!
2023-08-23 13:00:34 <wikibugs> ('CR) ''Muehlenhoff: [C: ''+1] "Looks good! We need the same for idp_test.yaml as well, BTW." [puppet] - ''https://gerrit.wikimedia.org/r/951917 (https://phabricator.wikimedia.org/T341581) (owner: ''Jbond)'
2023-08-23 13:01:00 <logmsgbot> !log gmodena@deploy1002 helmfile [codfw] START helmfile.d/services/eventstreams: apply
2023-08-23 13:01:10 <Lucas_WMDE> hm, logspam-watch on mwlog1002 isn’t coming up yet
2023-08-23 13:01:15 <Lucas_WMDE> ah there it is, nevermind
2023-08-23 13:01:25 <logmsgbot> !log gmodena@deploy1002 helmfile [codfw] DONE helmfile.d/services/eventstreams: apply
2023-08-23 13:01:58 <wikibugs> ('CR) ''Jgiannelos: [C: ''+1] Revert "tegola-vector-tiles: use tegola image with debug enabled on codfw" [deployment-charts] - ''https://gerrit.wikimedia.org/r/951850 (owner: ''Effie Mouzeli)'
2023-08-23 13:02:45 <wikibugs> ('CR) ''Lucas Werkmeister (WMDE): [C: ''+2] "kick off gate-and-submit while I go deploy some config changes first" [extensions/CheckUser] (wmf/1.41.0-wmf.23) - ''https://gerrit.wikimedia.org/r/951847 (https://phabricator.wikimedia.org/T344787) (owner: ''Dreamy Jazz)'
2023-08-23 13:03:04 <logmsgbot> !log gmodena@deploy1002 helmfile [eqiad] START helmfile.d/services/eventstreams: apply
2023-08-23 13:03:16 <jinxer-wm> (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
2023-08-23 13:03:29 <logmsgbot> !log gmodena@deploy1002 helmfile [eqiad] DONE helmfile.d/services/eventstreams: apply
2023-08-23 13:03:29 <wikibugs> ('CR) ''TrainBranchBot: [C: ''+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/951913 (https://phabricator.wikimedia.org/T344815) (owner: ''Zoranzoki21)'
2023-08-23 13:04:12 <wikibugs> ('Merged) ''jenkins-bot: [pawiki] Enable the SandboxLink extension [mediawiki-config] - ''https://gerrit.wikimedia.org/r/951913 (https://phabricator.wikimedia.org/T344815) (owner: ''Zoranzoki21)'
2023-08-23 13:05:03 <logmsgbot> !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:951913|[pawiki] Enable the SandboxLink extension (T344815)]]
2023-08-23 13:05:10 <stashbot> T344815: Install SandboxLink Extension in Pawiki - https://phabricator.wikimedia.org/T344815
2023-08-23 13:06:43 <logmsgbot> !log lucaswerkmeister-wmde@deploy1002 zoranzoki21 and lucaswerkmeister-wmde: Backport for [[gerrit:951913|[pawiki] Enable the SandboxLink extension (T344815)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
2023-08-23 13:06:56 <Lucas_WMDE> Kizule: please test :)
2023-08-23 13:07:50 <Lucas_WMDE> seems to work for me, though someone™ should probably translate the word “Sandbox” soon™
2023-08-23 13:08:08 <Kizule> testing
2023-08-23 13:08:16 <jinxer-wm> (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-site=eqiad&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded
2023-08-23 13:08:42 <logmsgbot> !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_ulsfo and A:cp
2023-08-23 13:08:57 <logmsgbot> !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_ulsfo and A:cp
2023-08-23 13:09:41 <Kizule> Lucas_WMDE: Is it already deployed out of mwdebug?
2023-08-23 13:09:45 <Lucas_WMDE> only mwdebug
2023-08-23 13:09:50 <Lucas_WMDE> waiting for confirmation before syncing
2023-08-23 13:10:18 <Lucas_WMDE> (my own test was just for curiosity, I don’t consider that sufficient for deploying unless you suddenly vanish or something ^^)
2023-08-23 13:10:31 <Kizule> I'm confused because there is link to sandbox right after link to talk page, out of mwdebug.
2023-08-23 13:10:39 <Lucas_WMDE> did you try Ctrl+F5?
2023-08-23 13:10:47 <Lucas_WMDE> for me a normal F5 sometimes didn’t trigger the change (in either direction)
2023-08-23 13:11:15 <Kizule> Ohhh.. They have added it manually, that was confusing me. Now I see link, yeah, this is good to go.
2023-08-23 13:11:21 <logmsgbot> !log lucaswerkmeister-wmde@deploy1002 zoranzoki21 and lucaswerkmeister-wmde: Continuing with sync
2023-08-23 13:11:21 <wikibugs> 'SRE, ''LDAP-Access-Requests: Grant Access to ldap/wmf for lwatson - https://phabricator.wikimedia.org/T344772 (''lwatson) Great, thanks! @jbond'
2023-08-23 13:11:28 <Lucas_WMDE> ah ok
2023-08-23 13:11:41 <Lucas_WMDE> so there’s a manually added link with correct translation, and the untranslated one is new?
2023-08-23 13:11:47 <Kizule> Yes
2023-08-23 13:11:55 <Lucas_WMDE> ah ok
2023-08-23 13:12:05 <Lucas_WMDE> I couldn’t distinguish the translated one from any other link ;)
2023-08-23 13:12:12 <Lucas_WMDE> (well, I suppose I could hover it and see which link goes to my user page. whatever)
2023-08-23 13:12:18 <Lucas_WMDE> syncing now
2023-08-23 13:13:12 <Kizule> Okay, thanks!
2023-08-23 13:13:37 <Lucas_WMDE> and then I’ll do one of Dreamy_Jazz’ backports before continuing with your other config change fyi
2023-08-23 13:13:44 <Lucas_WMDE> checks on enwiktionary in the meantime
2023-08-23 13:14:42 <wikibugs> ('CR) ''Lucas Werkmeister (WMDE): [C: ''+1] "Can confirm there are no pages in it: https://en.wiktionary.org/wiki/Special:AllPages?namespace=104, https://en.wiktionary.org/wiki/Specia" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/951914 (https://phabricator.wikimedia.org/T344816) (owner: ''Zoranzoki21)'
2023-08-23 13:16:37 <logmsgbot> !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host dse-k8s-ctrl1002.eqiad.wmnet
2023-08-23 13:16:38 <wikibugs> ('Merged) ''jenkins-bot: clienthints: Remove duplicate entries when converting to DB rows [extensions/CheckUser] (wmf/1.41.0-wmf.23) - ''https://gerrit.wikimedia.org/r/951847 (https://phabricator.wikimedia.org/T344787) (owner: ''Dreamy Jazz)'
2023-08-23 13:16:49 <logmsgbot> !log btullis@cumin1001 START - Cookbook sre.druid.reboot-workers for Druid analytics cluster: Reboot Druid nodes
2023-08-23 13:17:04 <Lucas_WMDE> dangit, the backport merged just a moment before I could `scap backport` it ^^
2023-08-23 13:17:06 <Lucas_WMDE> ah well
2023-08-23 13:17:10 <logmsgbot> !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:951913|[pawiki] Enable the SandboxLink extension (T344815)]] (duration: 12m 06s)
2023-08-23 13:17:14 <stashbot> T344815: Install SandboxLink Extension in Pawiki - https://phabricator.wikimedia.org/T344815
2023-08-23 13:17:34 <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (POST serviceaccounts) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-08-23 13:17:42 <logmsgbot> !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:951847|clienthints: Remove duplicate entries when converting to DB rows (T344787)]]
2023-08-23 13:17:47 <stashbot> T344787: [{reqId}] {exception_url} Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry 'X-X-X' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\UserAgentClientHintsManager::insertMappingRows - https://phabricator.wikimedia.org/T344787
2023-08-23 13:18:00 <Lucas_WMDE> Dreamy_Jazz: is the “remove duplicate entries” change testable on mwdebug?
2023-08-23 13:18:07 <Lucas_WMDE> (not deployed yet, asking in advance)
2023-08-23 13:18:09 <Dreamy_Jazz> Yes
2023-08-23 13:18:15 <Lucas_WMDE> ok
2023-08-23 13:18:29 <wikibugs> ('PS1) ''Herron: thanos-fe: switch to cfssl [puppet] - ''https://gerrit.wikimedia.org/r/951851 (https://phabricator.wikimedia.org/T343987)'
2023-08-23 13:18:36 <wikibugs> ('PS1) ''Papaul: Add new kubernetes node to site.pp [puppet] - ''https://gerrit.wikimedia.org/r/951921 (https://phabricator.wikimedia.org/T342534)'
2023-08-23 13:18:46 <Dreamy_Jazz> Does take a little while to test (requires some multi-browser editing), but shouldn't take more than a few minutes
2023-08-23 13:18:55 <Lucas_WMDE> ok cool
2023-08-23 13:19:17 <logmsgbot> !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and dreamyjazz: Backport for [[gerrit:951847|clienthints: Remove duplicate entries when converting to DB rows (T344787)]] synced to the testservers mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
2023-08-23 13:19:24 <Dreamy_Jazz> Testing...
2023-08-23 13:19:24 <Lucas_WMDE> then please test now :)
2023-08-23 13:19:27 <Lucas_WMDE> grabs a cup of tea
2023-08-23 13:21:37 <wikibugs> ('CR) ''Herron: [C: ''+2] thanos-fe: switch to cfssl [puppet] - ''https://gerrit.wikimedia.org/r/951851 (https://phabricator.wikimedia.org/T343987) (owner: ''Herron)'
2023-08-23 13:22:34 <jinxer-wm> (KubernetesAPILatency) resolved: High Kubernetes API latency (POST serviceaccounts) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-08-23 13:22:49 <wikibugs> ('CR) ''Effie Mouzeli: [C: ''+1] thanos-fe: switch to cfssl [puppet] - ''https://gerrit.wikimedia.org/r/951851 (https://phabricator.wikimedia.org/T343987) (owner: ''Herron)'
2023-08-23 13:23:32 <logmsgbot> !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host dse-k8s-ctrl1002.eqiad.wmnet
2023-08-23 13:23:39 <Dreamy_Jazz> Hmm. Testing works, but the issue doesn't appear when I'm not using mwdebug
2023-08-23 13:23:46 <Lucas_WMDE> hm
2023-08-23 13:24:36 <Dreamy_Jazz> Let me try on enwiki
2023-08-23 13:24:45 <icinga-wm> RECOVERY - Check systemd state on maps1009 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 13:26:05 <Dreamy_Jazz> Lucas_WMDE: Can you check if that change isn't already in production on wmf.23?
2023-08-23 13:26:08 <Dreamy_Jazz> It fails on enwiki
2023-08-23 13:26:12 <Lucas_WMDE> I can try
2023-08-23 13:26:13 <Dreamy_Jazz> But doesn't fail on test.wikipedia.org
2023-08-23 13:26:13 <Lucas_WMDE> let me see
2023-08-23 13:26:26 <Dreamy_Jazz> test.wikipedia.org is on wmf.23 and enwiki is on wmf.22
2023-08-23 13:26:39 <logmsgbot> !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_esams and A:cp
2023-08-23 13:26:45 <logmsgbot> !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_esams and A:cp
2023-08-23 13:26:50 <Dreamy_Jazz> I'm wondering whether it being merged before the previous scap finished meant that it applied?
2023-08-23 13:27:55 <Dreamy_Jazz> The fix for the issue wasn't made until today, so there shouldn't be a way for the issue to not apply to wmf.23.
2023-08-23 13:28:17 <Lucas_WMDE> mw1430 (random appserver) /srv/mediawiki/php-1.41.0-wmf.23/extensions/CheckUser/src/ClientHints/ClientHintsData.php doesn’t have the code yet afaict
2023-08-23 13:28:37 <Dreamy_Jazz> Let me try again.
2023-08-23 13:28:42 <Lucas_WMDE> it only merged during the last few php-fpm restarts, I don’t think it should’ve gotten synced anywhere
2023-08-23 13:29:42 <Dreamy_Jazz> Going to try another wiki
2023-08-23 13:30:46 <wikibugs> ('PS1) ''Muehlenhoff: firewall::service: Replace whitespace in resource title with underscores [puppet] - ''https://gerrit.wikimedia.org/r/951922 (https://phabricator.wikimedia.org/T336497)'
2023-08-23 13:30:48 <Lucas_WMDE> ok
2023-08-23 13:31:39 <Dreamy_Jazz> It is also not doing as expected on test.wikidata.org
2023-08-23 13:32:00 <Lucas_WMDE> in that it’s not erroring even without mwdebug?
2023-08-23 13:32:02 <Dreamy_Jazz> i.e. test.wikidata.org doesn't have the server error when not on debug
2023-08-23 13:32:06 <Lucas_WMDE> hmph
2023-08-23 13:32:07 <Dreamy_Jazz> Yes
2023-08-23 13:32:10 <wikibugs> ('PS1) ''FNegri: New files/templates for OpenStack Antelope (2023.1) [puppet] - ''https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285)'
2023-08-23 13:32:19 <Lucas_WMDE> at least it’s not the other way around…
2023-08-23 13:32:23 <Dreamy_Jazz> Ikr
2023-08-23 13:32:35 <Lucas_WMDE> I’d still finish this sync just so that the merged state is consistent with what’s deployed
2023-08-23 13:32:42 <Dreamy_Jazz> Sure
2023-08-23 13:32:59 <Lucas_WMDE> and it failed on enwiki right? so we could still do the wmf.22 one afterwards, that one isn’t behaving unexpectedly so far IIUC
2023-08-23 13:33:05 <Dreamy_Jazz> Yes. It failed on enwiki.
2023-08-23 13:33:10 <Lucas_WMDE> ok, then let’s do that
2023-08-23 13:33:11 <logmsgbot> !log lucaswerkmeister-wmde@deploy1002 lucaswerkmeister-wmde and dreamyjazz: Continuing with sync
2023-08-23 13:33:28 <Lucas_WMDE> but let’s remove some variables by not having concurrent gate-and-submits ^^
2023-08-23 13:33:38 <Dreamy_Jazz> Sure.
2023-08-23 13:33:43 <Lucas_WMDE> enwiktionary has had its Index namespace unused for 2 years, it can wait a bit longer
2023-08-23 13:33:55 <Dreamy_Jazz> :)
2023-08-23 13:34:00 <Lucas_WMDE> (fyi Kizule – I might not do that one today)
2023-08-23 13:34:15 <Dreamy_Jazz> I'm happy to move my config change to a later window if desired.
2023-08-23 13:34:27 <Dreamy_Jazz> If that makes room for the other change.
2023-08-23 13:34:40 <Lucas_WMDE> well, I’ll do the wmf.22 change and then see what else there’s time for
2023-08-23 13:34:42 <Lucas_WMDE> jouncebot: next
2023-08-23 13:34:42 <jouncebot> In 1 hour(s) and 25 minute(s): Phabricator to Phorge migration (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1500)
2023-08-23 13:34:46 <Lucas_WMDE> ooooooooooh
2023-08-23 13:34:50 <Dreamy_Jazz> Ikr
2023-08-23 13:34:53 <Lucas_WMDE> (but until then we could theoretically overrun the window a bit)
2023-08-23 13:35:31 <Kizule> :+1
2023-08-23 13:35:40 <Kizule> (y)
2023-08-23 13:37:04 <Kizule> Well, I would love to get my patch deployed in this window, since it's quick and easy one, but I'm hoping that we will be able to have all scheduled patches deployed. :)
2023-08-23 13:37:47 <wikibugs> ('PS1) ''Jbond: idp: drop superfluous permissions [puppet] - ''https://gerrit.wikimedia.org/r/951924 (https://phabricator.wikimedia.org/T341581)'
2023-08-23 13:38:33 <wikibugs> ('CR) ''Muehlenhoff: [C: ''+1] "Looks good" [puppet] - ''https://gerrit.wikimedia.org/r/951924 (https://phabricator.wikimedia.org/T341581) (owner: ''Jbond)'
2023-08-23 13:38:39 <Dreamy_Jazz> I've also re-tested the error locally and confirmed that without the fix (even on the master branch) the server still errors out.
2023-08-23 13:38:55 <logmsgbot> !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:951847|clienthints: Remove duplicate entries when converting to DB rows (T344787)]] (duration: 21m 12s)
2023-08-23 13:38:58 <Dreamy_Jazz> So no idea why it was fixed on non-mwdebug on wmf.23
2023-08-23 13:39:00 <stashbot> T344787: [{reqId}] {exception_url} Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry 'X-X-X' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\UserAgentClientHintsManager::insertMappingRows - https://phabricator.wikimedia.org/T344787
2023-08-23 13:39:30 <wikibugs> ('CR) ''Muehlenhoff: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/951922 (https://phabricator.wikimedia.org/T336497) (owner: ''Muehlenhoff)'
2023-08-23 13:39:32 <wikibugs> ('CR) ''TrainBranchBot: [C: ''+2] "Approved by lucaswerkmeister-wmde@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.41.0-wmf.22) - ''https://gerrit.wikimedia.org/r/951846 (https://phabricator.wikimedia.org/T344787) (owner: ''Dreamy Jazz)'
2023-08-23 13:45:25 <wikibugs> ('CR) ''Jbond: "LGTM minor nit optimisation inline" [puppet] - ''https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: ''Muehlenhoff)'
2023-08-23 13:46:07 <icinga-wm> RECOVERY - Check systemd state on mirror1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 13:47:24 <logmsgbot> !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_esams and A:cp
2023-08-23 13:48:10 <wikibugs> ('CR) ''Herron: [C: ''+1] icinga: Add notification when purging nagios resources [puppet] - ''https://gerrit.wikimedia.org/r/951592 (https://phabricator.wikimedia.org/T263027) (owner: ''Andrea Denisse)'
2023-08-23 13:48:11 <logmsgbot> !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_esams and A:cp
2023-08-23 13:48:24 <wikibugs> 'SRE, ''ops-eqiad, ''decommission-hardware, ''fundraising-tech-ops: decommission frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T342678 (''Jclark-ctr)'
2023-08-23 13:48:30 <wikibugs> ('PS1) ''Gmodena: Remove rc1.mediawiki.page_content_change stream [mediawiki-config] - ''https://gerrit.wikimedia.org/r/951929 (https://phabricator.wikimedia.org/T307959)'
2023-08-23 13:48:36 <wikibugs> 'SRE, ''ops-eqiad, ''decommission-hardware, ''fundraising-tech-ops: decommission frav1002.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T342678 (''Jclark-ctr) ''Open''Resolved'
2023-08-23 13:50:00 <logmsgbot> !log jmm@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host testvm2004.codfw.wmnet with OS bookworm
2023-08-23 13:51:58 <wikibugs> 'SRE, ''ops-eqiad, ''decommission-hardware, ''fundraising-tech-ops: decommission civi1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T341868 (''Jclark-ctr)'
2023-08-23 13:52:04 <wikibugs> ('CR) ''Jbond: [C: ''+2] idp: drop superfluous permissions [puppet] - ''https://gerrit.wikimedia.org/r/951924 (https://phabricator.wikimedia.org/T341581) (owner: ''Jbond)'
2023-08-23 13:52:12 <wikibugs> ('PS1) ''Majavah: P:terraform: don't serve BUSL licensed Terraform versions [puppet] - ''https://gerrit.wikimedia.org/r/951934'
2023-08-23 13:52:20 <wikibugs> 'SRE, ''ops-eqiad, ''decommission-hardware, ''fundraising-tech-ops: decommission civi1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T341868 (''Jclark-ctr) ''Open''Resolved'
2023-08-23 13:52:26 <wikibugs> ('CR) ''Jbond: [C: ''+2] idp: drop superfluous permissions [puppet] - ''https://gerrit.wikimedia.org/r/951917 (https://phabricator.wikimedia.org/T341581) (owner: ''Jbond)'
2023-08-23 13:52:30 <wikibugs> ('PS2) ''Jbond: idp: drop superfluous permissions [puppet] - ''https://gerrit.wikimedia.org/r/951917 (https://phabricator.wikimedia.org/T341581)'
2023-08-23 13:52:48 <wikibugs> ('Merged) ''jenkins-bot: clienthints: Remove duplicate entries when converting to DB rows [extensions/CheckUser] (wmf/1.41.0-wmf.22) - ''https://gerrit.wikimedia.org/r/951846 (https://phabricator.wikimedia.org/T344787) (owner: ''Dreamy Jazz)'
2023-08-23 13:52:51 <wikibugs> ('PS3) ''Jbond: idp: add datacenter-ops to puppetboard [puppet] - ''https://gerrit.wikimedia.org/r/951903 (https://phabricator.wikimedia.org/T341581)'
2023-08-23 13:53:11 <wikibugs> ('CR) ''Papaul: [C: ''+2] Add new kubernetes node to site.pp [puppet] - ''https://gerrit.wikimedia.org/r/951921 (https://phabricator.wikimedia.org/T342534) (owner: ''Papaul)'
2023-08-23 13:53:18 <wikibugs> ('PS1) ''Jgiannelos: tegola debug: Change schedule of eqiad cronjobs temporarily [deployment-charts] - ''https://gerrit.wikimedia.org/r/951936'
2023-08-23 13:53:19 <logmsgbot> !log lucaswerkmeister-wmde@deploy1002 Started scap: Backport for [[gerrit:951846|clienthints: Remove duplicate entries when converting to DB rows (T344787)]]
2023-08-23 13:53:23 <stashbot> T344787: [{reqId}] {exception_url} Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry 'X-X-X' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\UserAgentClientHintsManager::insertMappingRows - https://phabricator.wikimedia.org/T344787
2023-08-23 13:53:57 <wikibugs> ('CR) ''Andrea Denisse: [C: ''+2] icinga: Add notification when purging nagios resources [puppet] - ''https://gerrit.wikimedia.org/r/951592 (https://phabricator.wikimedia.org/T263027) (owner: ''Andrea Denisse)'
2023-08-23 13:54:50 <logmsgbot> !log lucaswerkmeister-wmde@deploy1002 dreamyjazz and lucaswerkmeister-wmde: Backport for [[gerrit:951846|clienthints: Remove duplicate entries when converting to DB rows (T344787)]] synced to the testservers mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
2023-08-23 13:54:57 <Dreamy_Jazz> Testing now
2023-08-23 13:55:05 <wikibugs> 'SRE, ''ops-eqiad, ''decommission-hardware, ''fundraising-tech-ops: decommission frbast1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T340155 (''Jclark-ctr) ''Open''Resolved'
2023-08-23 13:55:12 <Lucas_WMDE> ok
2023-08-23 13:56:37 <logmsgbot> !log btullis@cumin1001 END (PASS) - Cookbook sre.kafka.reboot-workers (exit_code=0) for Kafka test-eqiad cluster: Reboot kafka nodes
2023-08-23 13:56:49 <Dreamy_Jazz> For some reason the same thing has happened.
2023-08-23 13:56:51 <wikibugs> 'SRE, ''ops-eqiad, ''decommission-hardware, ''fundraising-tech-ops: decommission frmon1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T342693 (''Jclark-ctr) ''Open''Resolved'
2023-08-23 13:57:06 <Lucas_WMDE> o_O
2023-08-23 13:57:23 <Dreamy_Jazz> debug mode is definitely off
2023-08-23 13:58:02 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops, ''Patch-For-Review: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (''Papaul)'
2023-08-23 13:58:09 <Dreamy_Jazz> Literally the same request in the console (changing only the revision ID) now works on non-debug servers
2023-08-23 13:58:13 <Lucas_WMDE> did you check in dev tools too?
2023-08-23 13:58:14 <Lucas_WMDE> huh
2023-08-23 13:58:25 <Lucas_WMDE> (I assume console means outside the browser and thus renders my question pointless)
2023-08-23 13:58:34 <Dreamy_Jazz> Using the browser console
2023-08-23 13:58:45 <Lucas_WMDE> I was wondering if the extension is maybe buggy and always adding the header for some reason
2023-08-23 13:58:47 <Dreamy_Jazz> I could try a non-browser console, but I'm using the fetch command
2023-08-23 13:58:50 <Lucas_WMDE> maybe check in the network panel?
2023-08-23 13:59:27 <Dreamy_Jazz> Actually that might be it
2023-08-23 13:59:50 <Dreamy_Jazz> x-wikimedia-debug is set to backend=mwdebug1001.eqiad.wmnet when the extension has the debug mode off
2023-08-23 14:00:12 <Dreamy_Jazz> This is still the case even after a Ctr + F5
2023-08-23 14:00:16 <Dreamy_Jazz> *Ctrl
2023-08-23 14:00:17 <Lucas_WMDE> that sounds like it shouldn’t happen
2023-08-23 14:00:22 <Dreamy_Jazz> Ikr
2023-08-23 14:00:22 <Lucas_WMDE> and doesn’t happen on my end
2023-08-23 14:00:27 <Lucas_WMDE> firefox or chrome?
2023-08-23 14:00:30 <Dreamy_Jazz> Firefox
2023-08-23 14:00:31 <Lucas_WMDE> (I’m on ff)
2023-08-23 14:00:32 <Lucas_WMDE> hm ok
2023-08-23 14:00:42 <Dreamy_Jazz> However, the fix works as intended
2023-08-23 14:00:46 <Lucas_WMDE> yeah, good to sync I assume
2023-08-23 14:00:50 <Dreamy_Jazz> Yes
2023-08-23 14:00:53 <logmsgbot> !log lucaswerkmeister-wmde@deploy1002 dreamyjazz and lucaswerkmeister-wmde: Continuing with sync
2023-08-23 14:01:01 <Lucas_WMDE> and then probably a phab task for the extension and/or firefox being buggy?
2023-08-23 14:01:09 <Lucas_WMDE> jouncebot: now
2023-08-23 14:01:09 <jouncebot> No deployments scheduled for the next 0 hour(s) and 58 minute(s)
2023-08-23 14:01:15 <Dreamy_Jazz> I will investigate further
2023-08-23 14:01:35 <logmsgbot> !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp
2023-08-23 14:01:39 <logmsgbot> !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp
2023-08-23 14:01:39 <Dreamy_Jazz> Oh. I've realised what has happened
2023-08-23 14:01:42 <Lucas_WMDE> I think I’ll stop deploying after this change and not overrun the window too much
2023-08-23 14:01:47 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (''OSefu-WMF) ''Resolved''Open Hi All - Reopening to confirm that my the SQL Lab role in superset has been applied correctly...'
2023-08-23 14:01:50 <Lucas_WMDE> to leave more of a break before the big phab move
2023-08-23 14:01:54 <Lucas_WMDE> yes?
2023-08-23 14:02:02 <Dreamy_Jazz> When copying the request from chrome as a "fetch" it also copies the headers
2023-08-23 14:02:08 <Dreamy_Jazz> I had not noticed this
2023-08-23 14:02:11 <Lucas_WMDE> ahhhh yes
2023-08-23 14:02:20 <Dreamy_Jazz> My mistake then. Apologies.
2023-08-23 14:02:32 <Lucas_WMDE> ok, mystery solved then \o/
2023-08-23 14:02:33 <Lucas_WMDE> phew
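The mix-up above came from a copied request carrying the x-wikimedia-debug header along with it. A minimal sketch of stripping that header before replaying a request, assuming the copied headers are held in a plain dict; the helper name `strip_debug_headers` and the sample `Accept` header are hypothetical, and only the header name and backend value come from the log:

```python
def strip_debug_headers(headers, blocked=("x-wikimedia-debug",)):
    """Return a copy of headers without the blocked names (case-insensitive)."""
    blocked_lower = {name.lower() for name in blocked}
    return {
        name: value
        for name, value in headers.items()
        if name.lower() not in blocked_lower
    }


# Headers as they might look after copying a request from the browser
# (illustrative values; the Accept header is an assumption).
copied = {
    "Accept": "application/json",
    "X-Wikimedia-Debug": "backend=mwdebug1001.eqiad.wmnet",
}

cleaned = strip_debug_headers(copied)
print(cleaned)
```

The helper returns a new dict instead of mutating the input, so the original copied headers stay available for comparison.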
2023-08-23 14:03:47 <Dreamy_Jazz> I've moved my config change to the next window.
2023-08-23 14:03:51 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
2023-08-23 14:04:30 <Dreamy_Jazz> Should I remove the config change from the current window or is there a way to indicate it wasn't done due to time constraints?
2023-08-23 14:04:51 <Lucas_WMDE> Dreamy_Jazz: I wouldn’t usually bother updating the finished window tbh
2023-08-23 14:05:00 <Dreamy_Jazz> Okay. Thanks.
2023-08-23 14:05:03 <Lucas_WMDE> if someone wants to know whether something was deployed or not they should look at gerrit or SAL
2023-08-23 14:05:17 <wikibugs> 'SRE, ''ops-eqiad, ''decommission-hardware, ''fundraising-tech-ops: decommission frdev1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T341869 (''Jclark-ctr)'
2023-08-23 14:05:27 <wikibugs> 'SRE, ''ops-eqiad, ''decommission-hardware, ''fundraising-tech-ops: decommission frdev1001.frack.eqiad.wmnet - https://phabricator.wikimedia.org/T341869 (''Jclark-ctr) ''Open''Resolved'
2023-08-23 14:05:47 <Dreamy_Jazz> And apologies to Kizule for delaying their change being made by not noticing the debug header being included in the console request.
2023-08-23 14:05:51 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for kubernetes2053 - pt1979@cumin2002"
2023-08-23 14:06:00 <Kizule> No problem, I'm moving my patch to another window as well. :)
2023-08-23 14:06:30 <logmsgbot> !log lucaswerkmeister-wmde@deploy1002 Finished scap: Backport for [[gerrit:951846|clienthints: Remove duplicate entries when converting to DB rows (T344787)]] (duration: 13m 10s)
2023-08-23 14:06:34 <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-08-23 14:06:38 <stashbot> T344787: [{reqId}] {exception_url} Wikimedia\Rdbms\DBQueryError: Error 1062: Duplicate entry 'X-X-X' for key 'PRIMARY' Function: MediaWiki\CheckUser\Services\UserAgentClientHintsManager::insertMappingRows - https://phabricator.wikimedia.org/T344787
2023-08-23 14:06:44 <jinxer-wm> (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2023-08-23 14:06:44 <Lucas_WMDE> !log UTC afternoon backport+config window done
2023-08-23 14:06:47 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 14:06:49 <icinga-wm> PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: check_netbox_uncommitted_dns_changes.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 14:07:06 <Dreamy_Jazz> Thanks for the deploy!
2023-08-23 14:07:21 <Kizule> From me as well, see you later!
2023-08-23 14:07:38 <Lucas_WMDE> see you!
2023-08-23 14:08:15 <icinga-wm> RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 14:08:25 <wikibugs> 'SRE, ''ops-eqiad, ''Data-Platform-SRE, ''decommission-hardware: decommission db1108.eqiad.wmnet - https://phabricator.wikimedia.org/T336254 (''Jclark-ctr) ''Open''Resolved'
2023-08-23 14:08:44 <wikibugs> ('PS3) ''JMeybohm: admin_ng: Include admin service secrets [deployment-charts] - ''https://gerrit.wikimedia.org/r/951915 (https://phabricator.wikimedia.org/T297417)'
2023-08-23 14:09:17 <icinga-wm> PROBLEM - Uncommitted DNS changes in Netbox on netbox1002 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
2023-08-23 14:11:34 <jinxer-wm> (KubernetesAPILatency) resolved: (3) High Kubernetes API latency (LIST blockaffinities) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-08-23 14:15:44 <wikibugs> ('CR) ''Stevemunene: [C: ''+2] switch an-worker[17-48] to reuse-analytics-hadoop recipe [puppet] - ''https://gerrit.wikimedia.org/r/951458 (https://phabricator.wikimedia.org/T332570) (owner: ''Stevemunene)'
2023-08-23 14:15:46 <wikibugs> 'ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T344127 (''Jclark-ctr) Replaced failed cable'
2023-08-23 14:16:44 <jinxer-wm> (JobUnavailable) firing: (2) Reduced availability for job nutcracker in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2023-08-23 14:16:50 <logmsgbot> !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1117.eqiad.wmnet with OS bullseye
2023-08-23 14:18:15 <wikibugs> 'ops-eqiad: ManagementSSHDown - https://phabricator.wikimedia.org/T344127 (''Jclark-ctr) ''Open''Resolved a:''Jclark-ctr'
2023-08-23 14:18:41 <logmsgbot> !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-upload_eqiad and A:cp
2023-08-23 14:19:18 <wikibugs> ('CR) ''JMeybohm: [C: ''+2] admin_ng: Include admin service secrets [deployment-charts] - ''https://gerrit.wikimedia.org/r/951915 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 14:20:09 <icinga-wm> RECOVERY - Uncommitted DNS changes in Netbox on netbox1002 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes
2023-08-23 14:21:03 <logmsgbot> !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on A:cp-text_eqiad and A:cp
2023-08-23 14:22:13 <logmsgbot> !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1001.eqiad.wmnet
2023-08-23 14:22:23 <wikibugs> 'ops-eqiad: PDU sensor over limit - https://phabricator.wikimedia.org/T344394 (''Jclark-ctr) ''Open''Resolved a:''Jclark-ctr Rebalanced power'
2023-08-23 14:22:40 <wikibugs> ('PS5) ''Muehlenhoff: Make nftables::service types more compatible [puppet] - ''https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497)'
2023-08-23 14:23:36 <akosiaris> !log pool kartotherian in codfw for testing T344324
2023-08-23 14:23:40 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 14:23:41 <stashbot> T344324: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) - https://phabricator.wikimedia.org/T344324
2023-08-23 14:23:43 <wikibugs> ('CR) ''Muehlenhoff: "check experimental" [puppet] - ''https://gerrit.wikimedia.org/r/951889 (https://phabricator.wikimedia.org/T336497) (owner: ''Muehlenhoff)'
2023-08-23 14:23:47 <logmsgbot> !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw
2023-08-23 14:24:03 <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (PUT customresourcedefinitions) on k8s-mlstaging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlstaging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-08-23 14:25:11 <icinga-wm> PROBLEM - Router interfaces on cr2-eqsin is CRITICAL: CRITICAL: host 103.102.166.130, interfaces up: 82, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
2023-08-23 14:26:17 <logmsgbot> !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1001.eqiad.wmnet
2023-08-23 14:26:24 <vgutierrez> !log update to HAProxy 2.7.10 in cp4052 and cp5032 - T344047
2023-08-23 14:26:27 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 14:26:57 <wikibugs> ('Merged) ''jenkins-bot: admin_ng: Include admin service secrets [deployment-charts] - ''https://gerrit.wikimedia.org/r/951915 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 14:27:19 <jinxer-wm> (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-08-23 14:28:17 <icinga-wm> PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 70, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
2023-08-23 14:28:54 <logmsgbot> !log vgutierrez@cumin1001 START - Cookbook sre.cdn.roll-upgrade-haproxy rolling upgrade of HAProxy on P{cp4052.*,cp5032.*} and A:cp
2023-08-23 14:30:55 <wikibugs> 'SRE, ''ops-eqiad, ''DC-Ops, ''MW-on-K8s, ''serviceops-radar: Physical re-labeling of mw1497 and mw1498 to kubernetes1025 and kubernetes1026 - https://phabricator.wikimedia.org/T343708 (''Jclark-ctr) ''Open''Resolved Relabeled Servers'
2023-08-23 14:31:16 <wikibugs> 'SRE, ''ops-eqiad, ''DC-Ops, ''MW-on-K8s, ''serviceops-radar: Physical re-labeling of mw1497 and mw1498 to kubernetes1025 and kubernetes1026 - https://phabricator.wikimedia.org/T343708 (''Clement_Goubert) Thank you, sorry for the out-of-order operation'
2023-08-23 14:31:33 <icinga-wm> PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 14:31:35 <icinga-wm> PROBLEM - Maps HTTPS on maps1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 14:31:47 <icinga-wm> PROBLEM - Maps HTTPS on maps2010 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 14:31:57 <icinga-wm> PROBLEM - Maps HTTPS on maps2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 14:32:13 <icinga-wm> PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 14:32:15 <icinga-wm> PROBLEM - Maps HTTPS on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 14:32:16 <logmsgbot> !log jayme@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
2023-08-23 14:32:21 <icinga-wm> PROBLEM - Maps HTTPS on maps2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 14:32:34 <logmsgbot> !log btullis@cumin1001 START - Cookbook sre.hadoop.reboot-workers for Hadoop test cluster
2023-08-23 14:32:51 <logmsgbot> !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1002.eqiad.wmnet
2023-08-23 14:33:33 <jinxer-wm> (KubernetesAPILatency) firing: (2) High Kubernetes API latency (GET pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-08-23 14:33:47 <effie> we know about the maps hosts
2023-08-23 14:34:04 <logmsgbot> !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.cdn.roll-upgrade-haproxy (exit_code=0) rolling upgrade of HAProxy on P{cp4052.*,cp5032.*} and A:cp
2023-08-23 14:34:28 <akosiaris> !log depool again kartotherian in codfw for testing T344324
2023-08-23 14:34:32 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 14:34:32 <stashbot> T344324: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) - https://phabricator.wikimedia.org/T344324
2023-08-23 14:34:35 <logmsgbot> !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw
2023-08-23 14:35:56 <wikibugs> ('CR) ''Joal: [C: ''+1] "I guess we can deploy this safely as no more producer uses this stream, right?" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/951929 (https://phabricator.wikimedia.org/T307959) (owner: ''Gmodena)'
2023-08-23 14:36:45 <wikibugs> ('CR) ''BCornwall: [C: ''+2] sre.cdn.roll-reboot: Reduce min_grace_sleep to 300 [cookbooks] - ''https://gerrit.wikimedia.org/r/951196 (owner: ''BCornwall)'
2023-08-23 14:36:50 <logmsgbot> !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1002.eqiad.wmnet
2023-08-23 14:37:05 <wikibugs> ('PS6) ''Hnowlan: helmfile: add namespace and service definition for geo-analytics [deployment-charts] - ''https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400)'
2023-08-23 14:37:12 <logmsgbot> !log jmm@cumin2002 START - Cookbook sre.ganeti.reboot-vm for VM testvm2002.codfw.wmnet
2023-08-23 14:38:33 <jinxer-wm> (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (GET pods) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-08-23 14:40:01 <wikibugs> ('CR) ''Gmodena: Remove rc1.mediawiki.page_content_change stream (''1 comment) [mediawiki-config] - ''https://gerrit.wikimedia.org/r/951929 (https://phabricator.wikimedia.org/T307959) (owner: ''Gmodena)'
2023-08-23 14:41:04 <logmsgbot> !log jmm@cumin2002 END (PASS) - Cookbook sre.ganeti.reboot-vm (exit_code=0) for VM testvm2002.codfw.wmnet
2023-08-23 14:43:43 <icinga-wm> RECOVERY - Maps HTTPS on maps2009 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.761 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 14:44:26 <logmsgbot> !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host datahubsearch1003.eqiad.wmnet
2023-08-23 14:44:35 <icinga-wm> RECOVERY - Maps HTTPS on maps2010 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.172 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 14:45:03 <icinga-wm> RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.169 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 14:45:47 <icinga-wm> RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.168 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 14:45:57 <icinga-wm> RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1342 bytes in 6.672 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 14:46:13 <icinga-wm> RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.216 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 14:47:28 <wikibugs> ('PS2) ''BBlack: varnish: parameterize fe cache mem reservation [puppet] - ''https://gerrit.wikimedia.org/r/849633'
2023-08-23 14:47:30 <wikibugs> ('PS1) ''BBlack: esams: experimental frontend memory settings [puppet] - ''https://gerrit.wikimedia.org/r/951949'
2023-08-23 14:47:53 <wikibugs> ('CR) ''Jbond: [C: ''+1] "lgtm but see inline" [puppet] - ''https://gerrit.wikimedia.org/r/951922 (https://phabricator.wikimedia.org/T336497) (owner: ''Muehlenhoff)'
2023-08-23 14:47:57 <icinga-wm> RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 14:48:25 <wikibugs> ('PS7) ''Hnowlan: helmfile: add namespace and service definition for geo-analytics [deployment-charts] - ''https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400)'
2023-08-23 14:48:25 <logmsgbot> !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host datahubsearch1003.eqiad.wmnet
2023-08-23 14:49:06 <wikibugs> ('CR) ''CI reject: [V: ''-1] varnish: parameterize fe cache mem reservation [puppet] - ''https://gerrit.wikimedia.org/r/849633 (owner: ''BBlack)'
2023-08-23 14:50:01 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (''jbond) > Reopening you say that the role has been applied correctly, Is there some further action required?'
2023-08-23 14:50:12 <wikibugs> ('PS1) ''JMeybohm: Remove admin secrets from service secrets [labs/private] - ''https://gerrit.wikimedia.org/r/951951 (https://phabricator.wikimedia.org/T297417)'
2023-08-23 14:50:14 <logmsgbot> !log btullis@cumin1001 START - Cookbook sre.hosts.reboot-single for host an-launcher1002.eqiad.wmnet
2023-08-23 14:50:34 <wikibugs> ('PS1) ''JMeybohm: deployment_server::services: Drop dummy admin services [puppet] - ''https://gerrit.wikimedia.org/r/951952 (https://phabricator.wikimedia.org/T297417)'
2023-08-23 14:50:57 <wikibugs> ('CR) ''JMeybohm: [V: ''+2 C: ''+2] Remove admin secrets from service secrets [labs/private] - ''https://gerrit.wikimedia.org/r/951951 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 14:53:47 <wikibugs> ('CR) ''JMeybohm: [V: ''+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42996/console" [puppet] - ''https://gerrit.wikimedia.org/r/951952 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 14:54:14 <wikibugs> ('Abandoned) ''Fabfur: haproxy: sanitize eventual duplicate content-length header [puppet] - ''https://gerrit.wikimedia.org/r/951832 (https://phabricator.wikimedia.org/T344047) (owner: ''Fabfur)'
2023-08-23 14:54:17 <wikibugs> ('CR) ''JMeybohm: [V: ''+1 C: ''+2] deployment_server::services: Drop dummy admin services [puppet] - ''https://gerrit.wikimedia.org/r/951952 (https://phabricator.wikimedia.org/T297417) (owner: ''JMeybohm)'
2023-08-23 14:55:25 <wikibugs> ('PS3) ''BBlack: varnish: parameterize fe cache mem reservation [puppet] - ''https://gerrit.wikimedia.org/r/849633'
2023-08-23 14:55:27 <wikibugs> ('PS2) ''BBlack: esams: experimental frontend memory settings [puppet] - ''https://gerrit.wikimedia.org/r/951949'
2023-08-23 14:55:54 <wikibugs> 'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (''nskaggs) Any update on status from Dell on getting this hardware operational? Are we still waiting on the correct controller cards?'
2023-08-23 14:56:08 <wikibugs> 'ops-eqiad, ''DC-Ops, ''Infrastructure-Foundations: Q1:rack/setup/install pki1002 - https://phabricator.wikimedia.org/T342892 (''Jclark-ctr)'
2023-08-23 14:56:35 <logmsgbot> !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply
2023-08-23 14:57:14 <akosiaris> !log deploy codfw tegola-vector-tiles with high CPU limits to rule out a hunch. T344324
2023-08-23 14:57:17 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 14:57:18 <stashbot> T344324: Maps Unavailability due to thanos-swift cfssl rollout (14 Aug 2023) - https://phabricator.wikimedia.org/T344324
2023-08-23 14:57:54 <logmsgbot> !log stevemunene@cumin1001 START - Cookbook sre.hosts.reimage for host an-worker1117.eqiad.wmnet with OS bullseye
2023-08-23 14:58:24 <logmsgbot> !log akosiaris@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply
2023-08-23 14:58:56 <logmsgbot> !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host bast4004.wikimedia.org
2023-08-23 14:58:57 <logmsgbot> !log akosiaris@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply
2023-08-23 14:59:04 <wikibugs> ('PS2) ''Klausman: prometheus: Add recording rules for istio traffic on k8s [puppet] - ''https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620)'
2023-08-23 14:59:15 <wikibugs> 'SRE, ''ops-eqiad, ''DC-Ops, ''Data-Persistence: Q1:rack/setup/install db12[34-49] - https://phabricator.wikimedia.org/T342166 (''Jclark-ctr)'
2023-08-23 14:59:17 <akosiaris> !log pool kartotherian in codfw for testing T344324
2023-08-23 14:59:21 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 14:59:29 <logmsgbot> !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw
2023-08-23 14:59:30 <logmsgbot> !log jmm@cumin2002 START - Cookbook sre.hosts.reimage for host testvm2004.codfw.wmnet with OS bookworm
2023-08-23 15:00:00 <logmsgbot> !log btullis@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host an-launcher1002.eqiad.wmnet
2023-08-23 15:00:04 <jouncebot> brennen: My dear minions, it's time we take the moon! Just kidding. Time for Phabricator to Phorge migration deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1500).
2023-08-23 15:00:20 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (''OSefu-WMF) Sorry typo above >>! In T344257#9113615, @OSefu-WMF wrote: > Hi All - Reopening to confirm that my the SQL Lab rol...'
2023-08-23 15:00:54 <brennen> o/
2023-08-23 15:01:01 <marostegui> o/
2023-08-23 15:01:03 <Lucas_WMDE> hype
2023-08-23 15:01:17 <jynus> brennen: marostegui: I have phab backups on both datacenters - they finished correctly and are currently being compressed to recover them quickly
2023-08-23 15:01:23 <marostegui> cool
2023-08-23 15:01:30 <marostegui> brennen: please let me know before putting phab in RO
2023-08-23 15:01:49 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (''BTullis) Thanks @OSefu-WMF - Could you try running a query again now please? I made a change recently (https://gerrit.wikimed...'
2023-08-23 15:01:53 <jynus> I will be here just in standby mode
2023-08-23 15:02:40 <brennen> marostegui: we're downtiming the service now, will probably just stop httpd and phd, then let you know.
2023-08-23 15:02:48 <logmsgbot> !log jelto@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on phab1004.eqiad.wmnet with reason: Switch Phabricator to Phorge
2023-08-23 15:02:58 <marostegui> brennen: excellent, thanks
2023-08-23 15:03:03 <logmsgbot> !log jelto@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on phab1004.eqiad.wmnet with reason: Switch Phabricator to Phorge
2023-08-23 15:04:13 <wikibugs> ('PS3) ''Klausman: prometheus: Add recording rules for istio traffic on k8s [puppet] - ''https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620)'
2023-08-23 15:04:44 <wikibugs> 'SRE, ''ops-eqiad, ''DC-Ops, ''serviceops: Q1:rack/setup/install kubernetes10[27-56] - https://phabricator.wikimedia.org/T342533 (''Jclark-ctr)'
2023-08-23 15:04:47 <logmsgbot> !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host bast4004.wikimedia.org
2023-08-23 15:05:09 <wikibugs> ('CR) ''FNegri: [C: ''+1] "Agreed. Hopefully something will come out of the OpenTF initiative:" [puppet] - ''https://gerrit.wikimedia.org/r/951934 (owner: ''Majavah)'
2023-08-23 15:06:07 <wikibugs> ('CR) ''Klausman: prometheus: Add recording rules for istio traffic on k8s (''1 comment) [puppet] - ''https://gerrit.wikimedia.org/r/948149 (https://phabricator.wikimedia.org/T327620) (owner: ''Klausman)'
2023-08-23 15:06:39 <icinga-wm> PROBLEM - Host an-druid1004 is DOWN: PING CRITICAL - Packet loss = 100%
2023-08-23 15:06:45 <icinga-wm> RECOVERY - Host an-druid1004 is UP: PING OK - Packet loss = 0%, RTA = 0.61 ms
2023-08-23 15:07:01 <wikibugs> 'SRE, ''ops-eqiad, ''DC-Ops, ''Data-Platform-SRE: Q1:rack/setup/install wdqs102[0-4] - https://phabricator.wikimedia.org/T342749 (''Jclark-ctr)'
2023-08-23 15:07:50 <logmsgbot> !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host cuminunpriv1001.eqiad.wmnet
2023-08-23 15:08:41 <jynus> Should I change topic?
2023-08-23 15:08:57 <brennen> marostegui: phab httpd & phd are down
2023-08-23 15:09:01 <marostegui> ok
2023-08-23 15:09:02 <marostegui> one sec
2023-08-23 15:09:12 <brennen> jynus: might be good to mention phab maint
2023-08-23 15:09:20 <brennen> hopefully this will be brief. :)
2023-08-23 15:09:24 <brennen> (things i should not say aloud.)
2023-08-23 15:09:36 <marostegui> brennen: replication stopped
2023-08-23 15:09:55 <brennen> marostegui: good to proceed with migration?
2023-08-23 15:09:59 <marostegui> brennen: yep
2023-08-23 15:10:03 <brennen> cool, here goes
2023-08-23 15:10:07 <jynus> ^
2023-08-23 15:10:19 <jynus> that will also CC urandom and herron
2023-08-23 15:10:35 <logmsgbot> !log brennen@deploy1002 Started deploy [phabricator/deployment@82e8e76]: update phabricator to phorge (T333885)
2023-08-23 15:11:09 <herron> ack
2023-08-23 15:11:10 <logmsgbot> !log brennen@deploy1002 Finished deploy [phabricator/deployment@82e8e76]: update phabricator to phorge (T333885) (duration: 00m 34s)
2023-08-23 15:11:50 <logmsgbot> !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cuminunpriv1001.eqiad.wmnet
2023-08-23 15:12:44 <brennen> marostegui: ok, phab is back up, migrations should i believe have happened, lemme confirm that...
2023-08-23 15:12:47 <logmsgbot> !log btullis@cumin1001 END (PASS) - Cookbook sre.hadoop.reboot-workers (exit_code=0) for Hadoop test cluster
2023-08-23 15:13:02 <brett> cheers from the sidelines
2023-08-23 15:13:22 <Dreamy_Jazz> My changes to a task made minutes before seem to still be there
2023-08-23 15:13:23 <marostegui> brennen: sure, we can leave replication stopped till tomorrow on the "just in-case host"
2023-08-23 15:13:37 <marostegui> That shouldn't be an issue
2023-08-23 15:13:49 <jynus> oh, wow, that was fast
2023-08-23 15:14:01 <brennen> yeah, scap deploy is pretty quick
2023-08-23 15:14:04 <wikibugs> ('PS2) ''FNegri: New files/templates for OpenStack Antelope (2023.1) [puppet] - ''https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285)'
2023-08-23 15:14:10 <jynus> I thought it was going to be like a multiple-hours db migration
2023-08-23 15:14:30 <wikibugs> ('CR) ''CI reject: [V: ''-1] New files/templates for OpenStack Antelope (2023.1) [puppet] - ''https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) (owner: ''FNegri)'
2023-08-23 15:14:35 <James_F> So it must be new code. :-)
2023-08-23 15:14:35 <James_F> Well, the font changed.
2023-08-23 15:16:10 <wikibugs> ('CR) ''BBlack: "PCC on a few nodes: https://puppet-compiler.wmflabs.org/output/951949/42998/" [puppet] - ''https://gerrit.wikimedia.org/r/951949 (owner: ''BBlack)'
2023-08-23 15:16:34 <logmsgbot> !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on testvm2004.codfw.wmnet with reason: host reimage
2023-08-23 15:17:16 <Dreamy_Jazz> One thing that might be different is the Activity pane doesn't seem expanded by default. I can't remember 100% whether it was expanded by default before, but the empty space looks odd.
2023-08-23 15:17:45 <marostegui> Dreamy_Jazz: It was expanded by default before, yeah
2023-08-23 15:17:49 <marostegui> At least it'd show stuff
2023-08-23 15:18:13 <Dreamy_Jazz> It does still show stuff if you click on one of the tabs
2023-08-23 15:18:17 <marostegui> yeah
2023-08-23 15:18:35 <jynus> I expect minor inconveniences to show up, it always happens on upgrade
2023-08-23 15:18:44 <Dreamy_Jazz> ^
2023-08-23 15:18:50 <jynus> but as long as it is that, it is not a big issue
2023-08-23 15:19:02 <Dreamy_Jazz> No problem with it being like this. Just wanted to report it.
2023-08-23 15:19:45 <brennen> thanks, noted. we think this is an upstream issue that should be fixed with future updates.
2023-08-23 15:19:54 <logmsgbot> !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on testvm2004.codfw.wmnet with reason: host reimage
2023-08-23 15:19:56 <Dreamy_Jazz> 👍
2023-08-23 15:20:25 <brennen> marostegui: leaving replication stopped seems sensible. i'll be around through US workday tomorrow. hoping we don't identify anything that needs a rollback though.
2023-08-23 15:20:35 <jynus> brennen: is the maintenance then finished, other than monitoring?
2023-08-23 15:21:04 <logmsgbot> !log brennen@deploy1002 Started deploy [phabricator/deployment@82e8e76]: update phabricator to phorge (T333885)
2023-08-23 15:21:09 <stashbot> T333885: Migrate phabricator.wikimedia.org to Phorge as upstream - https://phabricator.wikimedia.org/T333885
2023-08-23 15:21:09 <marostegui> brennen: No problem, I will leave it stopped until you give me green light
2023-08-23 15:21:21 <brennen> jynus: we're updating the fallback machine and then this should just be monitoring.
2023-08-23 15:21:27 <wikibugs> ('PS1) ''Effie Mouzeli: tegola: bump image and cpu limits on codfw [deployment-charts] - ''https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324)'
2023-08-23 15:21:37 <jynus> I see, then I will wait for that to finish
2023-08-23 15:21:43 <logmsgbot> !log brennen@deploy1002 Finished deploy [phabricator/deployment@82e8e76]: update phabricator to phorge (T333885) (duration: 00m 38s)
2023-08-23 15:22:15 <wikibugs> ('CR) ''CI reject: [V: ''-1] tegola: bump image and cpu limits on codfw [deployment-charts] - ''https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324) (owner: ''Effie Mouzeli)'
2023-08-23 15:23:26 <wikibugs> ('CR) ''Vgutierrez: [C: ''+1] varnish: parameterize fe cache mem reservation [puppet] - ''https://gerrit.wikimedia.org/r/849633 (owner: ''BBlack)'
2023-08-23 15:24:39 <wikibugs> ('PS2) ''Effie Mouzeli: tegola: bump image and cpu limits on codfw [deployment-charts] - ''https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324)'
2023-08-23 15:25:30 <wikibugs> ('CR) ''CI reject: [V: ''-1] tegola: bump image and cpu limits on codfw [deployment-charts] - ''https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324) (owner: ''Effie Mouzeli)'
2023-08-23 15:25:39 <brennen> jynus: should be good
2023-08-23 15:25:40 <wikibugs> ('PS3) ''Effie Mouzeli: tegola: bump image and cpu limits on codfw [deployment-charts] - ''https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324)'
2023-08-23 15:28:05 <wikibugs> ('CR) ''Eevans: [C: ''+2] Update kask container image path [deployment-charts] - ''https://gerrit.wikimedia.org/r/913949 (https://phabricator.wikimedia.org/T335691) (owner: ''Ahmon Dancy)'
2023-08-23 15:28:40 <wikibugs> ('CR) ''Alexandros Kosiaris: [C: ''-1] tegola: bump image and cpu limits on codfw (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324) (owner: ''Effie Mouzeli)'
2023-08-23 15:29:37 <wikibugs> ('CR) ''Effie Mouzeli: tegola: bump image and cpu limits on codfw (''1 comment) [deployment-charts] - ''https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324) (owner: ''Effie Mouzeli)'
2023-08-23 15:29:46 <logmsgbot> !log ayounsi@cumin1001 START - Cookbook sre.dns.netbox
2023-08-23 15:29:48 <logmsgbot> !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/echostore: apply
2023-08-23 15:30:09 <logmsgbot> !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/echostore: apply
2023-08-23 15:30:34 <wikibugs> ('PS1) ''Gmodena: data-engineering: flink: alert when TM is missing for 5m. [alerts] - ''https://gerrit.wikimedia.org/r/951959 (https://phabricator.wikimedia.org/T340666)'
2023-08-23 15:31:45 <logmsgbot> !log ayounsi@cumin1001 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: esams sandbox - ayounsi@cumin1001"
2023-08-23 15:31:54 <logmsgbot> !log eevans@deploy1002 helmfile [codfw] START helmfile.d/services/echostore: apply
2023-08-23 15:31:58 <wikibugs> ('PS4) ''Effie Mouzeli: tegola: bump cpu limits [deployment-charts] - ''https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324)'
2023-08-23 15:32:04 <logmsgbot> !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.reboot-workers (exit_code=0) for Druid analytics cluster: Reboot Druid nodes
2023-08-23 15:32:17 <wikibugs> ('CR) ''Vgutierrez: [C: ''+1] esams: experimental frontend memory settings [puppet] - ''https://gerrit.wikimedia.org/r/951949 (owner: ''BBlack)'
2023-08-23 15:32:37 <wikibugs> ('CR) ''Alexandros Kosiaris: [C: ''+1] tegola: bump cpu limits [deployment-charts] - ''https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324) (owner: ''Effie Mouzeli)'
2023-08-23 15:32:51 <wikibugs> ('CR) ''Gmodena: "I suspect this was the cause of alerts fired during a maintenance restart today:" [alerts] - ''https://gerrit.wikimedia.org/r/951959 (https://phabricator.wikimedia.org/T340666) (owner: ''Gmodena)'
2023-08-23 15:33:07 <logmsgbot> !log eevans@deploy1002 helmfile [codfw] DONE helmfile.d/services/echostore: apply
2023-08-23 15:33:19 <logmsgbot> !log ayounsi@cumin1001 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: esams sandbox - ayounsi@cumin1001"
2023-08-23 15:33:19 <logmsgbot> !log ayounsi@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2023-08-23 15:33:23 <wikibugs> ('CR) ''Effie Mouzeli: [C: ''+2] tegola: bump cpu limits [deployment-charts] - ''https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324) (owner: ''Effie Mouzeli)'
2023-08-23 15:34:06 <wikibugs> ('Merged) ''jenkins-bot: tegola: bump cpu limits [deployment-charts] - ''https://gerrit.wikimedia.org/r/951957 (https://phabricator.wikimedia.org/T344324) (owner: ''Effie Mouzeli)'
2023-08-23 15:34:32 <logmsgbot> !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/echostore: apply
2023-08-23 15:35:22 <logmsgbot> !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host testvm2004.codfw.wmnet with OS bookworm
2023-08-23 15:35:43 <wikibugs> ('PS1) ''Ebernhardson: Draft: cirrus streaming updater producer service [deployment-charts] - ''https://gerrit.wikimedia.org/r/951960'
2023-08-23 15:35:43 <logmsgbot> !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/echostore: apply
2023-08-23 15:36:58 <icinga-wm> PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 15:37:03 <wikibugs> ('PS3) ''FNegri: New files/templates for OpenStack Antelope (2023.1) [puppet] - ''https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285)'
2023-08-23 15:37:19 <logmsgbot> !log eevans@deploy1002 helmfile [staging] START helmfile.d/services/sessionstore: apply
2023-08-23 15:37:32 <logmsgbot> !log eevans@deploy1002 helmfile [staging] DONE helmfile.d/services/sessionstore: apply
2023-08-23 15:37:33 <wikibugs> ('CR) ''CI reject: [V: ''-1] New files/templates for OpenStack Antelope (2023.1) [puppet] - ''https://gerrit.wikimedia.org/r/951923 (https://phabricator.wikimedia.org/T341285) (owner: ''FNegri)'
2023-08-23 15:38:09 <logmsgbot> !log jiji@deploy1002 helmfile [codfw] START helmfile.d/services/tegola-vector-tiles: apply
2023-08-23 15:39:09 <logmsgbot> !log eevans@deploy1002 helmfile [codfw] START helmfile.d/services/sessionstore: apply
2023-08-23 15:39:28 <icinga-wm> RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 15:39:33 <logmsgbot> !log eevans@deploy1002 helmfile [codfw] DONE helmfile.d/services/sessionstore: apply
2023-08-23 15:40:19 <wikibugs> ('CR) ''BBlack: [C: ''+2] varnish: parameterize fe cache mem reservation [puppet] - ''https://gerrit.wikimedia.org/r/849633 (owner: ''BBlack)'
2023-08-23 15:40:24 <logmsgbot> !log eevans@deploy1002 helmfile [eqiad] START helmfile.d/services/sessionstore: apply
2023-08-23 15:40:27 <logmsgbot> !log jiji@deploy1002 helmfile [codfw] DONE helmfile.d/services/tegola-vector-tiles: apply
2023-08-23 15:40:29 <wikibugs> ('CR) ''BBlack: [C: ''+2] esams: experimental frontend memory settings [puppet] - ''https://gerrit.wikimedia.org/r/951949 (owner: ''BBlack)'
2023-08-23 15:40:46 <logmsgbot> !log eevans@deploy1002 helmfile [eqiad] DONE helmfile.d/services/sessionstore: apply
2023-08-23 15:44:22 <logmsgbot> !log stevemunene@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1117.eqiad.wmnet with OS bullseye
2023-08-23 15:44:22 <wikibugs> ('PS2) ''Majavah: P:terraform: don't serve BUSL licensed Terraform versions [puppet] - ''https://gerrit.wikimedia.org/r/951934'
2023-08-23 15:44:56 <logmsgbot> !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw
2023-08-23 15:45:10 <icinga-wm> PROBLEM - Maps HTTPS on maps2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 15:46:14 <icinga-wm> PROBLEM - Maps HTTPS on maps2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 15:46:24 <icinga-wm> PROBLEM - Maps HTTPS on maps2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 15:46:46 <wikibugs> ('CR) ''Majavah: [C: ''+2] P:terraform: don't serve BUSL licensed Terraform versions [puppet] - ''https://gerrit.wikimedia.org/r/951934 (owner: ''Majavah)'
2023-08-23 15:47:22 <icinga-wm> PROBLEM - Maps HTTPS on maps2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 15:47:30 <icinga-wm> PROBLEM - Maps HTTPS on maps1009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 15:47:38 <icinga-wm> PROBLEM - Maps HTTPS on maps2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 15:50:06 <icinga-wm> RECOVERY - Maps HTTPS on maps2006 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.649 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 15:50:14 <icinga-wm> RECOVERY - Maps HTTPS on maps1009 is OK: HTTP OK: HTTP/1.1 200 OK - 1342 bytes in 0.189 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 15:50:20 <icinga-wm> RECOVERY - Maps HTTPS on maps2007 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.185 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 15:50:30 <icinga-wm> RECOVERY - Maps HTTPS on maps2009 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.331 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 15:50:38 <icinga-wm> RECOVERY - Maps HTTPS on maps2005 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.325 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 15:50:56 <wikibugs> ('CR) ''Kamila Součková: [C: ''+1] helmfile: add namespace and service definition for geo-analytics [deployment-charts] - ''https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400) (owner: ''Hnowlan)'
2023-08-23 15:51:01 <wikibugs> ('PS22) ''Slyngshede: C:bigtop::hadoop move net-topology.py to files. [puppet] - ''https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480)'
2023-08-23 15:51:46 <icinga-wm> RECOVERY - Maps HTTPS on maps2008 is OK: HTTP OK: HTTP/1.1 200 OK - 956 bytes in 0.553 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook
2023-08-23 15:52:24 <wikibugs> ('PS1) ''Effie Mouzeli: tegola-vector-tiles: bump cpu [deployment-charts] - ''https://gerrit.wikimedia.org/r/951962'
2023-08-23 15:53:44 <wikibugs> ('CR) ''Effie Mouzeli: [C: ''+2] tegola-vector-tiles: bump cpu [deployment-charts] - ''https://gerrit.wikimedia.org/r/951962 (owner: ''Effie Mouzeli)'
2023-08-23 15:54:03 <wikibugs> ('PS1) ''Kamila Součková: cassandra-http-gateway: remove typo in values [deployment-charts] - ''https://gerrit.wikimedia.org/r/951964'
2023-08-23 15:54:07 <logmsgbot> !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw
2023-08-23 15:54:11 <wikibugs> ('PS2) ''JMeybohm: deployment_server: Add jaeger user to aux-k8s [puppet] - ''https://gerrit.wikimedia.org/r/951533 (https://phabricator.wikimedia.org/T344253)'
2023-08-23 15:54:31 <wikibugs> ('Merged) ''jenkins-bot: tegola-vector-tiles: bump cpu [deployment-charts] - ''https://gerrit.wikimedia.org/r/951962 (owner: ''Effie Mouzeli)'
2023-08-23 15:55:33 <wikibugs> ('PS2) ''Kamila Součková: cassandra-http-gateway: remove typo in values [deployment-charts] - ''https://gerrit.wikimedia.org/r/951964'
2023-08-23 15:55:34 <effie> !log pooled codfw kartotherian/maps
2023-08-23 15:55:37 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 15:56:29 <wikibugs> ('CR) ''JMeybohm: [V: ''+1] "PCC SUCCESS (CORE_DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/42999/console" [puppet] - ''https://gerrit.wikimedia.org/r/951533 (https://phabricator.wikimedia.org/T344253) (owner: ''JMeybohm)'
2023-08-23 15:57:09 <bblack> !log cp3066 - varnish-frontend-restart for new memory params experiment
2023-08-23 15:57:11 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 15:57:51 <wikibugs> ('CR) ''Kamila Součková: "here, have some cleanup in return for the other review :D" [deployment-charts] - ''https://gerrit.wikimedia.org/r/951964 (owner: ''Kamila Součková)'
2023-08-23 15:59:30 <wikibugs> ('CR) ''Hnowlan: [C: ''+2] helmfile: add namespace and service definition for geo-analytics [deployment-charts] - ''https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400) (owner: ''Hnowlan)'
2023-08-23 15:59:56 <wikibugs> ('CR) ''Hnowlan: [C: ''+1] cassandra-http-gateway: remove typo in values [deployment-charts] - ''https://gerrit.wikimedia.org/r/951964 (owner: ''Kamila Součková)'
2023-08-23 16:01:59 <wikibugs> ('PS1) ''Jbond: puppetdb-api-microservice: redact one the puppetdb side [puppet] - ''https://gerrit.wikimedia.org/r/951965'
2023-08-23 16:02:34 <wikibugs> ('Merged) ''jenkins-bot: helmfile: add namespace and service definition for geo-analytics [deployment-charts] - ''https://gerrit.wikimedia.org/r/941374 (https://phabricator.wikimedia.org/T336400) (owner: ''Hnowlan)'
2023-08-23 16:05:52 <logmsgbot> !log hnowlan@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
2023-08-23 16:06:25 <logmsgbot> !log hnowlan@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
2023-08-23 16:06:36 <logmsgbot> !log jclark@cumin1001 START - Cookbook sre.dns.netbox
2023-08-23 16:07:21 <logmsgbot> !log hnowlan@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
2023-08-23 16:07:54 <logmsgbot> !log jclark@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2023-08-23 16:07:57 <logmsgbot> !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_eqiad and A:cp
2023-08-23 16:08:51 <logmsgbot> !log hnowlan@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
2023-08-23 16:09:17 <logmsgbot> !log fabfur@cumin1001 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_eqiad and A:cp
2023-08-23 16:09:22 <logmsgbot> !log hnowlan@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
2023-08-23 16:10:22 <icinga-wm> PROBLEM - Check systemd state on cloudweb1004 is CRITICAL: CRITICAL - degraded: The following units failed: wikitech_run_jobs.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 16:10:49 <logmsgbot> !log hnowlan@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
2023-08-23 16:11:35 <wikibugs> 'SRE, ''ops-eqiad, ''decommission-hardware, ''serviceops: Decommission thumbor100[12] - https://phabricator.wikimedia.org/T344598 (''RobH)'
2023-08-23 16:11:48 <icinga-wm> RECOVERY - Check systemd state on cloudweb1004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 16:14:55 <wikibugs> ('PS1) ''Herron: Revert "thanos-fe: switch to cfssl" [puppet] - ''https://gerrit.wikimedia.org/r/951853'
2023-08-23 16:14:57 <wikibugs> ('PS2) ''Jbond: puppetdb-api-microservice: redact one the puppetdb side [puppet] - ''https://gerrit.wikimedia.org/r/951965'
2023-08-23 16:16:41 <wikibugs> ('CR) ''Herron: [C: ''+2] Revert "thanos-fe: switch to cfssl" [puppet] - ''https://gerrit.wikimedia.org/r/951853 (owner: ''Herron)'
2023-08-23 16:17:10 <effie> !log depool maps/karothertian codfw
2023-08-23 16:17:12 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 16:17:58 <logmsgbot> !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw
2023-08-23 16:19:43 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (''OSefu-WMF) Unfortunately I'm still getting the same error in SQL lab.'
2023-08-23 16:24:44 <logmsgbot> !log hnowlan@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
2023-08-23 16:25:19 <logmsgbot> !log hnowlan@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
2023-08-23 16:27:16 <logmsgbot> !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply
2023-08-23 16:35:15 <bblack> !log cp3067-81 - rolling restart of varnish frontends (one at a time, 30 minute sleep between, will run for ~7.5h), for experimental cache memory settings from https://gerrit.wikimedia.org/r/c/operations/puppet/+/951949
2023-08-23 16:35:22 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 16:36:06 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (''BTullis) Oh, sorry about this @OSefu-WMF. I've just tried the same query from your screenshot above and its working for me. I...'
2023-08-23 16:37:08 <wikibugs> ('PS1) ''Andrea Denisse: alerting_host: Failover Icinga and Alertmanger from eqiad to codfw [puppet] - ''https://gerrit.wikimedia.org/r/951968 (https://phabricator.wikimedia.org/T344671)'
2023-08-23 16:37:25 <logmsgbot> !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply
2023-08-23 16:37:43 <jinxer-wm> (SystemdUnitFailed) firing: (5) wdqs-blazegraph.service Failed on wdqs1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-08-23 16:42:48 <wikibugs> ('PS1) ''Andrea Denisse: dns: Repoint alert host services from alert1001 to alert2001 [dns] - ''https://gerrit.wikimedia.org/r/951969 (https://phabricator.wikimedia.org/T344671)'
2023-08-23 16:43:39 <logmsgbot> !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for kubernetes2053 - pt1979@cumin2002"
2023-08-23 16:43:39 <logmsgbot> !log pt1979@cumin2002 END (FAIL) - Cookbook sre.dns.netbox (exit_code=99)
2023-08-23 16:43:55 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
2023-08-23 16:45:12 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2023-08-23 16:45:39 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2053.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 16:46:23 <wikibugs> 'SRE, ''SRE-Access-Requests: Requesting membership in analytics-privatedata-users group, sql_lab role, Kerberos Principal for Omari Sefu - https://phabricator.wikimedia.org/T344257 (''OSefu-WMF) @BTullis After login/logout I am able to view dashboards but it seems like the issues is confined to the SQL Lab fe...'
2023-08-23 16:46:35 <wikibugs> ('CR) ''Herron: [C: ''+1] dns: Repoint alert host services from alert1001 to alert2001 [dns] - ''https://gerrit.wikimedia.org/r/951969 (https://phabricator.wikimedia.org/T344671) (owner: ''Andrea Denisse)'
2023-08-23 16:46:43 <wikibugs> ('CR) ''Herron: [C: ''+1] alerting_host: Failover Icinga and Alertmanger from eqiad to codfw [puppet] - ''https://gerrit.wikimedia.org/r/951968 (https://phabricator.wikimedia.org/T344671) (owner: ''Andrea Denisse)'
2023-08-23 16:54:48 <wikibugs> ('PS14) ''Ilias Sarantopoulos: ores-extension: replace thresholds with numeric values [mediawiki-config] - ''https://gerrit.wikimedia.org/r/948542 (https://phabricator.wikimedia.org/T343308)'
2023-08-23 16:56:26 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2053.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 16:57:43 <jinxer-wm> (SystemdUnitFailed) firing: (5) wdqs-blazegraph.service Failed on wdqs1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-08-23 16:59:08 <wikibugs> ('PS3) ''Dduvall: gitlab: Support loading of local gems [puppet] - ''https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570)'
2023-08-23 16:59:10 <wikibugs> ('CR) ''Dduvall: gitlab: Support loading of local gems (''6 comments) [puppet] - ''https://gerrit.wikimedia.org/r/951580 (https://phabricator.wikimedia.org/T337570) (owner: ''Dduvall)'
2023-08-23 17:00:05 <jouncebot> Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1700)
2023-08-23 17:00:08 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2053']
2023-08-23 17:00:11 <wikibugs> 'sre-alert-triage, ''Release-Engineering-Team: Alert triage: overdue critical alert - https://phabricator.wikimedia.org/T342755 (''thcipriani) Hrm. We get an email from the systemd timer for this, so the alert is probably not necessary. We're not very familiar with alertmanager. Can we just remove this alert?'
2023-08-23 17:02:43 <jinxer-wm> (SystemdUnitFailed) firing: (5) wdqs-blazegraph.service Failed on wdqs1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-08-23 17:03:35 <logmsgbot> !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1009.eqiad.wmnet with reason: jnl export
2023-08-23 17:03:48 <logmsgbot> !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1009.eqiad.wmnet with reason: jnl export
2023-08-23 17:05:04 <herron> !log set icinga downtime on wikitech-static
2023-08-23 17:05:06 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 17:05:11 <wikibugs> 'SRE, ''ops-eqiad, ''decommission-hardware, ''serviceops: Decommission thumbor100[12] - https://phabricator.wikimedia.org/T344598 (''ayounsi) I'm going to hijack those 2 hosts before they get decommissioned for some tests. I'll rename them ganeti-test1001/1002.'
2023-08-23 17:06:15 <denisse> !log reboot alert2001 for a kernel upgrade
2023-08-23 17:06:17 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 17:07:31 <logmsgbot> !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host alert2001.wikimedia.org
2023-08-23 17:07:32 <logmsgbot> !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host alert2001.wikimedia.org
2023-08-23 17:08:58 <icinga-wm> RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
2023-08-23 17:09:58 <wikibugs> ('PS1) ''Brennen Bearnes: Revert "clienthints: Remove duplicate entries when converting to DB rows" [extensions/CheckUser] (wmf/1.41.0-wmf.22) - ''https://gerrit.wikimedia.org/r/951855'
2023-08-23 17:10:36 <jouncebot> In 0 hour(s) and 49 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1800)
2023-08-23 17:10:36 <brennen> jouncebot: nowandnext
2023-08-23 17:10:36 <jouncebot> For the next 0 hour(s) and 49 minute(s): MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1700)
2023-08-23 17:10:36 <jouncebot> In 0 hour(s) and 49 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1800)
2023-08-23 17:10:41 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2053']
2023-08-23 17:11:19 <wikibugs> ('CR) ''Effie Mouzeli: "We believe that this commit bumped our rps, thus SRE requested this revert https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-r" [extensions/CheckUser] (wmf/1.41.0-wmf.22) - ''https://gerrit.wikimedia.org/r/951855 (owner: ''Brennen Bearnes)'
2023-08-23 17:11:31 <wikibugs> ('CR) ''Effie Mouzeli: [C: ''+1] Revert "clienthints: Remove duplicate entries when converting to DB rows" [extensions/CheckUser] (wmf/1.41.0-wmf.22) - ''https://gerrit.wikimedia.org/r/951855 (owner: ''Brennen Bearnes)'
2023-08-23 17:11:49 <wikibugs> ('CR) ''TrainBranchBot: [C: ''+2] "Approved by brennen@deploy1002 using scap backport" [extensions/CheckUser] (wmf/1.41.0-wmf.22) - ''https://gerrit.wikimedia.org/r/951855 (owner: ''Brennen Bearnes)'
2023-08-23 17:13:14 <icinga-wm> PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
2023-08-23 17:15:39 <jinxer-wm> (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
2023-08-23 17:16:07 <wikibugs> ('CR) ''Andrea Denisse: [C: ''+2] alerting_host: Failover Icinga and Alertmanger from eqiad to codfw [puppet] - ''https://gerrit.wikimedia.org/r/951968 (https://phabricator.wikimedia.org/T344671) (owner: ''Andrea Denisse)'
2023-08-23 17:17:30 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
2023-08-23 17:18:03 <jinxer-wm> (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2023-08-23 17:19:06 <logmsgbot> !log hnowlan@deploy1002 helmfile [staging] START helmfile.d/services/geo-analytics: apply
2023-08-23 17:19:17 <logmsgbot> !log hnowlan@deploy1002 helmfile [staging] DONE helmfile.d/services/geo-analytics: apply
2023-08-23 17:19:46 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for kubernetes2040-kubernetes2052 - pt1979@cumin2002"
2023-08-23 17:20:31 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS entries for kubernetes2040-kubernetes2052 - pt1979@cumin2002"
2023-08-23 17:20:31 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2023-08-23 17:21:37 <wikibugs> ('CR) ''Kosta Harlan: "I don't see how this is related to a jump in RPS. This patch fixed a narrow case where someone (probably manually tampering with the colle" [extensions/CheckUser] (wmf/1.41.0-wmf.22) - ''https://gerrit.wikimedia.org/r/951855 (owner: ''Brennen Bearnes)'
2023-08-23 17:22:20 <logmsgbot> !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_eqiad and A:cp
2023-08-23 17:22:40 <logmsgbot> !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_eqiad and A:cp
2023-08-23 17:23:03 <jinxer-wm> (ProbeDown) resolved: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2023-08-23 17:23:07 <logmsgbot> !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-reboot (exit_code=97) rolling reboot on A:cp-text_eqiad and A:cp
2023-08-23 17:23:08 <logmsgbot> !log brett@cumin2002 END (ERROR) - Cookbook sre.cdn.roll-reboot (exit_code=97) rolling reboot on A:cp-upload_eqiad and A:cp
2023-08-23 17:23:14 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2053.codfw.wmnet with OS bullseye
2023-08-23 17:23:22 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2053.codfw.wmnet with OS bullseye'
2023-08-23 17:24:06 <logmsgbot> !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-upload_codfw and A:cp
2023-08-23 17:24:22 <logmsgbot> !log brett@cumin2002 START - Cookbook sre.cdn.roll-reboot rolling reboot on A:cp-text_codfw and A:cp
2023-08-23 17:24:49 <kostajh> brennen: per discussion in -serviceops I'd suggest not reverting... or if it's too late, I guess I can make a revert of the revert :)
2023-08-23 17:25:02 <brennen> i think i can -2 it so it doesn't merge
2023-08-23 17:25:18 <wikibugs> ('CR) ''Brennen Bearnes: [C: ''-2] Revert "clienthints: Remove duplicate entries when converting to DB rows" [extensions/CheckUser] (wmf/1.41.0-wmf.22) - ''https://gerrit.wikimedia.org/r/951855 (owner: ''Brennen Bearnes)'
2023-08-23 17:25:47 <brennen> kostajh: and killed scap
2023-08-23 17:26:06 <kostajh> thank you
2023-08-23 17:26:42 <wikibugs> ('Abandoned) ''Brennen Bearnes: Revert "clienthints: Remove duplicate entries when converting to DB rows" [extensions/CheckUser] (wmf/1.41.0-wmf.22) - ''https://gerrit.wikimedia.org/r/951855 (owner: ''Brennen Bearnes)'
2023-08-23 17:27:02 <rzl> sorry for the runaround brennen <3 appreciate the quick response anyway
2023-08-23 17:27:29 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (''Papaul)'
2023-08-23 17:27:36 <brennen> rzl: no worries.
2023-08-23 17:28:04 <urbanecm> can someone bring logmsgbot back to life please? :))
2023-08-23 17:28:07 <urbanecm> (should be `tcpircbot-logmsgbot` on alert1001)
2023-08-23 17:28:41 <rzl> urbanecm: there's some maintenance ongoing on alert hosts, it should return soonish AIUI
2023-08-23 17:28:42 <rzl> denisse: ^ fyi
2023-08-23 17:28:49 <urbanecm> ack, ty
2023-08-23 17:28:58 <brennen> i am going to have to go afk for a bit here, so if further deploy followup does turn out to be needed please coordinate with train folks for the upcoming window. (i think that's dduvall today.)
2023-08-23 17:29:07 <rzl> brennen: ack
2023-08-23 17:29:25 <denisse> urbanecm: Yes, we're doing some maintenance on those hosts. Apologies for the downtime caused!
2023-08-23 17:29:47 <urbanecm> np, thought it just disconnected randomly (ircbots have a tendency of doing that :D )
2023-08-23 17:29:56 <wikibugs> ('CR) ''Andrea Denisse: [C: ''+2] dns: Repoint alert host services from alert1001 to alert2001 [dns] - ''https://gerrit.wikimedia.org/r/951969 (https://phabricator.wikimedia.org/T344671) (owner: ''Andrea Denisse)'
2023-08-23 17:31:26 <denisse> !log failing over alert1001 to alert2001
2023-08-23 17:31:28 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 17:43:55 <wikibugs> 'SRE, ''ops-eqiad, ''DC-Ops, ''cloud-services-team (Hardware): Q3:rack/setup/install cloudcephosd10(3[5-9]|40) - https://phabricator.wikimedia.org/T324998 (''Papaul) @nskaggs please see last update @https://phabricator.wikimedia.org/T339131'
2023-08-23 17:47:25 <denisse> !log make alert2001 the active host
2023-08-23 17:47:28 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 17:49:11 <jinxer-wm> (AlertManagerNotificationFail) firing: AlertManager is failing to deliver notifications - https://wikitech.wikimedia.org/wiki/Alertmanager#Alerts - https://grafana.wikimedia.org/d/eea-9_sik/alertmanager - https://alerts.wikimedia.org/?q=alertname%3DAlertManagerNotificationFail
2023-08-23 17:49:14 <jinxer-wm> (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
2023-08-23 17:49:18 <jinxer-wm> (RedisMemoryFull) firing: (3) Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 17:49:23 <jinxer-wm> (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2023-08-23 17:49:27 <jinxer-wm> (RedisMemoryFull) firing: (6) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 17:49:31 <jinxer-wm> (AlertManagerNotificationFail) resolved: AlertManager is failing to deliver notifications - https://wikitech.wikimedia.org/wiki/Alertmanager#Alerts - https://grafana.wikimedia.org/d/eea-9_sik/alertmanager - https://alerts.wikimedia.org/?q=alertname%3DAlertManagerNotificationFail
2023-08-23 17:50:52 <icinga-wm> PROBLEM - Check systemd state on alert2001 is CRITICAL: CRITICAL - degraded: The following units failed: sync-icinga-state.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 17:51:46 <logmsgbot> !log denisse@cumin1001 START - Cookbook sre.hosts.reboot-single for host alert1001.wikimedia.org
2023-08-23 17:51:47 <logmsgbot> !log denisse@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=99) for host alert1001.wikimedia.org
2023-08-23 17:52:56 <jinxer-wm> (JobUnavailable) firing: (4) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2023-08-23 17:53:37 <wikibugs> 'SRE: Wiki store page indexing issues detected - https://phabricator.wikimedia.org/T344844 (''SHust)'
2023-08-23 17:57:43 <jinxer-wm> (KeyholderUnarmed) firing: 1 unarmed Keyholder key(s) on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
2023-08-23 17:58:58 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 17:59:18 <wikibugs> ('PS1) ''Papaul: Add new kubernetes node to netboot.cfg [puppet] - ''https://gerrit.wikimedia.org/r/951975 (https://phabricator.wikimedia.org/T342534)'
2023-08-23 17:59:48 <wikibugs> ('PS1) ''Andrea Denisse: Revert "alerting_host: Failover Icinga and Alertmanger from eqiad to codfw" [puppet] - ''https://gerrit.wikimedia.org/r/951856'
2023-08-23 18:00:06 <jouncebot> dduvall and dancy: Your horoscope predicts another unfortunate Train log triage with CPT deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1800).
2023-08-23 18:00:06 <jouncebot> dduvall and dancy: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for MediaWiki train - Utc-7 Version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T1800).
2023-08-23 18:00:10 <wikibugs> ('CR) ''Papaul: [C: ''+2] Add new kubernetes node to netboot.cfg [puppet] - ''https://gerrit.wikimedia.org/r/951975 (https://phabricator.wikimedia.org/T342534) (owner: ''Papaul)'
2023-08-23 18:02:00 <wikibugs> ('CR) ''Andrea Denisse: [C: ''+2] Revert "alerting_host: Failover Icinga and Alertmanger from eqiad to codfw" [puppet] - ''https://gerrit.wikimedia.org/r/951856 (owner: ''Andrea Denisse)'
2023-08-23 18:02:06 <wikibugs> ('CR) ''Herron: [C: ''+1] Revert "alerting_host: Failover Icinga and Alertmanger from eqiad to codfw" [puppet] - ''https://gerrit.wikimedia.org/r/951856 (owner: ''Andrea Denisse)'
2023-08-23 18:02:39 <jinxer-wm> (KeyholderUnarmed) resolved: 1 unarmed Keyholder key(s) on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Keyholder - TODO - https://alerts.wikimedia.org/?q=alertname%3DKeyholderUnarmed
2023-08-23 18:03:48 <denisse> !log failing over from alert2001 to alert1001
2023-08-23 18:03:51 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 18:03:57 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 18:06:22 <wikibugs> ('PS1) ''TrainBranchBot: group1 wikis to 1.41.0-wmf.23 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/951976 (https://phabricator.wikimedia.org/T343725)'
2023-08-23 18:06:24 <wikibugs> ('CR) ''TrainBranchBot: [C: ''+2] group1 wikis to 1.41.0-wmf.23 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/951976 (https://phabricator.wikimedia.org/T343725) (owner: ''TrainBranchBot)'
2023-08-23 18:07:03 <wikibugs> ('Merged) ''jenkins-bot: group1 wikis to 1.41.0-wmf.23 [mediawiki-config] - ''https://gerrit.wikimedia.org/r/951976 (https://phabricator.wikimedia.org/T343725) (owner: ''TrainBranchBot)'
2023-08-23 18:08:44 <wikibugs> ('PS1) ''Andrea Denisse: Revert "dns: Repoint alert host services from alert1001 to alert2001" [dns] - ''https://gerrit.wikimedia.org/r/951857'
2023-08-23 18:09:29 <wikibugs> ('CR) ''Andrea Denisse: [C: ''+2] Revert "dns: Repoint alert host services from alert1001 to alert2001" [dns] - ''https://gerrit.wikimedia.org/r/951857 (owner: ''Andrea Denisse)'
2023-08-23 18:09:31 <wikibugs> ('CR) ''Andrea Denisse: [V: ''+2 C: ''+2] Revert "dns: Repoint alert host services from alert1001 to alert2001" [dns] - ''https://gerrit.wikimedia.org/r/951857 (owner: ''Andrea Denisse)'
2023-08-23 18:09:54 <denisse> !log updating DNS to point to alert1001
2023-08-23 18:09:57 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 18:10:52 <wikibugs> ('CR) ''Herron: [C: ''+1] Revert "dns: Repoint alert host services from alert1001 to alert2001" [dns] - ''https://gerrit.wikimedia.org/r/951857 (owner: ''Andrea Denisse)'
2023-08-23 18:13:38 <denisse> !log making alert1001 the primary alert host
2023-08-23 18:13:40 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 18:15:52 <jinxer-wm> (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
2023-08-23 18:15:57 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 18:17:17 <denisse> !log alert hosts maintenance finished
2023-08-23 18:17:19 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 18:19:29 <herron> !log re-enabled icinga meta-monitoring on wikitech-static
2023-08-23 18:19:31 <stashbot> Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
2023-08-23 18:19:51 <logmsgbot> !log dduvall@deploy1002 Synchronized php: group1 wikis to 1.41.0-wmf.23 refs T343725 (duration: 06m 01s)
2023-08-23 18:19:56 <stashbot> T343725: 1.41.0-wmf.23 deployment blockers - https://phabricator.wikimedia.org/T343725
2023-08-23 18:21:34 <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-08-23 18:26:03 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 18:26:34 <jinxer-wm> (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-08-23 18:30:57 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 18:38:49 <logmsgbot> !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2053.codfw.wmnet with OS bullseye
2023-08-23 18:38:56 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2053.codfw.wmnet with OS bullseye executed with errors: - kubernetes20...'
2023-08-23 18:40:57 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 18:43:18 <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-08-23 18:45:18 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2053.codfw.wmnet with OS bullseye
2023-08-23 18:45:25 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2053.codfw.wmnet with OS bullseye'
2023-08-23 18:45:59 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host cassandra-dev2001.codfw.wmnet
2023-08-23 18:48:18 <jinxer-wm> (KubernetesAPILatency) resolved: High Kubernetes API latency (POST nodes) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-08-23 18:51:25 <icinga-wm> RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
2023-08-23 18:55:43 <icinga-wm> PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
2023-08-23 18:55:50 <logmsgbot> !log htriedman@deploy1002 Started deploy [airflow-dags/platform_eng@33de526]: (no justification provided)
2023-08-23 18:56:11 <logmsgbot> !log htriedman@deploy1002 Finished deploy [airflow-dags/platform_eng@33de526]: (no justification provided) (duration: 00m 20s)
2023-08-23 18:57:21 <logmsgbot> !log eevans@cumin1001 END (FAIL) - Cookbook sre.hosts.reboot-single (exit_code=1) for host cassandra-dev2001.codfw.wmnet
2023-08-23 19:00:04 <wikibugs> ('PS1) ''Eevans: cassandra-dev: monitor tls on port 7000 [puppet] - ''https://gerrit.wikimedia.org/r/951980'
2023-08-23 19:00:44 <wikibugs> ('CR) ''Eevans: [C: ''+2] cassandra-dev: monitor tls on port 7000 [puppet] - ''https://gerrit.wikimedia.org/r/951980 (owner: ''Eevans)'
2023-08-23 19:06:43 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host cassandra-dev2002.codfw.wmnet
2023-08-23 19:09:42 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2053.codfw.wmnet with reason: host reimage
2023-08-23 19:11:05 <jinxer-wm> (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
2023-08-23 19:12:50 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2053.codfw.wmnet with reason: host reimage
2023-08-23 19:13:23 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cassandra-dev2002.codfw.wmnet
2023-08-23 19:14:35 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host cassandra-dev2003.codfw.wmnet
2023-08-23 19:20:07 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2052.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 19:20:57 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 19:21:01 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cassandra-dev2003.codfw.wmnet
2023-08-23 19:28:56 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 19:29:29 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
2023-08-23 19:31:22 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2052.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 19:31:24 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 19:31:25 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2053.codfw.wmnet with OS bullseye
2023-08-23 19:31:32 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2053.codfw.wmnet with OS bullseye completed: - kubernetes2053 (**PASS*...'
2023-08-23 19:31:49 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mgmt DNS for kubernetes2051 - pt1979@cumin2002"
2023-08-23 19:31:49 <icinga-wm> PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is CRITICAL: 1.005e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
2023-08-23 19:32:10 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2052']
2023-08-23 19:32:34 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add mgmt DNS for kubernetes2051 - pt1979@cumin2002"
2023-08-23 19:32:34 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
2023-08-23 19:34:09 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2051.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 19:35:29 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2013.codfw.wmnet
2023-08-23 19:35:57 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 19:41:05 <jinxer-wm> (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
2023-08-23 19:43:21 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2013.codfw.wmnet
2023-08-23 19:43:24 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2014.codfw.wmnet
2023-08-23 19:45:19 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2052']
2023-08-23 19:46:46 <logmsgbot> !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host kubernetes2051.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 19:47:39 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2052.codfw.wmnet with OS bullseye
2023-08-23 19:47:47 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2052.codfw.wmnet with OS bullseye'
2023-08-23 19:48:31 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2051.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 19:49:10 <wikibugs> 'SRE, ''SRE-swift-storage, ''Beta-Cluster-Infrastructure: Captchas are broken in the beta cluster - https://phabricator.wikimedia.org/T344834 (''Urbanecm_WMF) Adding some SRE tags.'
2023-08-23 19:49:57 <wikibugs> 'SRE, ''SRE-swift-storage, ''Beta-Cluster-Infrastructure: Captchas are broken in the beta cluster - https://phabricator.wikimedia.org/T344834 (''RhinosF1)'
2023-08-23 19:50:03 <RhinosF1> urbanecm: conflicted with you. Oops.
2023-08-23 19:52:17 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2014.codfw.wmnet
2023-08-23 19:52:21 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2019.codfw.wmnet
2023-08-23 19:53:53 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2050.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 19:54:12 <urbanecm> RhinosF1: no worries :)
2023-08-23 19:55:48 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2049.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 19:59:17 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2051.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 20:00:04 <jouncebot> RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: #bothumor My software never has bugs. It just develops random features. Rise for UTC late backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230823T2000).
2023-08-23 20:00:04 <jouncebot> hmonroy, Dreamy_Jazz, and kizule: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
2023-08-23 20:00:11 <Dreamy_Jazz> \o
2023-08-23 20:00:25 <hmonroy> \o
2023-08-23 20:00:37 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2019.codfw.wmnet
2023-08-23 20:00:41 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2021.codfw.wmnet
2023-08-23 20:00:58 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 20:03:22 <wikibugs> ('PS4) ''HMonroy: wikidiff2: set maxSplitSize = 10 on group1 wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/950049 (https://phabricator.wikimedia.org/T341754)'
2023-08-23 20:03:57 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2050.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 20:04:09 <icinga-wm> RECOVERY - OSPF status on cr2-eqiad is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
2023-08-23 20:04:13 <Dreamy_Jazz> Is anyone around for the backport window?
2023-08-23 20:05:15 <hmonroy> not sure, I can try deploying but it would be my first time doing it
2023-08-23 20:05:39 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2049.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 20:06:13 <RhinosF1> hmonroy: https://deploy-commands.toolforge.org/ might be useful but I was talking to a deployer a moment ago
2023-08-23 20:06:18 <RhinosF1> Please hold a few minutes
2023-08-23 20:06:42 <urbanecm> i can deploy today :)
2023-08-23 20:06:51 <Dreamy_Jazz> :D
2023-08-23 20:06:59 <urbanecm> hi Dreamy_Jazz !
2023-08-23 20:07:05 <Dreamy_Jazz> Hi there
2023-08-23 20:07:11 <urbanecm> hmonroy: wanna hop on a call and try doing the deployment for your patch? :)
2023-08-23 20:07:25 <hmonroy> urbanecm: yes!
2023-08-23 20:08:04 <urbanecm> hmonroy: pm sent with a link
2023-08-23 20:08:31 <icinga-wm> PROBLEM - OSPF status on cr2-eqiad is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/5 UP : 6 v2 P2P interfaces vs. 5 v3 P2P interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
2023-08-23 20:09:29 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2021.codfw.wmnet
2023-08-23 20:09:32 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2024.codfw.wmnet
2023-08-23 20:10:57 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 20:11:05 <jinxer-wm> (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
2023-08-23 20:11:09 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2051.codfw.wmnet with OS bullseye
2023-08-23 20:11:16 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2051.codfw.wmnet with OS bullseye'
2023-08-23 20:11:18 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2052.codfw.wmnet with reason: host reimage
2023-08-23 20:11:47 <wikibugs> ('CR) ''TrainBranchBot: [C: ''+2] "Approved by hmonroy@deploy1002 using scap backport" [mediawiki-config] - ''https://gerrit.wikimedia.org/r/950049 (https://phabricator.wikimedia.org/T341754) (owner: ''HMonroy)'
2023-08-23 20:12:28 <wikibugs> ('Merged) ''jenkins-bot: wikidiff2: set maxSplitSize = 10 on group1 wikis [mediawiki-config] - ''https://gerrit.wikimedia.org/r/950049 (https://phabricator.wikimedia.org/T341754) (owner: ''HMonroy)'
2023-08-23 20:12:58 <logmsgbot> !log hmonroy@deploy1002 Started scap: Backport for [[gerrit:950049|wikidiff2: set maxSplitSize = 10 on group1 wikis (T341754)]]
2023-08-23 20:13:03 <stashbot> T341754: Deploy wikidiff2 paragraph split detection - https://phabricator.wikimedia.org/T341754
2023-08-23 20:14:27 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2052.codfw.wmnet with reason: host reimage
2023-08-23 20:14:32 <logmsgbot> !log hmonroy@deploy1002 hmonroy: Backport for [[gerrit:950049|wikidiff2: set maxSplitSize = 10 on group1 wikis (T341754)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
2023-08-23 20:15:41 <wikibugs> 'SRE: store.wikimedia.org page indexing issues detected by google search console - https://phabricator.wikimedia.org/T344844 (''Peachey88)'
2023-08-23 20:15:41 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2050.codfw.wmnet with OS bullseye
2023-08-23 20:15:48 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2050.codfw.wmnet with OS bullseye'
2023-08-23 20:15:52 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2049.codfw.wmnet with OS bullseye
2023-08-23 20:15:59 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2049.codfw.wmnet with OS bullseye'
2023-08-23 20:17:05 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2048.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 20:17:25 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2024.codfw.wmnet
2023-08-23 20:18:01 <logmsgbot> !log hmonroy@deploy1002 hmonroy: Continuing with sync
2023-08-23 20:18:20 <wikibugs> 'SRE, ''Search-Console-access-request: store.wikimedia.org page indexing issues detected by google search console - https://phabricator.wikimedia.org/T344844 (''RhinosF1) Hi, I don't believe SRE maintain search console access. I added the main tag and I think @SCherukuwada is the POC'
2023-08-23 20:23:22 <logmsgbot> !log hmonroy@deploy1002 Finished scap: Backport for [[gerrit:950049|wikidiff2: set maxSplitSize = 10 on group1 wikis (T341754)]] (duration: 10m 24s)
2023-08-23 20:23:28 <stashbot> T341754: Deploy wikidiff2 paragraph split detection - https://phabricator.wikimedia.org/T341754
2023-08-23 20:24:32 <hmonroy> Dreamy_Jazz: your patch will be deployed next :)
2023-08-23 20:24:39 <Dreamy_Jazz> Thanks!
2023-08-23 20:24:55 <wikibugs> (PS2) HMonroy: clienthints: Lower API max lag time to 5 minutes on group0 and 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/951833 (https://phabricator.wikimedia.org/T344797) (owner: Dreamy Jazz)
2023-08-23 20:25:03 <jinxer-wm> (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2023-08-23 20:25:12 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2015.codfw.wmnet
2023-08-23 20:25:28 <wikibugs> (CR) TrainBranchBot: [C: +2] "Approved by hmonroy@deploy1002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/951833 (https://phabricator.wikimedia.org/T344797) (owner: Dreamy Jazz)
2023-08-23 20:25:44 <Dreamy_Jazz> I don't have any easy way to test this other than waiting 5 minutes from an edit to see if the request fails.
2023-08-23 20:26:12 <wikibugs> (Merged) jenkins-bot: clienthints: Lower API max lag time to 5 minutes on group0 and 1 [mediawiki-config] - https://gerrit.wikimedia.org/r/951833 (https://phabricator.wikimedia.org/T344797) (owner: Dreamy Jazz)
2023-08-23 20:26:19 <Dreamy_Jazz> If you would like me to do that, I can do so.
2023-08-23 20:26:39 <logmsgbot> !log hmonroy@deploy1002 Started scap: Backport for [[gerrit:951833|clienthints: Lower API max lag time to 5 minutes on group0 and 1 (T344797)]]
2023-08-23 20:26:43 <Dreamy_Jazz> Actually, I can make that testing edit now.
2023-08-23 20:26:43 <stashbot> T344797: Decrease CheckUserClientHintsRestApiMaxTimeLag config on production wikis - https://phabricator.wikimedia.org/T344797
2023-08-23 20:26:47 <Dreamy_Jazz> That should reduce the time.
2023-08-23 20:27:16 <hmonroy> Dreamy_Jazz: we can proceed and revert if anything fails
2023-08-23 20:28:04 <Dreamy_Jazz> Sure. I've made that testing edit now and already set a timer.
2023-08-23 20:28:10 <logmsgbot> !log hmonroy@deploy1002 dreamyjazz and hmonroy: Backport for [[gerrit:951833|clienthints: Lower API max lag time to 5 minutes on group0 and 1 (T344797)]] synced to the testservers mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, and mw-debug kubernetes deployment (accessible via k8s-experimental XWD option)
2023-08-23 20:28:26 <logmsgbot> !log hmonroy@deploy1002 dreamyjazz and hmonroy: Continuing with sync
2023-08-23 20:30:03 <jinxer-wm> (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
2023-08-23 20:30:30 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2048.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 20:30:31 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 20:32:38 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2015.codfw.wmnet
2023-08-23 20:32:41 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2016.codfw.wmnet
2023-08-23 20:32:44 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 20:32:45 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2052.codfw.wmnet with OS bullseye
2023-08-23 20:32:53 <wikibugs> SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2052.codfw.wmnet with OS bullseye completed: - kubernetes2052 (**PASS*...
2023-08-23 20:33:40 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2048']
2023-08-23 20:33:48 <logmsgbot> !log hmonroy@deploy1002 Finished scap: Backport for [[gerrit:951833|clienthints: Lower API max lag time to 5 minutes on group0 and 1 (T344797)]] (duration: 07m 09s)
2023-08-23 20:33:52 <stashbot> T344797: Decrease CheckUserClientHintsRestApiMaxTimeLag config on production wikis - https://phabricator.wikimedia.org/T344797
2023-08-23 20:33:58 <Dreamy_Jazz> Works as expected.
2023-08-23 20:34:21 <hmonroy> Dreamy_Jazz: Awesome! It's in production now :)
2023-08-23 20:34:21 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2051.codfw.wmnet with reason: host reimage
2023-08-23 20:34:25 <Dreamy_Jazz> Thanks!
2023-08-23 20:34:41 <hmonroy> Dreamy_Jazz: NP!
2023-08-23 20:35:23 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2047.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 20:37:35 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2051.codfw.wmnet with reason: host reimage
2023-08-23 20:39:15 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2050.codfw.wmnet with reason: host reimage
2023-08-23 20:40:05 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2049.codfw.wmnet with reason: host reimage
2023-08-23 20:41:01 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2016.codfw.wmnet
2023-08-23 20:41:05 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2020.codfw.wmnet
2023-08-23 20:42:33 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2050.codfw.wmnet with reason: host reimage
2023-08-23 20:45:04 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2048']
2023-08-23 20:45:06 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2049.codfw.wmnet with reason: host reimage
2023-08-23 20:45:06 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2047.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 20:45:42 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2047']
2023-08-23 20:46:44 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2048.codfw.wmnet with OS bullseye
2023-08-23 20:46:51 <wikibugs> SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2048.codfw.wmnet with OS bullseye
2023-08-23 20:48:54 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2020.codfw.wmnet
2023-08-23 20:48:57 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2022.codfw.wmnet
2023-08-23 20:51:05 <jinxer-wm> (SwiftTooManyMediaUploads) firing: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
2023-08-23 20:52:00 <icinga-wm> PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received: /en.wikipedia.org/v1/page/mobile-html/{title} (Get mobile-html from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 20:53:27 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 20:54:02 <icinga-wm> RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
2023-08-23 20:54:25 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 20:54:26 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2051.codfw.wmnet with OS bullseye
2023-08-23 20:54:33 <wikibugs> SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2051.codfw.wmnet with OS bullseye completed: - kubernetes2051 (**PASS*...
2023-08-23 20:55:00 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2046.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 20:56:40 <icinga-wm> PROBLEM - Check systemd state on an-presto1002 is CRITICAL: CRITICAL - degraded: The following units failed: presto-server.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 20:57:18 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2022.codfw.wmnet
2023-08-23 20:57:21 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2025.codfw.wmnet
2023-08-23 20:58:03 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2047']
2023-08-23 20:58:31 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2047.codfw.wmnet with OS bullseye
2023-08-23 20:58:38 <wikibugs> SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2047.codfw.wmnet with OS bullseye
2023-08-23 20:58:39 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 21:01:11 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 21:02:17 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 21:02:18 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2050.codfw.wmnet with OS bullseye
2023-08-23 21:02:24 <wikibugs> SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2050.codfw.wmnet with OS bullseye completed: - kubernetes2050 (**PASS*...
2023-08-23 21:02:35 <logmsgbot> !log pt1979@cumin2002 END (FAIL) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=99) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 21:02:36 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2049.codfw.wmnet with OS bullseye
2023-08-23 21:02:42 <wikibugs> SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2049.codfw.wmnet with OS bullseye completed: - kubernetes2049 (**WARN*...
2023-08-23 21:04:42 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2046.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 21:05:04 <icinga-wm> RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on alert1001 is OK: (C)1e+05 gt (W)1e+04 gt 3679 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
2023-08-23 21:05:19 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2046']
2023-08-23 21:05:30 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2045.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 21:05:41 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2025.codfw.wmnet
2023-08-23 21:05:47 <logmsgbot> !log bking@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on wdqs1009.eqiad.wmnet with reason: jnl export/downtime test
2023-08-23 21:05:50 <logmsgbot> !log bking@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on wdqs1009.eqiad.wmnet with reason: jnl export/downtime test
2023-08-23 21:06:44 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2044.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 21:07:42 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2012.codfw.wmnet
2023-08-23 21:15:21 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2045.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 21:15:33 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2012.codfw.wmnet
2023-08-23 21:15:36 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2017.codfw.wmnet
2023-08-23 21:17:03 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2044.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 21:19:29 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2047.codfw.wmnet with reason: host reimage
2023-08-23 21:20:58 <icinga-wm> RECOVERY - Check systemd state on an-presto1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
2023-08-23 21:23:00 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2047.codfw.wmnet with reason: host reimage
2023-08-23 21:23:37 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2017.codfw.wmnet
2023-08-23 21:23:40 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2018.codfw.wmnet
2023-08-23 21:25:57 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 21:30:57 <jinxer-wm> (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 21:32:23 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2018.codfw.wmnet
2023-08-23 21:32:26 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2023.codfw.wmnet
2023-08-23 21:38:25 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 21:40:45 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2023.codfw.wmnet
2023-08-23 21:40:48 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2026.codfw.wmnet
2023-08-23 21:41:05 <jinxer-wm> (SwiftTooManyMediaUploads) resolved: (2) Too many eqiad mediawiki originals uploads - https://wikitech.wikimedia.org/wiki/Swift/How_To#mediawiki_originals_uploads - https://alerts.wikimedia.org/?q=alertname%3DSwiftTooManyMediaUploads
2023-08-23 21:44:30 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2046']
2023-08-23 21:44:37 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 21:44:38 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2047.codfw.wmnet with OS bullseye
2023-08-23 21:44:45 <wikibugs> SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2047.codfw.wmnet with OS bullseye completed: - kubernetes2047 (**PASS*...
2023-08-23 21:49:00 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2026.codfw.wmnet
2023-08-23 21:49:03 <logmsgbot> !log eevans@cumin1001 START - Cookbook sre.hosts.reboot-single for host restbase2027.codfw.wmnet
2023-08-23 21:49:44 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2046.codfw.wmnet with OS bullseye
2023-08-23 21:49:52 <wikibugs> SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2046.codfw.wmnet with OS bullseye
2023-08-23 21:50:14 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2045']
2023-08-23 21:50:57 <jinxer-wm> (RedisMemoryFull) firing: (3) Redis memory full on gitlab1003:9121 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_gitlab - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 21:51:23 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2044']
2023-08-23 21:51:33 <logmsgbot> !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['kubernetes2044']
2023-08-23 21:51:59 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2044']
2023-08-23 21:52:28 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2043.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 21:55:57 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 21:56:02 <jinxer-wm> (JobUnavailable) firing: (3) Reduced availability for job cloudlb_haproxy in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
2023-08-23 21:57:16 <logmsgbot> !log eevans@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host restbase2027.codfw.wmnet
2023-08-23 22:00:32 <wikibugs> (PS1) Urbanecm: mediawiki::mcrouter_wancache: add wikifunctions entry [puppet] - https://gerrit.wikimedia.org/r/952000 (https://phabricator.wikimedia.org/T344147)
2023-08-23 22:00:57 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 22:04:14 <wikibugs> SRE, SRE-Access-Requests: Requesting access to analytics-privatedata-users and Kerberos identity for RMaung - https://phabricator.wikimedia.org/T330335 (Rmaung) Hello! I have been issued a new laptop, and now I'm not sure how to set up production access once again. Do I need to provide a new ssh public k...
2023-08-23 22:04:17 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2045']
2023-08-23 22:04:25 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2044']
2023-08-23 22:04:25 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2043.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 22:04:51 <logmsgbot> !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-text_codfw and A:cp
2023-08-23 22:05:25 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2045.codfw.wmnet with OS bullseye
2023-08-23 22:05:33 <wikibugs> SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2045.codfw.wmnet with OS bullseye
2023-08-23 22:06:57 <logmsgbot> !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubernetes2048.codfw.wmnet with OS bullseye
2023-08-23 22:07:04 <wikibugs> SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2048.codfw.wmnet with OS bullseye executed with errors: - kubernetes20...
2023-08-23 22:07:21 <wikibugs> (CR) Btullis: [C: +1] "Looks good. Thanks again." [puppet] - https://gerrit.wikimedia.org/r/929643 (https://phabricator.wikimedia.org/T254480) (owner: Slyngshede)
2023-08-23 22:08:25 <logmsgbot> !log brett@cumin2002 END (PASS) - Cookbook sre.cdn.roll-reboot (exit_code=0) rolling reboot on A:cp-upload_codfw and A:cp
2023-08-23 22:09:14 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2044.codfw.wmnet with OS bullseye
2023-08-23 22:09:21 <wikibugs> SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2044.codfw.wmnet with OS bullseye
2023-08-23 22:11:46 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2046.codfw.wmnet with reason: host reimage
2023-08-23 22:13:23 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2043']
2023-08-23 22:13:27 <wikibugs> (CR) Jforrester: [C: +1] mediawiki::mcrouter_wancache: add wikifunctions entry [puppet] - https://gerrit.wikimedia.org/r/952000 (https://phabricator.wikimedia.org/T344147) (owner: Urbanecm)
2023-08-23 22:13:46 <wikibugs> (PS1) JHathaway: puppetserver: ensure correct ordering when using an intermediate cert [puppet] - https://gerrit.wikimedia.org/r/952003 (https://phabricator.wikimedia.org/T344868)
2023-08-23 22:15:05 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2046.codfw.wmnet with reason: host reimage
2023-08-23 22:17:20 <wikibugs> SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (Papaul) @Jhancock.wm hey looks like i have no link on kubernetes2048. Thanks ` papaul@asw-d-codfw> show interfaces descriptions ge-5/0/28 Interface Admin Link Descript...
2023-08-23 22:19:03 <logmsgbot> !log ryankemper@cumin1001 START - Cookbook sre.wdqs.reboot
2023-08-23 22:19:59 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2042.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 22:20:57 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 22:24:10 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2043']
2023-08-23 22:25:54 <icinga-wm> PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2007 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook
2023-08-23 22:25:57 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 22:26:12 <jinxer-wm> (SystemdUnitFailed) firing: nginx.service Failed on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-08-23 22:26:35 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2043.codfw.wmnet with OS bullseye
2023-08-23 22:26:44 <wikibugs> SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2043.codfw.wmnet with OS bullseye
2023-08-23 22:26:52 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2045.codfw.wmnet with reason: host reimage
2023-08-23 22:30:53 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 22:30:54 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2044.codfw.wmnet with reason: host reimage
2023-08-23 22:30:57 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 22:31:12 <jinxer-wm> (SystemdUnitFailed) resolved: nginx.service Failed on wdqs2007:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
2023-08-23 22:31:22 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2042.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 22:31:50 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 22:31:51 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2046.codfw.wmnet with OS bullseye
2023-08-23 22:31:58 <wikibugs> SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2046.codfw.wmnet with OS bullseye completed: - kubernetes2046 (**PASS*...
2023-08-23 22:32:06 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2045.codfw.wmnet with reason: host reimage
2023-08-23 22:33:28 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2042']
2023-08-23 22:33:36 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2041.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 22:34:49 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2044.codfw.wmnet with reason: host reimage
2023-08-23 22:35:57 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 22:44:16 <wikibugs> ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T344872 (phaultfinder)
2023-08-23 22:44:19 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2041.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 22:46:24 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2042']
2023-08-23 22:46:58 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2041']
2023-08-23 22:47:57 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 22:48:09 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2043.codfw.wmnet with reason: host reimage
2023-08-23 22:49:36 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2042.codfw.wmnet with OS bullseye
2023-08-23 22:49:36 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 22:49:37 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2045.codfw.wmnet with OS bullseye
2023-08-23 22:49:44 <wikibugs> SRE, ops-codfw, DC-Ops, serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2042.codfw.wmnet with OS bullseye
2023-08-23 22:50:53 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host kubernetes2040.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 22:50:55 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 22:51:47 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2043.codfw.wmnet with reason: host reimage
2023-08-23 22:52:05 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 22:52:06 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2044.codfw.wmnet with OS bullseye
2023-08-23 22:52:14 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2044.codfw.wmnet with OS bullseye completed: - kubernetes2044 (**PASS*...'
2023-08-23 22:55:57 <jinxer-wm> (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 22:58:41 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2041']
2023-08-23 22:59:42 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2041.codfw.wmnet with OS bullseye
2023-08-23 22:59:49 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2041.codfw.wmnet with OS bullseye'
2023-08-23 23:00:28 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (''Papaul)'
2023-08-23 23:00:57 <jinxer-wm> (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 23:01:49 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (''Papaul)'
2023-08-23 23:02:11 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.provision (exit_code=0) for host kubernetes2040.mgmt.codfw.wmnet with reboot policy FORCED
2023-08-23 23:07:20 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 23:09:38 <jinxer-wm> (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-08-23 23:09:38 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 23:09:39 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2043.codfw.wmnet with OS bullseye
2023-08-23 23:09:46 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2043.codfw.wmnet with OS bullseye completed: - kubernetes2043 (**PASS*...'
2023-08-23 23:10:11 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['kubernetes2040']
2023-08-23 23:10:24 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2042.codfw.wmnet with reason: host reimage
2023-08-23 23:13:56 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2042.codfw.wmnet with reason: host reimage
2023-08-23 23:14:38 <jinxer-wm> (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
2023-08-23 23:15:57 <jinxer-wm> (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 23:20:48 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2041.codfw.wmnet with reason: host reimage
2023-08-23 23:20:57 <jinxer-wm> (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 23:24:21 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2041.codfw.wmnet with reason: host reimage
2023-08-23 23:25:57 <jinxer-wm> (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 23:29:25 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['kubernetes2040']
2023-08-23 23:29:35 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 23:30:18 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.reimage for host kubernetes2040.codfw.wmnet with OS bullseye
2023-08-23 23:30:27 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by pt1979@cumin2002 for host kubernetes2040.codfw.wmnet with OS bullseye'
2023-08-23 23:30:57 <jinxer-wm> (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 23:34:10 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 23:34:11 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2042.codfw.wmnet with OS bullseye
2023-08-23 23:34:18 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2042.codfw.wmnet with OS bullseye completed: - kubernetes2042 (**PASS*...'
2023-08-23 23:35:57 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 23:40:08 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 23:40:57 <jinxer-wm> (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 23:43:59 <logmsgbot> !log ryankemper@cumin1001 END (FAIL) - Cookbook sre.wdqs.reboot (exit_code=99)
2023-08-23 23:45:35 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.hosts.reimage: Host reimage - pt1979@cumin2002"
2023-08-23 23:45:36 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubernetes2041.codfw.wmnet with OS bullseye
2023-08-23 23:45:42 <wikibugs> 'SRE, ''ops-codfw, ''DC-Ops, ''serviceops: Q1:rack/setup/install kubernetes20[25-54] - https://phabricator.wikimedia.org/T342534 (''ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by pt1979@cumin2002 for host kubernetes2041.codfw.wmnet with OS bullseye completed: - kubernetes2041 (**PASS*...'
2023-08-23 23:45:57 <jinxer-wm> (RedisMemoryFull) firing: (8) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 23:50:57 <jinxer-wm> (RedisMemoryFull) firing: (7) Redis memory full on rdb1011:16378 - https://wikitech.wikimedia.org/wiki/Redis#Cluster_redis_misc - https://alerts.wikimedia.org/?q=alertname%3DRedisMemoryFull
2023-08-23 23:51:22 <logmsgbot> !log pt1979@cumin2002 START - Cookbook sre.hosts.downtime for 2:00:00 on kubernetes2040.codfw.wmnet with reason: host reimage
2023-08-23 23:52:48 <wikibugs> ('PS1) ''Dduvall: P:gitlab::runner: Do not schedule untagged jobs on WMCS [puppet] - ''https://gerrit.wikimedia.org/r/952017 (https://phabricator.wikimedia.org/T344874)'
2023-08-23 23:54:56 <logmsgbot> !log pt1979@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubernetes2040.codfw.wmnet with reason: host reimage

This page is generated from SQL logs; you can also download static txt files from here.