[00:00:39] <wikibugs>	 (Abandoned) Dzahn: peopleweb: add monitor for people.wikimedia.org [puppet] - https://gerrit.wikimedia.org/r/900741 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[00:02:10] <wikibugs>	 (CR) Dzahn: [C: +2] "both probes working per https://thanos.wikimedia.org/graph?g0.expr=probe_success%7Binstance%3D~%22.*etherpad.*%22%7D&g0.tab=1&g0.stacked=0" [puppet] - https://gerrit.wikimedia.org/r/902783 (https://phabricator.wikimedia.org/T329587) (owner: Dzahn)
[00:03:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on wdqs cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wdqs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:03:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (15) statograph_post.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:08:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (15) statograph_post.service Failed on alert1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:18:14] <wikibugs>	 SRE, LDAP-Access-Requests, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (KFrancis) I am confirming the NDA has been signed.  Please proceed with the access request.  Thanks!
[00:26:36] <wikibugs>	 SRE, SRE-swift-storage, ops-eqiad, DC-Ops: Q3:rack/setup/install ms-fe1013 - ms-fe1014, thanos-fe1004 - https://phabricator.wikimedia.org/T326846 (wiki_willy) @Cmjohnson & @Papaul - can you guys provide an ETR on this one?  Thanks, Willy
[00:33:26] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on wdqs cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wdqs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:47:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on ganeti cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=ganeti - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:51:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[00:51:59] <wikibugs>	 SRE, SRE-Access-Requests: Grant Hal deployment rights - https://phabricator.wikimedia.org/T331647 (Htriedman) @Dzahn Thank you so much for the help explaining this! Makes a ton of sense, and I'll create that ticket soon.  @Ottomata Unfortunately I'm trying to get something from hdfs and publish it to `/s...
[01:02:39] <icinga-wm>	 PROBLEM - Check systemd state on logstash1011 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:03:19] <icinga-wm>	 PROBLEM - OpenSearch health check for shards on 9200 on logstash1011 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f0d9ddda280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi
[01:03:19] <icinga-wm>	 org/wiki/Search%23Administration
[01:03:54] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:07:33] <wikibugs>	 ops-codfw: Inbound interface errors - https://phabricator.wikimedia.org/T331373 (phaultfinder)
[01:12:05] <icinga-wm>	 RECOVERY - Check systemd state on logstash1011 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:12:43] <icinga-wm>	 RECOVERY - OpenSearch health check for shards on 9200 on logstash1011 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: yellow, timed_out: False, number_of_nodes: 18, number_of_data_nodes: 12, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 607, active_shards: 1330, relocating_shards: 1, initializing_shards: 3, unassigned_shards: 69, delayed_unassigned_sh
[01:12:43] <icinga-wm>	  number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 94.8644793152639 https://wikitech.wikimedia.org/wiki/Search%23Administration
[01:13:49] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:21:26] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[01:22:26] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on ganeti cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=ganeti - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[01:24:15] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:43:54] <wikibugs>	 (PS2) Krinkle: mc: Remove unused $wgWANObjectCaches and $wgMainWANCache [mediawiki-config] - https://gerrit.wikimedia.org/r/889246 (https://phabricator.wikimedia.org/T329680)
[01:45:00] * Krinkle testing on mwdebug2001
[01:45:05] <wikibugs>	 (CR) Krinkle: [C: +2] mc: Remove unused $wgWANObjectCaches and $wgMainWANCache [mediawiki-config] - https://gerrit.wikimedia.org/r/889246 (https://phabricator.wikimedia.org/T329680) (owner: Krinkle)
[01:45:48] <wikibugs>	 (Merged) jenkins-bot: mc: Remove unused $wgWANObjectCaches and $wgMainWANCache [mediawiki-config] - https://gerrit.wikimedia.org/r/889246 (https://phabricator.wikimedia.org/T329680) (owner: Krinkle)
[01:50:20] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, and 2 others: Serve production traffic via Kubernetes - https://phabricator.wikimedia.org/T290536 (Quiddity)
[01:52:26] <wikibugs>	 (CR) Krinkle: [C: +2] "Test plan: Make an edit on test2.wikipedia via WikimediaDebug with verbose logging enabled. Then, confirm in Logstash that for the given w" [mediawiki-config] - https://gerrit.wikimedia.org/r/889246 (https://phabricator.wikimedia.org/T329680) (owner: Krinkle)
[01:58:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (16) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[01:59:16] <logmsgbot>	 !log krinkle@deploy2002 Synchronized wmf-config/mc.php: I44edcd46da45b827d (duration: 06m 33s)
[02:00:05] <jouncebot>	 Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T0200)
[02:03:39] <jinxer-wm>	 (SystemdUnitFailed) firing: (60) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[02:03:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:06:45] <jinxer-wm>	 (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:07:43] <wikibugs>	 (PS1) TrainBranchBot: Branch commit for wmf/1.41.0-wmf.2 [core] (wmf/1.41.0-wmf.2) - https://gerrit.wikimedia.org/r/902622 (https://phabricator.wikimedia.org/T330208)
[02:07:49] <wikibugs>	 (CR) TrainBranchBot: [C: +2] Branch commit for wmf/1.41.0-wmf.2 [core] (wmf/1.41.0-wmf.2) - https://gerrit.wikimedia.org/r/902622 (https://phabricator.wikimedia.org/T330208) (owner: TrainBranchBot)
[02:08:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST pods) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[02:18:39] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 386.1k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[02:22:19] <wikibugs>	 (CR) CI reject: [V: -1] Branch commit for wmf/1.41.0-wmf.2 [core] (wmf/1.41.0-wmf.2) - https://gerrit.wikimedia.org/r/902622 (https://phabricator.wikimedia.org/T330208) (owner: TrainBranchBot)
[02:25:16] <wikibugs>	 SRE, MW-on-K8s, Traffic, serviceops, Patch-For-Review: Find a sensible way to direct traffic to mw-on-k8s - https://phabricator.wikimedia.org/T331318 (Krinkle)
[02:26:45] <jinxer-wm>	 (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:29:21] <wikibugs>	 (PS1) Andrew Bogott: Trove: adjust timeouts yet again [puppet] - https://gerrit.wikimedia.org/r/903361
[02:30:26] <wikibugs>	 (CR) Andrew Bogott: [C: +2] Trove: adjust timeouts yet again [puppet] - https://gerrit.wikimedia.org/r/903361 (owner: Andrew Bogott)
[02:47:21] <icinga-wm>	 RECOVERY - confd service on an-worker1132 is OK: OK - confd is active https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[02:48:52] <wikibugs>	 SRE, SRE-Access-Requests, Lift-Wing, Machine-Learning-Team, Patch-For-Review: Machine Learning team -  k8s resources access - https://phabricator.wikimedia.org/T333174 (Ladsgroup)
[02:49:11] <wikibugs>	 SRE, SRE-Access-Requests, Lift-Wing, Machine-Learning-Team, Patch-For-Review: Machine Learning team -  k8s resources access - https://phabricator.wikimedia.org/T333174 (Ladsgroup)
[02:53:05] <icinga-wm>	 PROBLEM - confd service on an-worker1132 is CRITICAL: CRITICAL - Expecting active but unit confd is activating https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[03:00:04] <jouncebot>	 Deploy window Automatic deployment of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T0300)
[03:03:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:05:25] <icinga-wm>	 PROBLEM - Check systemd state on deploy2002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[03:22:21] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (10) Elasticsearch instance elastic1055-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[03:59:23] <icinga-wm>	 PROBLEM - dump of db_inventory in codfw on backupmon1001 is CRITICAL: Last dump for db_inventory at codfw (db2185) taken on 2023-03-28 03:47:53 is 108 KiB, but the previous one was 92 KiB, a change of +17.0 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[04:13:43] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:14:29] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:19:19] <icinga-wm>	 PROBLEM - dump of db_inventory in eqiad on backupmon1001 is CRITICAL: Last dump for db_inventory at eqiad (db1115) taken on 2023-03-28 03:52:10 is 107 KiB, but the previous one was 91 KiB, a change of +17.5 % https://wikitech.wikimedia.org/wiki/MariaDB/Backups%23Rerun_a_failed_backup
[04:20:07] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.400 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[04:21:17] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.089 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[05:11:54] <wikibugs>	 (PS1) Marostegui: db1179: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/903364 (https://phabricator.wikimedia.org/T332292)
[05:13:53] <wikibugs>	 (CR) Marostegui: [C: +2] db1179: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/903364 (https://phabricator.wikimedia.org/T332292) (owner: Marostegui)
[05:15:39] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 1%: Repooling', diff saved to https://phabricator.wikimedia.org/P45955 and previous config saved to /var/cache/conftool/dbconfig/20230328-051539-root.json
[05:28:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[05:30:44] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 2%: Repooling', diff saved to https://phabricator.wikimedia.org/P45956 and previous config saved to /var/cache/conftool/dbconfig/20230328-053043-root.json
[05:40:41] <wikibugs>	 (CR) Giuseppe Lavagetto: [C: +2] charts: upgrade to mesh 1.1 [deployment-charts] - https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983) (owner: Giuseppe Lavagetto)
[05:45:25] <wikibugs>	 (Merged) jenkins-bot: charts: upgrade to mesh 1.1 [deployment-charts] - https://gerrit.wikimedia.org/r/901769 (https://phabricator.wikimedia.org/T287983) (owner: Giuseppe Lavagetto)
[05:45:49] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 3%: Repooling', diff saved to https://phabricator.wikimedia.org/P45957 and previous config saved to /var/cache/conftool/dbconfig/20230328-054548-root.json
[05:53:20] <logmsgbot>	 !log oblivian@deploy2002 helmfile [eqiad] START helmfile.d/services/mw-debug: apply
[05:53:57] <logmsgbot>	 !log oblivian@deploy2002 helmfile [eqiad] DONE helmfile.d/services/mw-debug: apply
[05:55:28] <logmsgbot>	 !log oblivian@deploy2002 helmfile [codfw] START helmfile.d/services/mw-debug: apply
[05:55:55] <logmsgbot>	 !log oblivian@deploy2002 helmfile [codfw] DONE helmfile.d/services/mw-debug: apply
[06:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T0600)
[06:00:05] <jouncebot>	 kormat, marostegui, and Amir1: OwO what's this, a deployment window?? Primary database switchover. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T0600). nyaa~
[06:00:35] <_joe_>	 jouncebot: next
[06:00:35] <jouncebot>	 In 0 hour(s) and 59 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T0700)
[06:00:54] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 4%: Repooling', diff saved to https://phabricator.wikimedia.org/P45958 and previous config saved to /var/cache/conftool/dbconfig/20230328-060053-root.json
[06:14:42] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1104 T329481', diff saved to https://phabricator.wikimedia.org/P45959 and previous config saved to /var/cache/conftool/dbconfig/20230328-061441-root.json
[06:14:48] <stashbot>	 T329481: decommission db1104.eqiad.wmnet - https://phabricator.wikimedia.org/T329481
[06:15:45] <wikibugs>	 (PS1) Marostegui: db1104: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/903537 (https://phabricator.wikimedia.org/T329481)
[06:15:58] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 5%: Repooling', diff saved to https://phabricator.wikimedia.org/P45960 and previous config saved to /var/cache/conftool/dbconfig/20230328-061558-root.json
[06:16:11] <wikibugs>	 (CR) Marostegui: [C: +2] db1104: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/903537 (https://phabricator.wikimedia.org/T329481) (owner: Marostegui)
[06:18:39] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 310.2k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[06:28:26] <wikibugs>	 (CR) Abijeet Patro: "This change is ready for review." [puppet] - https://gerrit.wikimedia.org/r/903546 (owner: Abijeet Patro)
[06:31:03] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 10%: Repooling', diff saved to https://phabricator.wikimedia.org/P45961 and previous config saved to /var/cache/conftool/dbconfig/20230328-063103-root.json
[06:31:41] <icinga-wm>	 RECOVERY - Check systemd state on db1101 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:46:08] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 25%: Repooling', diff saved to https://phabricator.wikimedia.org/P45962 and previous config saved to /var/cache/conftool/dbconfig/20230328-064607-root.json
[06:51:14] <wikibugs>	 SRE-Sprint-Week-Sustainability-March2023, Traffic, envoy, serviceops, Sustainability (Incident Followup): Raw "upstream connect error or disconnect/reset before headers. reset reason: overflow" error message shown to users during outage - https://phabricator.wikimedia.org/T287983 (Joe) Ope...
[06:51:24] <wikibugs>	 (PS15) Ayounsi: Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649)
[06:56:29] <wikibugs>	 (PS16) Ayounsi: Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649)
[06:57:32] <wikibugs>	 (CR) JMeybohm: [C: +2] k8s: Use storage-driver instead of storage_driver [puppet] - https://gerrit.wikimedia.org/r/903329 (https://phabricator.wikimedia.org/T332803) (owner: Ahmon Dancy)
[06:58:53] <wikibugs>	 (CR) Jforrester: "Filed the merge failure as T333291." [core] (wmf/1.41.0-wmf.2) - https://gerrit.wikimedia.org/r/902622 (https://phabricator.wikimedia.org/T330208) (owner: TrainBranchBot)
[07:00:05] <jouncebot>	 Amir1, Urbanecm, and taavi: I, the Bot under the Fountain, call upon thee, The Deployer, to do UTC morning backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T0700).
[07:00:05] <jouncebot>	 kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[07:00:13] <taavi>	 o/
[07:00:16] <taavi>	 kart_: want to self-deploy?
[07:01:13] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 50%: Repooling', diff saved to https://phabricator.wikimedia.org/P45963 and previous config saved to /var/cache/conftool/dbconfig/20230328-070112-root.json
[07:02:45] * kart_ is here
[07:02:55] <kart_>	 taavi: I can self deploy this.
[07:03:24] <wikibugs>	 (PS2) KartikMistry: Enable Section Translation on some wikis while Content Translation remains in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/903003 (https://phabricator.wikimedia.org/T308834)
[07:03:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (14) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:06:39] <wikibugs>	 (PS17) Ayounsi: Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649)
[07:07:36] <wikibugs>	 (CR) TrainBranchBot: [C: +2] "Approved by kartik@deploy2002 using scap backport" [mediawiki-config] - https://gerrit.wikimedia.org/r/903003 (https://phabricator.wikimedia.org/T308834) (owner: KartikMistry)
[07:08:21] <wikibugs>	 (Merged) jenkins-bot: Enable Section Translation on some wikis while Content Translation remains in beta [mediawiki-config] - https://gerrit.wikimedia.org/r/903003 (https://phabricator.wikimedia.org/T308834) (owner: KartikMistry)
[07:08:43] <logmsgbot>	 !log kartik@deploy2002 Started scap: Backport for [[gerrit:903003|Enable Section Translation on some wikis while Content Translation remains in beta (T308834)]]
[07:08:49] <stashbot>	 T308834: Enable Section Translation on some wikis while Content Translation remains in beta - https://phabricator.wikimedia.org/T308834
[07:10:51] <logmsgbot>	 !log kartik@deploy2002 kartik: Backport for [[gerrit:903003|Enable Section Translation on some wikis while Content Translation remains in beta (T308834)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[07:14:08] <wikibugs>	 (CR) JMeybohm: [C: +1] "LGTM" [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[07:16:17] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 75%: Repooling', diff saved to https://phabricator.wikimedia.org/P45964 and previous config saved to /var/cache/conftool/dbconfig/20230328-071617-root.json
[07:20:26] <wikibugs>	 (PS3) Slyngshede: P:url_downloader send Squid access logs to Logstash [puppet] - https://gerrit.wikimedia.org/r/903265
[07:20:49] <logmsgbot>	 !log kartik@deploy2002 Finished scap: Backport for [[gerrit:903003|Enable Section Translation on some wikis while Content Translation remains in beta (T308834)]] (duration: 12m 05s)
[07:20:54] <stashbot>	 T308834: Enable Section Translation on some wikis while Content Translation remains in beta - https://phabricator.wikimedia.org/T308834
[07:21:58] <kart_>	 taavi: I'm done.
[07:22:10] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40356/console" [puppet] - https://gerrit.wikimedia.org/r/903265 (owner: Slyngshede)
[07:23:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (64) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:27:06] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (10) Elasticsearch instance elastic1055-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[07:27:48] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'clear' for AS: 17806
[07:28:14] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'clear' for AS: 17806
[07:28:28] <wikibugs>	 (PS4) Slyngshede: P:url_downloader send Squid access logs to Logstash [puppet] - https://gerrit.wikimedia.org/r/903265
[07:28:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (64) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[07:31:22] <logmsgbot>	 !log marostegui@cumin1001 dbctl commit (dc=all): 'db1179 (re)pooling @ 100%: Repooling', diff saved to https://phabricator.wikimedia.org/P45965 and previous config saved to /var/cache/conftool/dbconfig/20230328-073122-root.json
[07:31:41] <wikibugs>	 (CR) Slyngshede: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40357/console" [puppet] - https://gerrit.wikimedia.org/r/903265 (owner: Slyngshede)
[07:34:47] <wikibugs>	 (CR) Ayounsi: [C: +2] Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[07:37:32] <wikibugs>	 (CR) Slyngshede: [V: +1] P:url_downloader send Squid access logs to Logstash (2 comments) [puppet] - https://gerrit.wikimedia.org/r/903265 (owner: Slyngshede)
[07:38:02] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] wmnet: move reads to graphite2004 [dns] - https://gerrit.wikimedia.org/r/903185 (https://phabricator.wikimedia.org/T330165) (owner: Filippo Giunchedi)
[07:38:06] <wikibugs>	 (PS2) Filippo Giunchedi: wmnet: move reads to graphite2004 [dns] - https://gerrit.wikimedia.org/r/903185 (https://phabricator.wikimedia.org/T330165)
[07:38:34] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 206.8k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[07:40:09] <godog>	 !log move graphite reads to codfw - T330165
[07:40:19] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] graphite: check graphite2004 [puppet] - https://gerrit.wikimedia.org/r/903206 (https://phabricator.wikimedia.org/T330165) (owner: Filippo Giunchedi)
[07:40:21] <wikibugs>	 (Merged) jenkins-bot: Refactor and centralize BGPpeer config [deployment-charts] - https://gerrit.wikimedia.org/r/887945 (https://phabricator.wikimedia.org/T306649) (owner: Ayounsi)
[07:47:56] <wikibugs>	 SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Marostegui)
[07:50:57] <godog>	 jouncebot: next
[07:50:57] <jouncebot>	 In 2 hour(s) and 9 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1000)
[07:51:53] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[07:51:54] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[07:54:44] <logmsgbot>	 !log root@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[07:54:45] <logmsgbot>	 !log root@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[07:56:13] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[07:56:15] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] START helmfile.d/admin 'apply'.
[07:56:20] <logmsgbot>	 !log jayme@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[07:56:36] <icinga-wm>	 PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale: 1 (gerrit1001), Fresh: 124 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[08:00:06] <godog>	 !log move graphite reads to codfw - T330165
[08:00:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:00:17] <stashbot>	 T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165
[08:00:20] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [staging-codfw] DONE helmfile.d/admin 'apply'.
[08:00:33] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] wmnet: move writes to graphite2004 [dns] - https://gerrit.wikimedia.org/r/903208 (https://phabricator.wikimedia.org/T330165) (owner: Filippo Giunchedi)
[08:00:38] <wikibugs>	 (PS2) Filippo Giunchedi: wmnet: move writes to graphite2004 [dns] - https://gerrit.wikimedia.org/r/903208 (https://phabricator.wikimedia.org/T330165)
[08:00:41] <wikibugs>	 (CR) Filippo Giunchedi: [C: +2] statsd: move writes to graphite2004 [puppet] - https://gerrit.wikimedia.org/r/903207 (https://phabricator.wikimedia.org/T330165) (owner: Filippo Giunchedi)
[08:01:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 254.2k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[08:01:39] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by oblivian@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903209 (https://phabricator.wikimedia.org/T330165) (owner: 10Filippo Giunchedi)
[08:02:25] <wikibugs>	 (03Merged) 10jenkins-bot: Failover statsd to graphite2004 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903209 (https://phabricator.wikimedia.org/T330165) (owner: 10Filippo Giunchedi)
[08:02:26] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [staging-eqiad] START helmfile.d/admin 'apply'.
[08:02:36] <logmsgbot>	 !log oblivian@deploy2002 Started scap: Backport for [[gerrit:903209|Failover statsd to graphite2004 (T330165)]]
[08:03:05] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on es[1020-1022].eqiad.wmnet with reason: Switch maintenance
[08:03:11] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [staging-eqiad] DONE helmfile.d/admin 'apply'.
[08:03:19] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on es[1020-1022].eqiad.wmnet with reason: Switch maintenance
[08:04:11] <logmsgbot>	 !log oblivian@deploy2002 oblivian and filippo: Backport for [[gerrit:903209|Failover statsd to graphite2004 (T330165)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet
[08:04:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Error creating interfaces in netbox-next - https://phabricator.wikimedia.org/T333292 (10cmooney) p:05Triage→03Medium
[08:05:11] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on 21 hosts with reason: Switch maintenance
[08:05:27] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 21 hosts with reason: Switch maintenance
[08:05:53] <logmsgbot>	 !log marostegui@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on 16 hosts with reason: Switch maintenance
[08:06:15] <logmsgbot>	 !log marostegui@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 8:00:00 on 16 hosts with reason: Switch maintenance
[08:06:47] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Error creating interfaces in netbox-next - https://phabricator.wikimedia.org/T333292 (10cmooney)
[08:08:34] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [ml-staging-codfw] START helmfile.d/admin 'apply'.
[08:08:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (16) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:08:46] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) @papaul looks good to me.  I can do them any day this week except today (Tuesday), so whenever...
[08:09:03] <_joe_>	 godog: php restarts happening
[08:09:12] <_joe_>	 you should see the traffic shifting
[08:09:16] <godog>	 _joe_: ok! thank you
[08:09:30] <godog>	 I'm looking at this guy https://grafana.wikimedia.org/d/000000337/graphite-codfw?orgId=1&refresh=1m&from=1679987363327&to=1679990963327&viewPanel=14
[08:11:25] <logmsgbot>	 !log oblivian@deploy2002 Finished scap: Backport for [[gerrit:903209|Failover statsd to graphite2004 (T330165)]] (duration: 08m 48s)
[08:11:30] <stashbot>	 T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165
[08:12:50] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] P:kubernetes::node: Use performance governor [puppet] - 10https://gerrit.wikimedia.org/r/902119 (https://phabricator.wikimedia.org/T332788) (owner: 10Clément Goubert)
[08:13:17] <wikibugs>	 (03PS6) 10Clément Goubert: P:kubernetes::node: Use performance governor [puppet] - 10https://gerrit.wikimedia.org/r/902119 (https://phabricator.wikimedia.org/T332788)
[08:13:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (61) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:14:38] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'apply'.
[08:18:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (61) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[08:21:13] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [aux-k8s-eqiad] START helmfile.d/admin 'apply'.
[08:21:26] <wikibugs>	 (03PS1) 10Stevemunene: Deprecate oozie services on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/903594 (https://phabricator.wikimedia.org/T333295)
[08:24:02] <wikibugs>	 10SRE, 10SRE-Access-Requests: Update SSH key for abi - https://phabricator.wikimedia.org/T333298 (10abi_)
[08:24:39] <wikibugs>	 (03CR) 10LSobanski: [C: 03+1] releases-jenkins: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[08:25:30] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [aux-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[08:26:20] <wikibugs>	 (03PS1) 10Slyngshede: sre.ganeti.makevm: run sync-netbox-hiera after creation [cookbooks] - 10https://gerrit.wikimedia.org/r/903596
[08:27:33] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: requestctl: fix default path for the git repo [software/conftool] - 10https://gerrit.wikimedia.org/r/903597
[08:27:35] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add a ConftoolClient class to ease initialization by clients [software/conftool] - 10https://gerrit.wikimedia.org/r/903598
[08:27:37] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Support urllib 2.x [software/conftool] - 10https://gerrit.wikimedia.org/r/903599
[08:27:39] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Release 2.3.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/903600
[08:28:36] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.ganeti.makevm: run sync-netbox-hiera after creation [cookbooks] - 10https://gerrit.wikimedia.org/r/903596 (owner: 10Slyngshede)
[08:29:19] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40359/console" [puppet] - 10https://gerrit.wikimedia.org/r/903594 (https://phabricator.wikimedia.org/T333295) (owner: 10Stevemunene)
[08:29:25] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [ml-serve-codfw] START helmfile.d/admin 'apply'.
[08:30:27] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Support urllib 2.x [software/conftool] - 10https://gerrit.wikimedia.org/r/903599 (owner: 10Giuseppe Lavagetto)
[08:30:35] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add a ConftoolClient class to ease initialization by clients [software/conftool] - 10https://gerrit.wikimedia.org/r/903598 (owner: 10Giuseppe Lavagetto)
[08:30:37] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Release 2.3.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/903600 (owner: 10Giuseppe Lavagetto)
[08:31:26] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'apply'.
[08:31:27] <wikibugs>	 (03PS3) 10Abijeet Patro: Update SSH key for abi [puppet] - 10https://gerrit.wikimedia.org/r/903546 (https://phabricator.wikimedia.org/T333298)
[08:32:09] <logmsgbot>	 !log phedenskog@deploy2002 Started deploy [performance/navtiming@e757bdf]: (no justification provided)
[08:32:11] <wikibugs>	 (03PS2) 10Slyngshede: sre.ganeti.makevm: run sync-netbox-hiera after creation [cookbooks] - 10https://gerrit.wikimedia.org/r/903596
[08:32:15] <logmsgbot>	 !log phedenskog@deploy2002 Finished deploy [performance/navtiming@e757bdf]: (no justification provided) (duration: 00m 06s)
[08:32:21] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] requestctl: fix default path for the git repo [software/conftool] - 10https://gerrit.wikimedia.org/r/903597 (owner: 10Giuseppe Lavagetto)
[08:32:25] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'apply'.
[08:32:29] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Update SSH key for abi - https://phabricator.wikimedia.org/T333298 (10Nikerabbit) I approve. Though, this should be just a key update.
[08:32:49] <wikibugs>	 (03CR) 10Hashar: releases-jenkins: replace Icinga with Prometheus monitoring (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[08:34:20] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'apply'.
[08:35:17] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'.
[08:35:26] <wikibugs>	 (03Merged) 10jenkins-bot: requestctl: fix default path for the git repo [software/conftool] - 10https://gerrit.wikimedia.org/r/903597 (owner: 10Giuseppe Lavagetto)
[08:35:54] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Error creating interfaces in netbox-next - https://phabricator.wikimedia.org/T333292 (10cmooney) p:05Medium→03Low Had a quick chat with @ayounsi on irc about this, seems it's related to some of the validation scripts, should be easy to fix.
[08:36:09] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10MatthewVernon) The firmware update cookbook does offer a firmware update; I was going to apply it once the disks were swapped (as rebooting the system with drives in a funny state...
[08:36:43] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Many thanks. Looks good to me." [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[08:37:08] <wikibugs>	 10SRE, 10Data-Persistence, 10Traffic-Icebox, 10serviceops: Audit and harmonize timeouts across the stack - https://phabricator.wikimedia.org/T250251 (10Marostegui)
[08:37:37] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'.
[08:37:58] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10aborrero) >>! In T327919#8732605, @cmooney wrote: >  > @aborrero are we ok to proceed with this second...
[08:38:18] <wikibugs>	 (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert)
[08:39:28] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Also looks good to me, thanks." [puppet] - 10https://gerrit.wikimedia.org/r/902454 (https://phabricator.wikimedia.org/T309007) (owner: 10Cathal Mooney)
[08:39:34] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [codfw] START helmfile.d/admin 'apply'.
[08:39:57] <wikibugs>	 10ops-codfw, 10serviceops-collab, 10GitLab (Infrastructure): Install additional SSDs on gitlab2003.wikimedia.org (B5) - https://phabricator.wikimedia.org/T333304 (10Jelto)
[08:40:16] <wikibugs>	 (03PS2) 10Clément Goubert: P:services_proxy::envoy: Add mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120)
[08:40:42] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Great stuff, thanks." [alerts] - 10https://gerrit.wikimedia.org/r/902694 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney)
[08:41:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 203.1k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[08:41:06] <icinga-wm>	 PROBLEM - BGP status on cr2-codfw is CRITICAL: BGP CRITICAL - AS64602/IPv6: Active - kubernetes-codfw, AS64602/IPv4: Active - kubernetes-codfw https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[08:42:48] <logmsgbot>	 !log aborrero@cumin2002 START - Cookbook sre.hosts.reboot-single for host cloudservices2005-dev.wikimedia.org
[08:43:43] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[08:44:17] <wikibugs>	 (03PS1) 10Marostegui: orchestrator.conf.json.erb: Replace sanitarium hosts [puppet] - 10https://gerrit.wikimedia.org/r/903602 (https://phabricator.wikimedia.org/T326596)
[08:45:17] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[08:45:23] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 04-1] P:services_proxy::envoy: Add mw-api-int (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert)
[08:46:30] <wikibugs>	 (03CR) 10Clément Goubert: P:services_proxy::envoy: Add mw-api-int (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert)
[08:48:01] <wikibugs>	 (03CR) 10MVernon: [C: 03+1] orchestrator.conf.json.erb: Replace sanitarium hosts [puppet] - 10https://gerrit.wikimedia.org/r/903602 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui)
[08:48:12] <wikibugs>	 (03CR) 10Marostegui: [C: 03+2] orchestrator.conf.json.erb: Replace sanitarium hosts [puppet] - 10https://gerrit.wikimedia.org/r/903602 (https://phabricator.wikimedia.org/T326596) (owner: 10Marostegui)
[08:48:34] <wikibugs>	 (03PS3) 10Clément Goubert: P:services_proxy::envoy: Add mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120)
[08:49:12] <logmsgbot>	 !log ayounsi@deploy1002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[08:50:56] <logmsgbot>	 !log aborrero@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host cloudservices2005-dev.wikimedia.org
[08:52:09] <wikibugs>	 (03CR) 10Clément Goubert: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/903603 (owner: 10Clément Goubert)
[08:55:33] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) >>! In T296832#8729881, @Volans wrote: > Looks ok to me too, I'm not sure about all the details involved if w...
[08:57:10] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] P:docker::prune_old_images: Fix type [puppet] - 10https://gerrit.wikimedia.org/r/903603 (owner: 10Clément Goubert)
[08:58:29] <vgutierrez>	 !log restart ipmiseld on cp2035
[08:58:33] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:59:50] <wikibugs>	 (03CR) 10Jelto: [C: 03+2] "new disks arrived, merging the new partman config" [puppet] - 10https://gerrit.wikimedia.org/r/898791 (https://phabricator.wikimedia.org/T330172) (owner: 10Jelto)
[09:01:16] <wikibugs>	 (03CR) 10Btullis: "Looks good. I have one query about another couple of alerts that we might be able to remove, but I couldn't find them." [puppet] - 10https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney)
[09:03:17] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netbox, 10netops, 10Patch-For-Review: Represent sub-interface and bridge device assocations in Netbox - https://phabricator.wikimedia.org/T296832 (10cmooney) >>! In T296832#8729881, @Volans wrote: > Looks ok to me too, I'm not sure about all the details involved if w...
[09:03:38] <wikibugs>	 10ops-codfw, 10Traffic: cp2035 IPMI and management console issues - https://phabricator.wikimedia.org/T333312 (10Vgutierrez)
[09:03:49] <wikibugs>	 (03PS4) 10Clément Goubert: service_catalog: Add mw-api-int k8s service [puppet] - 10https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120)
[09:04:27] <wikibugs>	 10ops-codfw, 10Traffic: cp2035 IPMI and management console issues - https://phabricator.wikimedia.org/T333312 (10Vgutierrez) p:05Triage→03Medium
[09:04:40] <wikibugs>	 (03PS4) 10Clément Goubert: P:services_proxy::envoy: Add mw-api-int [puppet] - 10https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120)
[09:05:11] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] "I like the idea thanks! Let's see what Janis thinks about it :)" [deployment-charts] - 10https://gerrit.wikimedia.org/r/903297 (https://phabricator.wikimedia.org/T333174) (owner: 10Alexandros Kosiaris)
[09:06:19] <wikibugs>	 (03CR) 10Clément Goubert: P:services_proxy::envoy: Add mw-api-int (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert)
[09:06:37] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+1] "I think that's the right way of doing it" [deployment-charts] - 10https://gerrit.wikimedia.org/r/903297 (https://phabricator.wikimedia.org/T333174) (owner: 10Alexandros Kosiaris)
[09:09:44] <wikibugs>	 (03CR) 10Clément Goubert: [V: 03+1] "PCC SUCCESS (DIFF 17 NOOP 7): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40361/console" [puppet] - 10https://gerrit.wikimedia.org/r/903595 (https://phabricator.wikimedia.org/T333120) (owner: 10Clément Goubert)
[09:10:59] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.idm.logout Logging Nicolas Fraison out of systemdlogoutd on: 2048 hosts
[09:11:08] <logmsgbot>	 !log jbond@cumin1001 END (ERROR) - Cookbook sre.idm.logout (exit_code=97) Logging Nicolas Fraison out of systemdlogoutd on: 2048 hosts
[09:11:44] <logmsgbot>	 !log jbond@cumin1001 START - Cookbook sre.idm.logout Logging Nicolas Fraison out of all services on: 2048 hosts
[09:12:23] <logmsgbot>	 !log jbond@cumin1001 END (PASS) - Cookbook sre.idm.logout (exit_code=0) Logging Nicolas Fraison out of all services on: 2048 hosts
[09:13:42] <wikibugs>	 (03CR) 10Jaime Nuche: Revert "deployment_server: ensure Docker is installed" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903200 (owner: 10Dzahn)
[09:13:47] <wikibugs>	 (03CR) 10Btullis: [C: 03+1] "Looks good. We know that the oozie profile is also added by the Hue UI role, but that is being deprecated with oozie anyway, so +1." [puppet] - 10https://gerrit.wikimedia.org/r/903594 (https://phabricator.wikimedia.org/T333295) (owner: 10Stevemunene)
[09:15:09] <wikibugs>	 (03PS1) 10Jaime Nuche: deployment_server: ensure Docker is installed [puppet] - 10https://gerrit.wikimedia.org/r/903605 (https://phabricator.wikimedia.org/T329622)
[09:20:16] <wikibugs>	 (03CR) 10Volans: [C: 04-1] "Missing some bits" [cookbooks] - 10https://gerrit.wikimedia.org/r/903596 (owner: 10Slyngshede)
[09:20:39] <wikibugs>	 (03PS1) 10Vgutierrez: admin: Remove shared SSH key with WMCS for trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/903606
[09:20:48] <wikibugs>	 (03CR) 10David Caro: [C: 04-1] "Half-refactor and "having to get on a plane let's push to save progress" kinda patch" [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: 10David Caro)
[09:21:30] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/903606 (owner: 10Vgutierrez)
[09:21:41] <wikibugs>	 (03CR) 10Jaime Nuche: "PCC: https://puppet-compiler.wmflabs.org/output/903605/40362/" [puppet] - 10https://gerrit.wikimedia.org/r/903605 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche)
[09:21:57] <wikibugs>	 (03CR) 10Vgutierrez: [C: 03+2] admin: Remove shared SSH key with WMCS for trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/903606 (owner: 10Vgutierrez)
[09:22:43] <wikibugs>	 (03CR) 10Hashar: "Marking my comment about using ECS as solved after https://gerrit.wikimedia.org/r/c/operations/puppet/+/903239" [puppet] - 10https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney)
[09:26:01] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10jbond) 05Resolved→03Open @Trokhymovych We have noticed that you have started to use your production key in WMCS.  as a precaution [[ https://gerrit.wikimedia.org/r/c/operations...
[09:26:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (10cmooney) p:05Triage→03Medium
[09:26:51] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (10cmooney)
[09:26:56] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney)
[09:27:14] <wikibugs>	 (03PS1) 10Filippo Giunchedi: alertmanager: group alerts by team too [puppet] - 10https://gerrit.wikimedia.org/r/903607 (https://phabricator.wikimedia.org/T332709)
[09:28:10] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main1001.eqiad.wmnet with reason: stop kafka and dist-upgrade
[09:28:31] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10BTullis) Ah, thanks @Dzahn - I think the reason for these leftover processes is the kerberos automatic tickets renewal mechanism that I put in place in {T268985} It enables 'linger...
[09:28:34] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:28:34] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main1001.eqiad.wmnet with reason: stop kafka and dist-upgrade
[09:28:34] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/903328 (https://phabricator.wikimedia.org/T332868) (owner: 10Dzahn)
[09:30:24] <wikibugs>	 (03CR) 10Jaime Nuche: deployment_server: ensure Docker is installed (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/900353 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche)
[09:31:31] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] alertmanager: group alerts by team too [puppet] - 10https://gerrit.wikimedia.org/r/903607 (https://phabricator.wikimedia.org/T332709) (owner: 10Filippo Giunchedi)
[09:33:23] <wikibugs>	 (03CR) 10Jaime Nuche: "Thanks for bearing with me on this Daniel." [puppet] - 10https://gerrit.wikimedia.org/r/903605 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche)
[09:34:11] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] alertmanager: group alerts by team too [puppet] - 10https://gerrit.wikimedia.org/r/903607 (https://phabricator.wikimedia.org/T332709) (owner: 10Filippo Giunchedi)
[09:34:18] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond) >>! In T333135#8732974, @BTullis wrote: > Ah, thanks @Dzahn - I think the reason for these leftover processes is the kerberos automatic tickets renewal mechanism that I put...
[09:34:32] <icinga-wm>	 PROBLEM - Check systemd state on cp2035 is CRITICAL: CRITICAL - degraded: The following units failed: ipmiseld.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:35:02] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (10cmooney) Logs from switch during operation: `lines=20 Mar 28 09:28:50  cloudsw1-b1-codfw sshd[11342]: WARNING: could not open /etc/ssh/moduli...
[09:35:10] <wikibugs>	 (03PS1) 10Btullis: Disable the gobblin timers temporarily for switch maintenance [puppet] - 10https://gerrit.wikimedia.org/r/903610 (https://phabricator.wikimedia.org/T330165)
[09:35:32] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond) 05Resolved→03In progress
[09:35:40] <vgutierrez>	 !log depool cp2035 - T333312
[09:35:45] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:46] <stashbot>	 T333312: cp2035 IPMI and management console issues - https://phabricator.wikimedia.org/T333312
[09:36:05] <godog>	 !log silence systemdunitfailed alerts for team=wmcs - T333315
[09:36:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:36:10] <stashbot>	 T333315: WMCS: hundred of phabricator tickets were created for some alerts - https://phabricator.wikimedia.org/T333315
[09:36:16] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: Homer unable to commit config to cloudsw1-b1-codfw (QFX5120 21.4R3.16) - https://phabricator.wikimedia.org/T333316 (10cmooney)
[09:36:57] <wikibugs>	 (03CR) 10Btullis: "This is to be deployed at around 12:50 UTC, in order to pause ingestion to HDFS." [puppet] - 10https://gerrit.wikimedia.org/r/903610 (https://phabricator.wikimedia.org/T330165) (owner: 10Btullis)
[09:38:31] <elukey>	 !log dist-upgrade kafka-main1001 to bullseye - T332013
[09:38:35] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:38:36] <stashbot>	 T332013: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013
[09:41:04] <vgutierrez>	 !log resetting cp2035 management card - T333312
[09:41:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:10] <stashbot>	 T333312: cp2035 IPMI and management console issues - https://phabricator.wikimedia.org/T333312
[09:42:11] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10Trokhymovych) @jbond New Public SSH key: ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFCyl+eu4X9cI/XT6nCSvud+X6LJyVV7Rcr1g4MnP2xf trokhymovych.mykola@gmail.com
[09:43:51] <wikibugs>	 (03CR) 10Stevemunene: [V: 03+1 C: 03+2] Deprecate oozie services on bullseye [puppet] - 10https://gerrit.wikimedia.org/r/903594 (https://phabricator.wikimedia.org/T333295) (owner: 10Stevemunene)
[09:45:16] <wikibugs>	 10SRE, 10ops-codfw, 10Traffic: cp2035 IPMI and management console issues - https://phabricator.wikimedia.org/T333312 (10Vgutierrez) Unable to reset the management card: ` root@cp2035:~# bmc-device --cold-reset; echo $? ipmi_cmd_cold_reset: driver timeout 1 `
[09:45:26] <logmsgbot>	 !log vgutierrez@cumin1001 START - Cookbook sre.hosts.downtime for 2 days, 0:00:00 on cp2035.codfw.wmnet with reason: HW issues
[09:45:41] <logmsgbot>	 !log vgutierrez@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2 days, 0:00:00 on cp2035.codfw.wmnet with reason: HW issues
[09:45:48] <icinga-wm>	 RECOVERY - Check systemd state on cp2035 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[09:46:04] <wikibugs>	 (03PS5) 10Jbond: P:url_downloader send Squid access logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede)
[09:46:12] <wikibugs>	 10SRE, 10ops-codfw, 10Traffic: cp2035 IPMI and management console issues - https://phabricator.wikimedia.org/T333312 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=07b8190f-1479-43ea-ba98-63f852f30e9e) set by vgutierrez@cumin1001 for 2 days, 0:00:00 on 1 host(s) and their services with r...
[09:46:24] <wikibugs>	 (03CR) 10Jbond: [C: 04-1] "lgtm just a minor change needed" [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede)
[09:46:36] <wikibugs>	 (03PS6) 10JMeybohm: k8s: Remove 1.16 related code [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291)
[09:49:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[09:49:41] <wikibugs>	 (03PS1) 10Jbond: admin: add ssh key for Trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/903612 (https://phabricator.wikimedia.org/T315262)
[09:49:48] <elukey>	 the under-replicated partitions alert is due to kafka-main1001 being upgraded
[09:51:54] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10BTullis)
[09:54:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[09:54:53] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] admin: add ssh key for Trokhymovych [puppet] - 10https://gerrit.wikimedia.org/r/903612 (https://phabricator.wikimedia.org/T315262) (owner: 10Jbond)
[09:55:09] <jinxer-wm>	 (SystemdUnitFailed) firing: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:55:18] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:55:41] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] k8s: Remove 1.16 related code (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899652 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[09:56:15] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.ganeti.reimage for host an-test-client1002.eqiad.wmnet with OS bullseye
[09:56:25] <jinxer-wm>	 (SystemdUnitFailed) firing: (7) train-presync.service Failed on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:56:30] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) planet_sync_tile_generation-gis.service Failed on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:56:51] <jinxer-wm>	 (SystemdUnitFailed) firing: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[09:57:00] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to Analytic Cluster for Trokhymovych - https://phabricator.wikimedia.org/T315262 (10jbond) 05Open→03Resolved >>! In T315262#8733314, @Trokhymovych wrote: > @jbond > New Public SSH key: > ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIFCyl+eu4X9cI/...
[09:57:11] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (10) Elasticsearch instance elastic1055-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[09:59:15] <wikibugs>	 (03PS3) 10EoghanGaffney: Adds php and apache logs for doc machines [puppet] - 10https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245)
[09:59:31] <jinxer-wm>	 (SystemdUnitFailed) firing: kubelet.service Failed on ml-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1000)
[10:01:09] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: temp downgrade systemdunitfailed to warning, exclude wmcs [alerts] - 10https://gerrit.wikimedia.org/r/903613 (https://phabricator.wikimedia.org/T332764)
[10:01:21] <wikibugs>	 (03PS1) 10JMeybohm: Revert "k8s: Remove 1.16 related code" [puppet] - 10https://gerrit.wikimedia.org/r/903548
[10:02:58] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre: temp downgrade systemdunitfailed to warning, exclude wmcs [alerts] - 10https://gerrit.wikimedia.org/r/903613 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[10:03:19] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Revert "k8s: Remove 1.16 related code" [puppet] - 10https://gerrit.wikimedia.org/r/903548 (owner: 10JMeybohm)
[10:04:36] <wikibugs>	 (03PS2) 10JMeybohm: Revert "k8s: Remove 1.16 related code" [puppet] - 10https://gerrit.wikimedia.org/r/903548
[10:04:47] <wikibugs>	 (03PS3) 10JMeybohm: Revert "k8s: Remove 1.16 related code" [puppet] - 10https://gerrit.wikimedia.org/r/903548
[10:07:01] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: thumbor-codfw: Fix indentation of nutcracker servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/903614
[10:07:33] <wikibugs>	 (03CR) 10JMeybohm: [C: 03+2] Revert "k8s: Remove 1.16 related code" [puppet] - 10https://gerrit.wikimedia.org/r/903548 (owner: 10JMeybohm)
[10:10:51] <wikibugs>	 (03PS2) 10Filippo Giunchedi: sre: temp downgrade systemdunitfailed to warning, exclude wmcs [alerts] - 10https://gerrit.wikimedia.org/r/903613 (https://phabricator.wikimedia.org/T332764)
[10:11:19] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] thumbor-codfw: Fix indentation of nutcracker servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/903614 (owner: 10Alexandros Kosiaris)
[10:11:45] <wikibugs>	 (03PS1) 10Jbond: alertmanager: change repeat interval to 1 week for warnings [puppet] - 10https://gerrit.wikimedia.org/r/903615
[10:11:56] <wikibugs>	 (03PS2) 10Volans: remote: add results to RemoteExecutionError [software/spicerack] - 10https://gerrit.wikimedia.org/r/902460
[10:12:08] <wikibugs>	 (03CR) 10Volans: [C: 03+2] remote: add results to RemoteExecutionError [software/spicerack] - 10https://gerrit.wikimedia.org/r/902460 (owner: 10Volans)
[10:12:48] <logmsgbot>	 !log stevemunene@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on an-test-client1002.eqiad.wmnet with reason: host reimage
[10:14:11] <wikibugs>	 (03PS1) 10Elukey: admin_ng: lower the typha pods to 1 in ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/903616 (https://phabricator.wikimedia.org/T333302)
[10:14:31] <jinxer-wm>	 (SystemdUnitFailed) resolved: kubelet.service Failed on ml-staging2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:15:46] <wikibugs>	 (03Merged) 10jenkins-bot: remote: add results to RemoteExecutionError [software/spicerack] - 10https://gerrit.wikimedia.org/r/902460 (owner: 10Volans)
[10:15:59] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 04-1] admin_ng: increase namespace cpu quota for thumbor, increase replicas (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan)
[10:16:01] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Add a ConftoolClient class [software/conftool] - 10https://gerrit.wikimedia.org/r/903598
[10:16:03] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Support urllib 2.x [software/conftool] - 10https://gerrit.wikimedia.org/r/903599
[10:16:06] <wikibugs>	 (03PS2) 10Giuseppe Lavagetto: Release 2.3.0 [software/conftool] - 10https://gerrit.wikimedia.org/r/903600
[10:16:10] <wikibugs>	 (03PS1) 10Giuseppe Lavagetto: Add black formatting and enforcement [software/conftool] - 10https://gerrit.wikimedia.org/r/903617
[10:16:21] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on an-test-client1002.eqiad.wmnet with reason: host reimage
[10:16:58] <wikibugs>	 (03PS2) 10Ladsgroup: admin: add Barakat Ajadi as ldap_only_admin (wmf group) [puppet] - 10https://gerrit.wikimedia.org/r/903328 (https://phabricator.wikimedia.org/T332868) (owner: 10Dzahn)
[10:17:03] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] admin: add Barakat Ajadi as ldap_only_admin (wmf group) [puppet] - 10https://gerrit.wikimedia.org/r/903328 (https://phabricator.wikimedia.org/T332868) (owner: 10Dzahn)
[10:19:48] <wikibugs>	 (03CR) 10Elukey: [C: 03+2] admin_ng: lower the typha pods to 1 in ml-staging-codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/903616 (https://phabricator.wikimedia.org/T333302) (owner: 10Elukey)
[10:21:49] <wikibugs>	 (03PS1) 10Jbond: openldap: drop sre-admins from the list of ops members [puppet] - 10https://gerrit.wikimedia.org/r/903619
[10:22:31] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] openldap: drop sre-admins from the list of ops members [puppet] - 10https://gerrit.wikimedia.org/r/903619 (owner: 10Jbond)
[10:23:15] <wikibugs>	 10ops-codfw: ManagementSSHDown - https://phabricator.wikimedia.org/T333328 (10phaultfinder)
[10:23:29] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) train-presync.service Failed on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:24:24] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[10:24:28] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[10:24:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) train-presync.service Failed on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:26:08] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+1] alertmanager: change repeat interval to 1 week for warnings [puppet] - 10https://gerrit.wikimedia.org/r/903615 (owner: 10Jbond)
[10:27:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:28:17] <jinxer-wm>	 (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[10:28:40] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10Patch-For-Review: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Ladsgroup) 05In progress→03Resolved I added Barakat to wmf ldap group. They should be able to access grafana and such.
[10:28:53] <jynus>	 nel page?
[10:29:02] <akosiaris>	 !ack
[10:29:02] <sirenbot>	 no value provided for parameter incident and no default available
[10:29:02] <sirenbot>	 Incident id must be an integer
[10:29:09] <akosiaris>	 !incidents
[10:29:09] <sirenbot>	 3512 (UNACKED)  NELHigh sre (tcp.timed_out)
[10:29:13] <akosiaris>	 !ack 3512
[10:29:14] <sirenbot>	 3512 (ACKED)  NELHigh sre (tcp.timed_out)
[10:29:27] <jynus>	 tcp timeouts
[10:29:31] <wikibugs>	 (03PS6) 10Hnowlan: admin_ng: increase namespace cpu quota for thumbor, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033)
[10:30:19] <wikibugs>	 (03CR) 10Hnowlan: admin_ng: increase namespace cpu quota for thumbor, increase replicas (033 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan)
[10:30:39] <volans>	 brief spike of timeout from india
[10:31:45] <jayme>	 but nothing persistent it seems
[10:31:52] <volans>	 almost all from a single IP? weird
[10:32:07] <wikibugs>	 (03PS1) 10Btullis: Failover hive services to the standby coordinator [dns] - 10https://gerrit.wikimedia.org/r/903621 (https://phabricator.wikimedia.org/T330165)
[10:32:11] <volans>	 ah no sorry, misreading
[10:32:14] <volans>	 target IP :D
[10:32:16] <volans>	 makes sense
[10:32:17] <jbond>	 volans: could be GNAT restarting 
[10:32:26] <volans>	 upload-eqsin
[10:32:27] <akosiaris>	 CGNAT ? 
[10:32:27] <jbond>	 oh ok yes that makes more sense :)
[10:32:35] <akosiaris>	 upload-eqsin, just 1 ISP
[10:32:48] <volans>	 akosiaris: no that's Other's ISP
[10:32:51] <akosiaris>	 what jbond says is a pretty plausible explanation
[10:32:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:33:02] <akosiaris>	 volans: I see 1 specific one
[10:33:06] <akosiaris>	 not "Other"
[10:33:17] <akosiaris>	 ah, no scratch that
[10:33:17] <jinxer-wm>	 (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh
[10:33:19] <akosiaris>	 you are right
[10:34:14] <wikibugs>	 10SRE, 10Infrastructure Security, 10Infrastructure-Foundations: Offboard nfraison - https://phabricator.wikimedia.org/T333135 (10jbond)
[10:35:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 235.2k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[10:35:09] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Update SSH key for abi - https://phabricator.wikimedia.org/T333298 (10Ladsgroup) a:03Ladsgroup Clinic duty this week, it doesn't need the full process. Let me double check something and get back to you.
[10:36:31] <jinxer-wm>	 (SystemdUnitFailed) firing: kube-controller-manager.service Failed on ml-staging-ctrl2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:38:37] <wikibugs>	 (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] sre: temp downgrade systemdunitfailed to warning, exclude wmcs [alerts] - 10https://gerrit.wikimedia.org/r/903613 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[10:39:14] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+2] Adds php and apache logs for doc machines [puppet] - 10https://gerrit.wikimedia.org/r/900375 (https://phabricator.wikimedia.org/T325245) (owner: 10EoghanGaffney)
[10:41:31] <jinxer-wm>	 (SystemdUnitFailed) resolved: kube-controller-manager.service Failed on ml-staging-ctrl2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:42:10] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Prod-Kubernetes, 10netops, and 2 others: Agree strategy for Kubernetes BGP peering to top-of-rack switches - https://phabricator.wikimedia.org/T306649 (10cmooney) @ayounsi thanks for the response.  Overall I've no objection so let's proceed.  I agree in terms of addin...
[10:43:00] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 45, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:43:30] <icinga-wm>	 PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[10:45:40] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Failover hive services to the standby coordinator [dns] - 10https://gerrit.wikimedia.org/r/903621 (https://phabricator.wikimedia.org/T330165) (owner: 10Btullis)
[10:46:12] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10observability: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210 (10cmooney) Thanks for the task, does indeed look like a useful tool that could simplify adding additional monitoring without having to modify the LibreNMS codeb...
[10:46:29] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10observability: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210 (10cmooney) a:03cmooney
[10:47:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on ganeti cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=ganeti - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[10:48:29] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) train-presync.service Failed on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:49:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (8) train-presync.service Failed on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[10:51:21] <wikibugs>	 (03PS1) 10Ladsgroup: api: Mark query as read-only to avoid regex on SQL [core] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/903549 (https://phabricator.wikimedia.org/T332942)
[10:52:10] <Amir1>	 jouncebot: nowandnext
[10:52:10] <jouncebot>	 For the next 0 hour(s) and 7 minute(s): MediaWiki infrastructure (UTC mid-day) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1000)
[10:52:11] <jouncebot>	 In 2 hour(s) and 7 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1300)
[10:52:11] <jouncebot>	 In 2 hour(s) and 7 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1300)
[10:52:18] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] sre: temp downgrade systemdunitfailed to warning, exclude wmcs [alerts] - 10https://gerrit.wikimedia.org/r/903613 (https://phabricator.wikimedia.org/T332764) (owner: 10Filippo Giunchedi)
[10:53:41] <wikibugs>	 (03PS4) 10Ladsgroup: Update SSH key for abi [puppet] - 10https://gerrit.wikimedia.org/r/903546 (https://phabricator.wikimedia.org/T333298) (owner: 10Abijeet Patro)
[10:53:46] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Update SSH key for abi [puppet] - 10https://gerrit.wikimedia.org/r/903546 (https://phabricator.wikimedia.org/T333298) (owner: 10Abijeet Patro)
[10:53:48] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+2] thumbor-codfw: Fix indentation of nutcracker servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/903614 (owner: 10Alexandros Kosiaris)
[10:55:49] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Update SSH key for abi - https://phabricator.wikimedia.org/T333298 (10Ladsgroup) 05Open→03Resolved you'll have access with the new keys in thirty minutes
[10:57:40] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloud_private_subnet: add route to public IPv4 range [puppet] - 10https://gerrit.wikimedia.org/r/903622 (https://phabricator.wikimedia.org/T324992)
[10:57:42] <wikibugs>	 (03PS1) 10Arturo Borrero Gonzalez: cloud_private_subnet: codfw: relocate some hiera [puppet] - 10https://gerrit.wikimedia.org/r/903623 (https://phabricator.wikimedia.org/T324992)
[10:58:31] <wikibugs>	 (03Merged) 10jenkins-bot: thumbor-codfw: Fix indentation of nutcracker servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/903614 (owner: 10Alexandros Kosiaris)
[10:59:22] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloud_private_subnet: add route to public IPv4 range [puppet] - 10https://gerrit.wikimedia.org/r/903622 (https://phabricator.wikimedia.org/T324992)
[10:59:24] <wikibugs>	 (03PS2) 10Arturo Borrero Gonzalez: cloud_private_subnet: codfw: relocate some hiera [puppet] - 10https://gerrit.wikimedia.org/r/903623 (https://phabricator.wikimedia.org/T324992)
[11:00:22] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: apply
[11:00:27] <wikibugs>	 (03Abandoned) 10Clément Goubert: cpufrequtils: Force reload init script on change [puppet] - 10https://gerrit.wikimedia.org/r/900645 (owner: 10Clément Goubert)
[11:01:39] <wikibugs>	 (03PS3) 10Arturo Borrero Gonzalez: cloud_private_subnet: codfw: relocate some hiera [puppet] - 10https://gerrit.wikimedia.org/r/903623 (https://phabricator.wikimedia.org/T324992)
[11:03:47] <wikibugs>	 (03CR) 10Alexandros Kosiaris: [C: 03+1] admin_ng: increase namespace cpu quota for thumbor, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan)
[11:03:49] <wikibugs>	 (03PS2) 10Ladsgroup: Parsoid-Tests: Give Yiannis and Mateus root rights on parsoid-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/903314 (https://phabricator.wikimedia.org/T333206) (owner: 10Subramanya Sastry)
[11:04:05] <wikibugs>	 (03PS6) 10Slyngshede: P:url_downloader send Squid access logs to Logstash [puppet] - 10https://gerrit.wikimedia.org/r/903265
[11:04:20] <wikibugs>	 (03CR) 10Slyngshede: P:url_downloader send Squid access logs to Logstash (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede)
[11:04:36] <wikibugs>	 (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] Parsoid-Tests: Give Yiannis and Mateus root rights on parsoid-test hosts [puppet] - 10https://gerrit.wikimedia.org/r/903314 (https://phabricator.wikimedia.org/T333206) (owner: 10Subramanya Sastry)
[11:05:47] <wikibugs>	 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Move mbsantos and jgiannelos from parsoid-test-admins to parsoid-test-roots - https://phabricator.wikimedia.org/T333206 (10Ladsgroup) 05Open→03Resolved a:03Ladsgroup Merged and deployed the patch, it should be doable in half an hour.
[11:08:16] <Amir1>	 jouncebot: nowandnext
[11:08:16] <jouncebot>	 No deployments scheduled for the next 1 hour(s) and 51 minute(s)
[11:08:16] <jouncebot>	 In 1 hour(s) and 51 minute(s): UTC afternoon backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1300)
[11:08:17] <jouncebot>	 In 1 hour(s) and 51 minute(s): Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1300)
[11:08:24] <wikibugs>	 (03CR) 10Ladsgroup: [C: 03+2] api: Mark query as read-only to avoid regex on SQL [core] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/903549 (https://phabricator.wikimedia.org/T332942) (owner: 10Ladsgroup)
[11:08:45] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: apply
[11:09:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:12:18] <wikibugs>	 (03CR) 10Slyngshede: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40363/console" [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede)
[11:13:16] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to production for kamila - https://phabricator.wikimedia.org/T332921 (10Ladsgroup) Does this need anything from SRE now? I assume Hugh already took care of the most.
[11:14:32] <wikibugs>	 10SRE, 10SRE-Access-Requests: Requesting access to production for kamila - https://phabricator.wikimedia.org/T332921 (10hnowlan) 05Open→03Resolved
[11:14:50] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+2] admin_ng: increase namespace cpu quota for thumbor, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan)
[11:14:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST events) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[11:19:42] <jinxer-wm>	 (SystemdUnitFailed) firing: (3) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:19:53] <wikibugs>	 (03Merged) 10jenkins-bot: admin_ng: increase namespace cpu quota for thumbor, increase replicas [deployment-charts] - 10https://gerrit.wikimedia.org/r/899654 (https://phabricator.wikimedia.org/T328033) (owner: 10Hnowlan)
[11:20:23] <wikibugs>	 (03PS1) 10Btullis: Disable job submission to YARN queues to facilitate maintenance [puppet] - 10https://gerrit.wikimedia.org/r/903627 (https://phabricator.wikimedia.org/T330165)
[11:21:44] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] START helmfile.d/admin 'apply'.
[11:22:00] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [codfw] DONE helmfile.d/admin 'apply'.
[11:22:06] <wikibugs>	 (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40364/console" [puppet] - 10https://gerrit.wikimedia.org/r/903627 (https://phabricator.wikimedia.org/T330165) (owner: 10Btullis)
[11:22:21] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/admin 'apply'.
[11:22:26] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on ganeti cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=ganeti - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[11:23:10] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/admin 'apply'.
[11:23:29] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) planet_sync_tile_generation-gis.service Failed on maps1009:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:23:29] <jinxer-wm>	 (SystemdUnitFailed) resolved: (2) discard_held_messages.service Failed on lists1003:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:23:29] <jinxer-wm>	 (SystemdUnitFailed) resolved: (3) wmf_auto_restart_ircecho.service Failed on alert2001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:23:39] <jinxer-wm>	 (SystemdUnitFailed) resolved: (5) idm-sync-permissions.service Failed on idm-test1001:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:23:49] <jinxer-wm>	 (SystemdUnitFailed) resolved: (7) train-presync.service Failed on deploy2002:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status?orgId=1&forceLogin&editPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[11:24:07] <wikibugs>	 (03PS1) 10Jbond: O:cluster/management: add ldap::bitu profile to cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/903628
[11:24:42] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[11:24:58] <icinga-wm>	 PROBLEM - Check systemd state on idm2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_apache2-htcacheclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[11:26:02] <wikibugs>	 (03Merged) 10jenkins-bot: api: Mark query as read-only to avoid regex on SQL [core] (wmf/1.41.0-wmf.1) - 10https://gerrit.wikimedia.org/r/903549 (https://phabricator.wikimedia.org/T332942) (owner: 10Ladsgroup)
[11:27:13] <wikibugs>	 (03PS2) 10Jbond: O:cluster::management: add ldap::bitu profile to cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/903628
[11:28:10] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10BTullis)
[11:28:22] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40365/console" [puppet] - 10https://gerrit.wikimedia.org/r/903628 (owner: 10Jbond)
[11:28:36] <wikibugs>	 (03PS5) 10Clément Goubert: service_catalog: Add mw-api-int k8s service [puppet] - 10https://gerrit.wikimedia.org/r/903217 (https://phabricator.wikimedia.org/T333120)
[11:28:49] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede)
[11:29:21] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Peter) 05Resolved→03Open Hmm maybe something needs all to be done on the Grafana side? When @BAbiola-WMF tries to login to Grafana she gets //407:Proxy Authentication Required// or UNEXPECTED_PROXY...
[11:30:32] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10fnegri) I "depooled" dbproxy1019 by following the procedure at https://wikitech.wikimedia.org/w/index.php?title=Portal:Data_Services/Admin/Runbooks/Depool_...
[11:32:57] <logmsgbot>	 !log ladsgroup@deploy2002 Started scap: Backport for [[gerrit:903549|api: Mark query as read-only to avoid regex on SQL (T332942)]]
[11:33:03] <stashbot>	 T332942: Warning: SQLPlatform::isWriteQuery fallback to regex (from ApiQueryRevisions) - https://phabricator.wikimedia.org/T332942
[11:34:02] <wikibugs>	 (03PS1) 10Effie Mouzeli: cpufrequtils: ensure that cpufrequtils is reloaded on governor change [puppet] - 10https://gerrit.wikimedia.org/r/903632
[11:34:24] <logmsgbot>	 !log ladsgroup@deploy2002 ladsgroup: Backport for [[gerrit:903549|api: Mark query as read-only to avoid regex on SQL (T332942)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[11:34:31] <wikibugs>	 (03PS3) 10Slyngshede: sre.ganeti.makevm: run sync-netbox-hiera after creation [cookbooks] - 10https://gerrit.wikimedia.org/r/903596
[11:34:51] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[11:36:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.ganeti.makevm: run sync-netbox-hiera after creation [cookbooks] - 10https://gerrit.wikimedia.org/r/903596 (owner: 10Slyngshede)
[11:37:17] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[11:38:44] <wikibugs>	 10SRE, 10Infrastructure-Foundations: Error creating interfaces in netbox-next - https://phabricator.wikimedia.org/T333292 (10cmooney) @ayounsi I think this error I'm hitting is possibly similar:  ` pynetbox.core.query.RequestError: The request failed with code 500 Internal Server Error: {'error': 'Cable object...
[11:39:43] <wikibugs>	 (03CR) 10Hashar: [C: 03+1] zuul: fix up service enable and ensure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar)
[11:40:44] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Ladsgroup) They aren't showing up in https://ldap.toolforge.org/group/wmf maybe I messed up something in ldap change. Let me double check
[11:40:51] <wikibugs>	 (03CR) 10Hashar: "The parent change https://gerrit.wikimedia.org/r/c/operations/puppet/+/901576/ fixed up the Puppet manifests to ensure all three services " [puppet] - 10https://gerrit.wikimedia.org/r/867673 (https://phabricator.wikimedia.org/T324659) (owner: 10Dzahn)
[11:44:13] <wikibugs>	 (03CR) 10Elukey: [C: 03+1] Disable job submission to YARN queues to facilitate maintenance [puppet] - 10https://gerrit.wikimedia.org/r/903627 (https://phabricator.wikimedia.org/T330165) (owner: 10Btullis)
[11:45:38] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Ladsgroup) They show up in ldap search: ` ladsgroup@mwmaint2002:~$ ldapsearch -x cn=wmf ... member: uid=babiola,ou=people,dc=wikimedia,dc=org `  My guess is that it needs to propagate but let me check...
[11:47:12] <icinga-wm>	 PROBLEM - Router interfaces on cr1-drmrs is CRITICAL: CRITICAL: host 185.15.58.128, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[11:47:26] <logmsgbot>	 !log hnowlan@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[11:47:42] <icinga-wm>	 PROBLEM - OSPF status on cr1-eqiad is CRITICAL: OSPFv2: 6/7 UP : OSPFv3: 6/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[11:48:06] <icinga-wm>	 PROBLEM - BFD status on cr1-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[11:48:18] <wikibugs>	 (PS4) Slyngshede: sre.ganeti.makevm: run sync-netbox-hiera after creation [cookbooks] - https://gerrit.wikimedia.org/r/903596
[11:49:49] <wikibugs>	 (PS2) EoghanGaffney: Add doc host apache/php-fpm logs to kafka [puppet] - https://gerrit.wikimedia.org/r/900410 (https://phabricator.wikimedia.org/T325245)
[11:51:40] <logmsgbot>	 !log ladsgroup@deploy2002 Finished scap: Backport for [[gerrit:903549|api: Mark query as read-only to avoid regex on SQL (T332942)]] (duration: 18m 42s)
[11:51:45] <stashbot>	 T332942: Warning: SQLPlatform::isWriteQuery fallback to regex (from ApiQueryRevisions) - https://phabricator.wikimedia.org/T332942
[11:52:19] <wikibugs>	 (CR) EoghanGaffney: [C: +2] Add doc host apache/php-fpm logs to kafka [puppet] - https://gerrit.wikimedia.org/r/900410 (https://phabricator.wikimedia.org/T325245) (owner: EoghanGaffney)
[11:52:41] <wikibugs>	 (CR) Slyngshede: sre.ganeti.makevm: run sync-netbox-hiera after creation (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/903596 (owner: Slyngshede)
[11:56:22] <elukey>	 !log dist-upgrade kafka-main1002 to debian bullseye - T332013
[11:56:27] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:27] <stashbot>	 T332013: Migrate kafka-main to bullseye - https://phabricator.wikimedia.org/T332013
[11:57:01] <logmsgbot>	 !log elukey@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on kafka-main1002.eqiad.wmnet with reason: stop kafka and dist-upgrade
[11:57:14] <logmsgbot>	 !log elukey@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on kafka-main1002.eqiad.wmnet with reason: stop kafka and dist-upgrade
[11:58:43] <wikibugs>	 (CR) Stevemunene: [C: +1] Disable the gobblin timers temporarily for switch maintenance [puppet] - https://gerrit.wikimedia.org/r/903610 (https://phabricator.wikimedia.org/T330165) (owner: Btullis)
[11:59:09] <wikibugs>	 (CR) Stevemunene: [C: +1] Disable job submission to YARN queues to faciliatate maintenance [puppet] - https://gerrit.wikimedia.org/r/903627 (https://phabricator.wikimedia.org/T330165) (owner: Btullis)
[12:04:51] <wikibugs>	 (PS2) Effie Mouzeli: cpufrequtils: ensure that cpufrequtils is reloded on governor change [puppet] - https://gerrit.wikimedia.org/r/903632
[12:05:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 202.3k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[12:08:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:08:58] <wikibugs>	 (PS8) Ayounsi: Add validator classes for some objects [software/netbox-extras] - https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590)
[12:09:31] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.ganeti.reimage for host aphlict1002.eqiad.wmnet with OS bullseye
[12:10:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[12:10:34] <elukey>	 this is kafka-main1002 being upgraded --^
[12:13:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:14:06] <wikibugs>	 (CR) Effie Mouzeli: "PCC has the expected changes" [puppet] - https://gerrit.wikimedia.org/r/903632 (owner: Effie Mouzeli)
[12:14:25] <wikibugs>	 SRE, Infrastructure-Foundations: Error creating interfaces in netbox-next - https://phabricator.wikimedia.org/T333292 (ayounsi) Open→Resolved a:ayounsi Thanks for the dogfooding :)  I removed the TAG check from the CR see diff: https://gerrit.wikimedia.org/r/c/operations/software/netbox-extra...
[12:14:43] <wikibugs>	 (PS3) Effie Mouzeli: cpufrequtils: ensure that cpufrequtils is reloded on governor change [puppet] - https://gerrit.wikimedia.org/r/903632
[12:15:50] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.peering with action 'configure' for AS: 45295
[12:16:29] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.peering (exit_code=0) with action 'configure' for AS: 45295
[12:17:16] <wikibugs>	 (CR) Volans: [C: +1] "I've not tested it, if it's a noop in the generated results in the repo LGTM." [cookbooks] - https://gerrit.wikimedia.org/r/888051 (owner: Jbond)
[12:17:48] <wikibugs>	 (CR) Elukey: [C: +2] admin: Grant kserve API group read access to deploy user [deployment-charts] - https://gerrit.wikimedia.org/r/903297 (https://phabricator.wikimedia.org/T333174) (owner: Alexandros Kosiaris)
[12:17:50] <wikibugs>	 (CR) Slyngshede: [C: +1] "Minor nit, see inline, looks good otherwise." [cookbooks] - https://gerrit.wikimedia.org/r/860080 (https://phabricator.wikimedia.org/T306661) (owner: Volans)
[12:20:17] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=eqiad%20prometheus/ops&var-kafka_cluster=main-eqiad - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[12:20:44] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] START helmfile.d/admin 'sync'.
[12:20:46] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-staging-codfw] DONE helmfile.d/admin 'sync'.
[12:21:31] <logmsgbot>	 !log eoghan@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on aphlict1002.eqiad.wmnet with reason: host reimage
[12:22:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:24:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) firing: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 250.3k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[12:24:52] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on aphlict1002.eqiad.wmnet with reason: host reimage
[12:26:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on kafka_main cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kafka_main - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:27:18] <icinga-wm>	 PROBLEM - BFD status on cr2-eqsin is CRITICAL: Down: 1 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:27:37] <jinxer-wm>	 (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag
[12:29:13] <wikibugs>	 SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ayounsi)
[12:29:35] <wikibugs>	 (PS1) JMeybohm: Revert "Revert "k8s: Remove 1.16 related code"" [puppet] - https://gerrit.wikimedia.org/r/903560
[12:30:53] <wikibugs>	 (CR) Volans: [C: +1] "LGTM" [cookbooks] - https://gerrit.wikimedia.org/r/903596 (owner: Slyngshede)
[12:31:26] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on kafka_main cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kafka_main - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[12:31:39] <wikibugs>	 (CR) Slyngshede: [C: +2] sre.ganeti.makevm: run sync-netbox-hiera after creation [cookbooks] - https://gerrit.wikimedia.org/r/903596 (owner: Slyngshede)
[12:31:57] <elukey>	 XioNoX: o/ the Singtel transport link between uslfo and eqsin seems down (at least according to BFD), I don't find any scheduled maintenance though
[12:32:28] <wikibugs>	 (CR) Hashar: doc: upgrade php from 7.3 to 7.4 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[12:34:20] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 112
[12:34:47] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 112
[12:36:08] <XioNoX>	 elukey: seems up right now but flapping regularly, looking
[12:36:20] <icinga-wm>	 RECOVERY - BFD status on cr2-eqsin is OK: UP: 14 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:36:28] <logmsgbot>	 !log eoghan@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host aphlict1002.eqiad.wmnet with OS bullseye
[12:37:29] <wikibugs>	 (PS3) David Caro: maintain-dbusers: run isort and black and use pep563 types [puppet] - https://gerrit.wikimedia.org/r/902815 (https://phabricator.wikimedia.org/T303663)
[12:37:35] <wikibugs>	 (PS5) David Caro: maintain-dbusers: refactor [puppet] - https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663)
[12:37:39] <wikibugs>	 (PS5) David Caro: maintain-dbusers: only-users match tool users with or without prefix [puppet] - https://gerrit.wikimedia.org/r/902817 (https://phabricator.wikimedia.org/T332789)
[12:37:43] <wikibugs>	 (PS5) David Caro: maintain-dbusers: allow filtering by account type for maintain [puppet] - https://gerrit.wikimedia.org/r/902818 (https://phabricator.wikimedia.org/T332954)
[12:37:48] <wikibugs>	 (PS7) David Caro: maintain-dbusers: add prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955)
[12:38:10] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 108
[12:38:17] <XioNoX>	 however telxius seems down
[12:38:22] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 108
[12:38:26] <icinga-wm>	 PROBLEM - OSPF status on cr2-eqsin is CRITICAL: OSPFv2: 4/4 UP : OSPFv3: 3/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:38:52] <wikibugs>	 (CR) David Caro: maintain-dbusers: add prometheus metrics (1 comment) [puppet] - https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: David Caro)
[12:39:50] <icinga-wm>	 RECOVERY - OSPF status on cr2-eqsin is OK: OSPFv2: 4/4 UP : OSPFv3: 4/4 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[12:40:28] <wikibugs>	 (Merged) jenkins-bot: sre.ganeti.makevm: run sync-netbox-hiera after creation [cookbooks] - https://gerrit.wikimedia.org/r/903596 (owner: Slyngshede)
[12:41:00] <wikibugs>	 SRE, Infrastructure-Foundations, netops, observability: Investigate Junos Prometheus exporter - https://phabricator.wikimedia.org/T333210 (fgiunchedi) I took a quick look at the exporter and looks good to me too! Also +1 on the general testing/deployment plan  re: SSH from a quick read through th...
[12:42:55] <wikibugs>	 (CR) Arturo Borrero Gonzalez: [V: +1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/903623/40369/" [puppet] - https://gerrit.wikimedia.org/r/903622 (https://phabricator.wikimedia.org/T324992) (owner: Arturo Borrero Gonzalez)
[12:43:20] <wikibugs>	 (CR) CI reject: [V: -1] maintain-dbusers: add prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: David Caro)
[12:43:28] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 108
[12:43:42] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 108
[12:44:06] <jinxer-wm>	 (CirrusSearchJobQueueBacklogTooBig) resolved: CirrusSearch job topic codfw.mediawiki.job.cirrusSearchLinksUpdate is heavily backlogged with 205.4k messages - TODO - https://grafana.wikimedia.org/d/CbmStnlGk/jobqueue-job?orgId=1&var-dc=codfw%20prometheus/k8s&var-job=cirrusSearchLinksUpdate - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJobQueueBacklogTooBig
[12:44:39] <wikibugs>	 (CR) Vgutierrez: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40370/console" [puppet] - https://gerrit.wikimedia.org/r/900704 (https://phabricator.wikimedia.org/T331318) (owner: Giuseppe Lavagetto)
[12:44:43] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.network.debug for Netbox circuit ID 108
[12:44:56] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.network.debug (exit_code=0) for Netbox circuit ID 108
[12:46:06] <XioNoX>	 opened https://phabricator.wikimedia.org/T333342 about telxius
[12:47:01] <wikibugs>	 (CR) Volans: [C: +1] "LGTM if the output works fine for NetOps" [cookbooks] - https://gerrit.wikimedia.org/r/888759 (https://phabricator.wikimedia.org/T329272) (owner: Jbond)
[12:48:15] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (Ladsgroup) They are now in the list in https://ldap.toolforge.org/group/wmf >User:Barakat Ajadi (more info)  Is it fixed now?
[12:50:43] <wikibugs>	 SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ssingh)
[12:50:55] <wikibugs>	 (CR) Btullis: [C: +2] Disable the gobblin timers temporarily for switch maintenance [puppet] - https://gerrit.wikimedia.org/r/903610 (https://phabricator.wikimedia.org/T330165) (owner: Btullis)
[12:52:52] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on durum1002 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[12:52:52] <wikibugs>	 (PS1) EoghanGaffney: Add aphlict role to new vm host [puppet] - https://gerrit.wikimedia.org/r/903641 (https://phabricator.wikimedia.org/T322369)
[12:53:12] <icinga-wm>	 PROBLEM - BFD status on cr2-eqiad is CRITICAL: Down: 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[12:53:17] <wikibugs>	 (PS1) Ayounsi: Depool eqiad frontends for network maintenance [dns] - https://gerrit.wikimedia.org/r/903642 (https://phabricator.wikimedia.org/T330165)
[12:53:52] <wikibugs>	 (PS1) Hashar: wm-checks-api: parse PCC full message [software/gerrit] (deploy/wmf/stable-3.5) - https://gerrit.wikimedia.org/r/903643
[12:54:33] <wikibugs>	 (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40371/console" [puppet] - https://gerrit.wikimedia.org/r/903641 (https://phabricator.wikimedia.org/T322369) (owner: EoghanGaffney)
[12:56:27] <wikibugs>	 (CR) Hashar: "Unrelated to this change, Gerrit shows below the commit message "Error while fetching results for wm-checks-api: TypeError: compiled is nu" [puppet] - https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: Hashar)
[12:56:29] <wikibugs>	 (CR) Ssingh: [C: +1] Depool eqiad frontends for network maintenance [dns] - https://gerrit.wikimedia.org/r/903642 (https://phabricator.wikimedia.org/T330165) (owner: Ayounsi)
[12:56:39] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] START helmfile.d/admin 'sync'.
[12:56:41] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-codfw] DONE helmfile.d/admin 'sync'.
[12:56:54] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] START helmfile.d/admin 'sync'.
[12:56:58] <logmsgbot>	 !log elukey@deploy2002 helmfile [ml-serve-eqiad] DONE helmfile.d/admin 'sync'.
[12:57:49] <wikibugs>	 SRE, SRE-Access-Requests, Lift-Wing, Machine-Learning-Team: Machine Learning team -  k8s resources access - https://phabricator.wikimedia.org/T333174 (elukey) Open→Resolved a:elukey Took the liberty to merge Alexandro's proposal, since the isvc resources don't really contain anything...
[12:58:05] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.discovery.datacenter depool all active/active services in eqiad: eqiad row B switches upgrade - T330165
[12:58:10] <stashbot>	 T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165
[12:58:27] <wikibugs>	 SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row B switches upgr...
[12:58:39] <wikibugs>	 (PS8) David Caro: maintain-dbusers: add prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955)
[12:58:45] <wikibugs>	 (CR) Ayounsi: [C: +2] Depool eqiad frontends for network maintenance [dns] - https://gerrit.wikimedia.org/r/903642 (https://phabricator.wikimedia.org/T330165) (owner: Ayounsi)
[12:59:00] <wikibugs>	 (CR) Volans: "Some minor nits inline, LGTM in general, but I didn't check at all the logic of the export that I leave to netops." [cookbooks] - https://gerrit.wikimedia.org/r/889195 (https://phabricator.wikimedia.org/T329669) (owner: Jbond)
[12:59:47] <wikibugs>	 (CR) Vgutierrez: Serve an HTTP response for measurement domains directly from Varnish (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: Jameel Kaisar)
[12:59:49] <XioNoX>	 !log depool eqiad for network maintenance - T330165
[12:59:54] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:00:04] <jouncebot>	 RoanKattouw, Lucas_WMDE, Urbanecm, awight, TheresNoTime, and taavi: gettimeofday() says it's time for UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1300)
[13:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[13:00:04] <jouncebot>	 Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1300)
[13:00:23] <wikibugs>	 (CR) Vgutierrez: [V: +1 C: +1] trafficserver: make routing to mw on k8s more manageable [puppet] - https://gerrit.wikimedia.org/r/900704 (https://phabricator.wikimedia.org/T331318) (owner: Giuseppe Lavagetto)
[13:00:50] <Lucas_WMDE>	 looks like nothing to deploy indeed
[13:02:36] <wikibugs>	 SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (BTullis)
[13:07:14] <wikibugs>	 (CR) Raymond Ndibe: [C: +1] "LGTM!" [puppet] - https://gerrit.wikimedia.org/r/902818 (https://phabricator.wikimedia.org/T332954) (owner: David Caro)
[13:07:58] <wikibugs>	 (CR) Clément Goubert: [C: +1] cpufrequtils: ensure that cpufrequtils is reloded on governor change [puppet] - https://gerrit.wikimedia.org/r/903632 (owner: Effie Mouzeli)
[13:10:33] <wikibugs>	 (CR) EoghanGaffney: [V: +1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40373/console" [puppet] - https://gerrit.wikimedia.org/r/903641 (https://phabricator.wikimedia.org/T322369) (owner: EoghanGaffney)
[13:16:23] <wikibugs>	 (PS2) JMeybohm: k8s: Remove 1.16 related code (v2) [puppet] - https://gerrit.wikimedia.org/r/903560 (https://phabricator.wikimedia.org/T328291)
[13:16:51] <koi>	 Hi, anyone able to run a maint script for T332241?
[13:16:52] <stashbot>	 T332241: fix Category namespace on gurwiki - https://phabricator.wikimedia.org/T332241
[13:16:57] <wikibugs>	 (PS3) Volans: run_cookbook: fix/improve calls to other cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/902449
[13:17:06] <wikibugs>	 (PS4) Volans: run_cookbook: fix/improve calls to other cookbooks [cookbooks] - https://gerrit.wikimedia.org/r/902449
[13:17:37] <logmsgbot>	 !log akosiaris@cumin1001 END (PASS) - Cookbook sre.discovery.datacenter (exit_code=0) depool all active/active services in eqiad: eqiad row B switches upgrade - T330165
[13:17:46] <stashbot>	 T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165
[13:17:58] <wikibugs>	 SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter depool all active/active services in eqiad: eqiad row B switches upgr...
[13:18:44] <wikibugs>	 (PS29) Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028)
[13:20:01] <wikibugs>	 (CR) Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: Jameel Kaisar)
[13:20:24] <wikibugs>	 (CR) Ssingh: [C: +2] hiera: temporarily removed dns1003 from authdns_servers [puppet] - https://gerrit.wikimedia.org/r/903246 (https://phabricator.wikimedia.org/T330165) (owner: Ssingh)
[13:21:14] <wikibugs>	 (PS1) Phuedx: MetricsPlatform: Fix ContextAttributesFactoryTest failing on prod branch [extensions/EventLogging] (wmf/1.41.0-wmf.2) - https://gerrit.wikimedia.org/r/903562 (https://phabricator.wikimedia.org/T333291)
[13:21:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:21:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:21:22] <wikibugs>	 (CR) Vgutierrez: "looking good, upload tests are happy," [puppet] - https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: Jameel Kaisar)
[13:21:42] <claime>	 koi: We're in the middle of a network maintenance in eqiad, can it wait until it's done?
[13:22:02] <koi>	 definitely :)
[13:22:13] <jayme>	 !incidents
[13:22:14] <sirenbot>	 3513 (ACKED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[13:22:14] <sirenbot>	 3512 (RESOLVED)  NELHigh sre (tcp.timed_out)
[13:22:29] <jayme>	 !ack 3513
[13:22:30] <sirenbot>	 3513 (ACKED)  ProbeDown sre (10.2.1.24 ip4 thumbor:8800 probes/service http_thumbor_ip4 codfw)
[13:22:38] <jayme>	 hnowlan: is that you?
[13:22:52] <wikibugs>	 SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (ssingh)
[13:24:07] <akosiaris>	 I acked it btw
[13:24:22] <jayme>	 I see
[13:24:31] <jayme>	 you know whats going on akosiaris?
[13:24:57] <akosiaris>	 hnowlan is trying to increase capacity
[13:24:57] <hnowlan>	 jayme: not me afaik, looking 
[13:25:18] <akosiaris>	 oh, my bad assumption then
[13:25:33] <akosiaris>	 note this is codfw, so nothing with the eqiad row B upgrade (which hasn't even started yet)
[13:25:36] <hnowlan>	 I did do a push earlier but at like 11:47 and it was rolled back 
[13:25:55] <wikibugs>	 SRE, LDAP-Access-Requests: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (larissagaulia) >>! In T332868#8734469, @Ladsgroup wrote: > They are now in the list in https://ldap.toolforge.org/group/wmf >>User:Barakat Ajadi (more info) >  > Is it fixed now?  No, not yet. Let me w...
[13:26:16] <hnowlan>	 big spike in loads in codfw
[13:26:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:26:28] <hnowlan>	 qps up 4x 
[13:26:41] <akosiaris>	 some upload?
[13:26:48] <jayme>	 matches slow probes
[13:27:32] <wikibugs>	 (CR) Vgutierrez: [C: -1] Serve an HTTP response for measurement domains directly from Varnish (1 comment) [puppet] - https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: Jameel Kaisar)
[13:28:15] <hnowlan>	 eqiad reads down to 0 also? 
[13:28:19] <jayme>	 hnowlan: could it be just because eqiad got depooled and all traffic hits codfw now?
[13:28:41] <hnowlan>	 ah 
[13:28:43] <akosiaris>	 ah yes!
[13:28:49] <hnowlan>	 that'd do it, lmao 
[13:28:51] <akosiaris>	 I depooled eqiad for the row B upgrade
[13:28:58] <hnowlan>	 although it should be able to handle the traffic 
[13:28:59] <jayme>	 fair enough :)
[13:29:08] <icinga-wm>	 PROBLEM - Bird Internet Routing Daemon on dns1003 is CRITICAL: PROCS CRITICAL: 0 processes with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[13:29:15] <sukhe>	 ^ expected
[13:29:20] <akosiaris>	 I was about to ask ;-)
[13:29:32] <sukhe>	 same for durum1002 
[13:30:11] <hnowlan>	 errors are tapering off a bit but it's still high 
[13:30:16] <hnowlan>	 guess it couldn't handle the surge
[13:30:21] <wikibugs>	 (CR) Btullis: [V: +1 C: +2] Disable job submission to YARN queues to faciliatate maintenance [puppet] - https://gerrit.wikimedia.org/r/903627 (https://phabricator.wikimedia.org/T330165) (owner: Btullis)
[13:30:30] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] START helmfile.d/services/thumbor: apply
[13:30:39] <akosiaris>	 hnowlan: ^ trying something 
[13:30:54] <wikibugs>	 (PS1) Slyngshede: C:idm::jobs absent permission sync. [puppet] - https://gerrit.wikimedia.org/r/903647
[13:31:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:31:35] <claime>	 We can maybe get away with depooling the two thumbor hosts that are in row B and repooling the service ?
[13:31:48] <akosiaris>	 experiment failed btw, reverting
[13:32:51] <akosiaris>	 claime: we could, but apparently it's fine now? 
[13:32:58] <akosiaris>	 let's re-evaluate if it alerts again
[13:33:02] <claime>	 yup
[13:33:03] <hnowlan>	 we're not really fine
[13:33:12] <hnowlan>	 partial repooling sgtm, will do 
[13:33:14] <claime>	 thisisfine.png
[13:33:17] <hnowlan>	 we're still high on 5xx
[13:33:40] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=thumbor1001.eqiad.wmnet
[13:33:58] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: name=thumbor1002.eqiad.wmnet
[13:34:18] <jinxer-wm>	 (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:34:34] <jayme>	 yeah, probes are still pretty slow (and flaky)
[13:34:49] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=thumbor,name=eqiad
[13:35:45] <jayme>	 btw. I really like the toggle switch in https://grafana-rw.wikimedia.org/d/Pukjw6cWk/thumbor?orgId=1&from=now-1h&to=now :D
[13:35:53] <wikibugs>	 SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (BTullis)
[13:36:02] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [eqiad] DONE helmfile.d/services/thumbor: apply
[13:36:03] <claime>	 jayme: thumbor? yes.
[13:36:11] <jayme>	 yes, that one :D
[13:36:13] <hnowlan>	 i wish it worked for turning it off
[13:36:20] <claime>	 x)
[13:36:29] <hnowlan>	 I literally just found two more problems with that dashboard during this, such a mess
[13:36:47] <claime>	 I was looking at the error graph going "oh that's not too bad"
[13:36:50] <claime>	 Then I saw it was log10
[13:36:56] <jayme>	 +1
[13:37:11] <hnowlan>	 did I do something wrong with conftool to repool there? not seeing anything coming in yet 
[13:37:26] <akosiaris>	 hnowlan: the DC is depooled in discovery
[13:37:29] <hnowlan>	 oh
[13:37:33] <hnowlan>	 that'd do it
[13:37:34] <akosiaris>	 let me fix that
[13:38:36] <akosiaris>	 hnowlan: it's pooled
[13:39:21] <hnowlan>	 thanks! 
[13:39:22] <wikibugs>	 (PS1) Jbond: idp: failover to codfw for switch upgrade [dns] - https://gerrit.wikimedia.org/r/903648
[13:40:02] <akosiaris>	 ehm, I wasn't clear. It's pooled without me doing anything
[13:40:07] <akosiaris>	 I am not sure it was ever depooled 
[13:40:24] <akosiaris>	 it's not part of the sre.discovery.datacenter cookbook
[13:40:39] <akosiaris>	 oh, I need to pool swift
[13:41:07] <wikibugs>	 SRE, DBA, Data Pipelines, Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (Jelto)
[13:41:16] <hnowlan>	 ohh
[13:41:58] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift-ro,name=eqiad
[13:41:59] <hnowlan>	 it's a shame the scale up on k8s didn't work, it'd actually help a lot with this workload heh
[13:42:05] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=swift,name=eqiad
[13:42:23] <akosiaris>	 hnowlan: check again ;-)
[13:42:26] <akosiaris>	 at least on eqiad
[13:42:31] <akosiaris>	 trick worked after all
[13:42:38] <akosiaris>	 just needed a bit of tickling
[13:42:42] <akosiaris>	 let me upload the patch
[13:42:42] <jayme>	 do we need to depool some swift hosts now from row B?
[13:42:47] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10Ladsgroup) It seems there was another regular job was needed to sync users in ldap with grafana. @fgiunchedi thankfully did a manual kick: ` Amir1: Mar 28 13:39:18 grafana1002 grafana-ldap-users-sync[2...
[13:42:53] <hnowlan>	 akosiaris: oh damn, nice! 
[13:43:18] <akosiaris>	 hnowlan: I'll deploy this real quick to codfw first
[13:43:20] <hnowlan>	 if the same will work in codfw we could take some of the load on k8s right now, given that we're already erroring 
[13:43:30] <akosiaris>	 yeah, that was my thinking
[13:44:20] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10herron)
[13:44:23] <claime>	 jayme: there's one ms-fe host that's depooled, there's apparently nothing to do for ms-be
[13:44:24] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: sync
[13:44:29] <Amir1>	 wait thumbor needs to be tickled?
[13:44:38] <akosiaris>	 Amir1: shush
[13:44:40] <claime>	 So I think we're ok on the swift front
[13:44:44] <jayme>	 claime: thanks for checking/knowing
[13:44:57] <claime>	 jayme: I just checked the rowB task
[13:45:10] <jayme>	 still, thanks :p
[13:45:13] <claime>	 ;)
[13:45:25] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[13:45:40] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for doh5002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:45:40] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for aqs2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:45:40] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for kafka-main2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:45:42] <Amir1>	 my job here never gets boring :D
[13:45:45] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for kubernetes2018:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:45:49] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for doc2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:45:54] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for schema2004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:45:55] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] START helmfile.d/services/thumbor: sync
[13:45:58] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for elastic1057:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:46:03] <jinxer-wm>	 (NodeTextfileStale) firing: (5) Stale textfile for db1103:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:46:07] <akosiaris>	 what are all the NodeTextfileStale alerts?
[13:46:08] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for mw1352:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:46:08] <jayme>	 some thumbor traffic coming back to eqiad
[13:46:09] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10Jelto)
[13:46:12] <jinxer-wm>	 (NodeTextfileStale) firing: (6) Stale textfile for cloudcephmon1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:46:17] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cp1076:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:46:21] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for thumbor1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:46:26] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for thanos-be1004:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:46:29] <godog>	 ah yeah I get it, I'll silence the alerts
[13:46:31] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for mc1047:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:46:31] <claime>	 jinxer 'bout to get floodkicked
[13:46:35] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for kafka-logging1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:46:38] <logmsgbot>	 !log ayounsi@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on 249 hosts with reason: eqiad row B upgrade
[13:46:40] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for restbase1027:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:46:42] <vgutierrez>	 godog: I thought we fixed those stalefile alerts on cp nodes
[13:46:49] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for mw1396:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:46:54] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for wdqs1013:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:46:56] <logmsgbot>	 !log akosiaris@deploy2002 helmfile [codfw] DONE helmfile.d/services/thumbor: sync
[13:46:58] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ganeti1025:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:47:03] <jinxer-wm>	 (NodeTextfileStale) firing: (5) Stale textfile for an-presto1005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:47:05] <godog>	 vgutierrez: that's unrelated
[13:47:07] <vgutierrez>	 ack
[13:47:07] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for ms-be1045:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:47:11] <akosiaris>	 hnowlan: 32 pods in codfw too
[13:47:12] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for arclamp1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:47:19] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for durum3001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:47:20] <godog>	 truly sorry for the spam folks
[13:47:20] <akosiaris>	 should I depool swift in eqiad once more ? 
[13:47:23] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for dns4003:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:47:28] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cp6013:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:47:30] <hnowlan>	 akosiaris: if it's easier or safer then go for it 
[13:47:33] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for pki2002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:47:37] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for lvs2007:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:47:37] <hnowlan>	 I'll try pooling thumbor with a lowish weight
[13:47:40] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=swift,name=eqiad
[13:47:41] <hnowlan>	 thumbor-k8s that is
[13:47:42] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for kubestagemaster2001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:47:43] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=swift-ro,name=eqiad
[13:47:46] <jinxer-wm>	 (NodeTextfileStale) firing: (3) Stale textfile for mw1445:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:47:51] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for aux-k8s-ctrl1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:47:56] <jinxer-wm>	 (NodeTextfileStale) firing: (2) Stale textfile for backup1006:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:47:56] <akosiaris>	 !log depool swift in eqiad for row B upgrade 
[13:48:00] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for cloudelastic1002:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:48:01] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:48:03] <akosiaris>	 going for it, let's see
[13:48:03] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=4; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet
[13:48:05] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for an-druid1005:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:48:10] <jinxer-wm>	 (NodeTextfileStale) firing: Stale textfile for dse-k8s-etcd1001:9100 - https://wikitech.wikimedia.org/wiki/Prometheus#Stale_file_for_node-exporter_textfile - https://grafana.wikimedia.org/d/knkl4dCWz/node-exporter-textfile - https://alerts.wikimedia.org/?q=alertname%3DNodeTextfileStale
[13:48:30] <urandom>	 o/
[13:48:34] <icinga-wm>	 PROBLEM - restbase endpoints health on restbase1033 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) is CRITICAL: Test Get structured talk page for enwiki Salt article returned the unexpected status 503 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:48:52] <claime>	 akosiaris: Is thumbor eqiad going to hit swift codfw ?
[13:49:15] <akosiaris>	 claime: thumbor eqiad shouldn't be receiving traffic any time soon
[13:49:18] <jinxer-wm>	 (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[13:49:19] <logmsgbot>	 !log ayounsi@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on 249 hosts with reason: eqiad row B upgrade
[13:49:38] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: ignore role_owner for NodeTextfileStale [alerts] - 10https://gerrit.wikimedia.org/r/903650
[13:49:38] <claime>	 akosiaris: Did you depool it again?
[13:49:40] <hnowlan>	 5xx way down for codfw thumbor, but that's probably eqiad 
[13:49:43] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=4c1e12e1-9d5e-4447-880a-f0ec09133a64) set by ayounsi@cumin1001 for 2:00:00 on 249 host(s)...
[13:49:46] <akosiaris>	 claime: yes
[13:49:48] <claime>	 ack
[13:49:52] <claime>	 I didn't see in the flood
[13:50:29] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/903632 (owner: 10Effie Mouzeli)
[13:50:34] <icinga-wm>	 RECOVERY - restbase endpoints health on restbase1033 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase
[13:50:44] <jayme>	 claime: AIUI swift is calling thumbor, so if swift is depooled in eqiad, thumbor won't get traffic
[13:51:01] <claime>	 Ah yes it's that way around
[13:51:03] <claime>	 gotcha
[13:51:10] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] sre: ignore role_owner for NodeTextfileStale [alerts] - 10https://gerrit.wikimedia.org/r/903650 (owner: 10Filippo Giunchedi)
[13:51:12] <jayme>	 that's why thumbor was never depooled (by the cookbook) but still did not get traffic
[13:51:26] <claime>	 right yeah
[13:51:33] <wikibugs>	 (03PS30) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028)
[13:51:38] <Lucas_WMDE>	 o/ I could in theory run a maint script for koi now
[13:51:43] <Lucas_WMDE>	 but I assume I shouldn’t do that right now
[13:51:47] <koi>	 thanks!
[13:52:13] <hnowlan>	 interestingly this graph was broken up until a few minutes ago  and is the primary indicator for thumbor overload 🙈 https://grafana-rw.wikimedia.org/d/Pukjw6cWk/thumbor?forceLogin&from=now-30m&orgId=1&refresh=30s&to=now&viewPanel=46
[13:52:15] <jayme>	 claime: akosiari.s did just repool swift in eqiad again as I see it and not touched thumbor. So you did not miss anything in the flood
[13:52:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] idp: failover to codfw for switch upgrade [dns] - 10https://gerrit.wikimedia.org/r/903648 (owner: 10Jbond)
[13:52:28] <wikibugs>	 (03CR) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar)
[13:52:38] <claime>	 jayme: Yeah, what I missed was the actual request flow :')
[13:52:41] <akosiaris>	 jayme: and then I re-depooled it
[13:52:46] <akosiaris>	 it's depooled now btw
[13:52:56] <akosiaris>	 just to be clear
[13:52:57] <jayme>	 yes, yes
[13:53:01] <akosiaris>	 ok
[13:53:27] <jayme>	 I just understood that claime was asking you if you depooled thumbor in eqiad again and you answered "yes"
[13:53:36] <jayme>	 which is not straight away correct :)
[13:53:43] <hnowlan>	 it's slow as hell as far as processing requests is concerned but thumbor-k8s is doing okay. Will tweak the weight a bit higher 
[13:53:48] <akosiaris>	 hnowlan: does the tripling of pods in codfw help ?
[13:53:53] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=5; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet
[13:53:59] <hnowlan>	 akosiaris: oh most definitely
[13:54:03] <claime>	 Want me to push the governor change ?
[13:54:04] <akosiaris>	 ok, I must have some lag
[13:54:04] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10jbond)
[13:54:12] <claime>	 It may help a tad
[13:54:13] <akosiaris>	 thanks for the update
[13:54:20] <hnowlan>	 thumbor in eqiad was serving 5xx errors at this weight on the previous setup 
[13:54:49] <Emperor>	 !log depool ms-fe1010 before switch work T330165
[13:54:55] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:54:56] <stashbot>	 T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165
[13:55:45] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10MatthewVernon)
[13:55:57] <wikibugs>	 (03CR) 10BBlack: Serve an HTTP response for measurement domains directly from Varnish (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar)
[13:56:16] <wikibugs>	 (03PS1) 10Alexandros Kosiaris: thumbor: Set lower requests in pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/903651
[13:56:33] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10BTullis)
[13:57:16] <wikibugs>	 10SRE, 10Traffic, 10Upstream: HAProxy 2.6.10 crashing in the text cluster - https://phabricator.wikimedia.org/T332796 (10Vgutierrez) 2.6.12 has been released https://www.mail-archive.com/haproxy@formilux.org/msg43371.html including the patch that we've been testing in text@ulsfo
[13:58:09] <godog>	 !log depool thanos-fe1002 - T330165
[13:58:16] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:58:31] <hnowlan>	 gonna try bumping weight up a bit again. This is less firefighting than experimentation now that we're safe, so if there are any concerns I can hold off
[13:59:11] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] thumbor: Set lower requests in pods [deployment-charts] - 10https://gerrit.wikimedia.org/r/903651 (owner: 10Alexandros Kosiaris)
[13:59:24] <claime>	 hnowlan: Reiterating the offer to push the performance governor change to k8s nodes if you think that can help
[13:59:33] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10fgiunchedi)
[14:00:03] <hnowlan>	 claime: oops, sorry, forgot to reply - that's safe enough for the k8s nodes in general right? Couldn't hurt
[14:00:16] <claime>	 yep, I don't really see what it could break
[14:00:20] * kamila_ has been wondering about that one... thank you claime
[14:00:28] <claime>	 let's go then
[14:00:58] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] cpufrequtils: ensure that cpufrequtils is reloded on governor change [puppet] - 10https://gerrit.wikimedia.org/r/903632 (owner: 10Effie Mouzeli)
[14:01:12] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] cpufrequtils: ensure that cpufrequtils is reloded on governor change (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903632 (owner: 10Effie Mouzeli)
[14:01:18] <wikibugs>	 (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/903628 (owner: 10Jbond)
[14:01:36] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] "We'll need to land this, then change the branch commit pointer to this new hash." [extensions/EventLogging] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903562 (https://phabricator.wikimedia.org/T333291) (owner: 10Phuedx)
[14:02:00] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10jbond)
[14:02:16] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] O:cluster::management: add ldap::bitu profile to cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/903628 (owner: 10Jbond)
[14:02:25] <wikibugs>	 (03PS31) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028)
[14:02:27] <claime>	 Running puppet on kubernetes physical workers
[14:03:06] <hnowlan>	 <3
[14:03:58] <wikibugs>	 (03CR) 10Jameel Kaisar: Serve an HTTP response for measurement domains directly from Varnish (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/900700 (https://phabricator.wikimedia.org/T332028) (owner: 10Jameel Kaisar)
[14:04:15] <wikibugs>	 10SRE, 10LDAP-Access-Requests: Grant Grafana access to babiola - https://phabricator.wikimedia.org/T332868 (10larissagaulia) 05Open→03Resolved Thanks, everyone. Mission accomplished.
[14:04:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on dse_k8s cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=dse_k8s - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:04:45] <_joe_>	 elukey, klausman ^^
[14:04:52] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40375/console" [puppet] - 10https://gerrit.wikimedia.org/r/903560 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[14:05:02] <claime>	 Probably me
[14:05:04] <claime>	 _joe_: 
[14:05:14] <_joe_>	 yeah without probably :)
[14:05:16] <claime>	 I'll go check
[14:05:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: (2) Puppet has failed on ml_serve cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=ml_serve - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:05:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on kubernetes cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kubernetes - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:05:58] <XioNoX>	 !log reboot eqiad row B for upgrade - T330165
[14:06:04] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:06:06] <stashbot>	 T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165
[14:06:19] <claime>	 I know what's going on, pushing fix
[14:06:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on kubernetes-staging cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kubernetes-staging - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:06:40] <klausman>	 claime: thx!
[14:07:45] <claime>	 To be clear, it's just breaking puppet runs, nothing more
[14:08:04] <icinga-wm>	 PROBLEM - Widespread puppet agent failures on alert1001 is CRITICAL: 0.03224 ge 0.01 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[14:08:12] <claime>	 yes, shush
[14:08:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on ml_staging cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=ml_staging - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:09:02] <icinga-wm>	 PROBLEM - Host asw2-b-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[14:09:22] <TheresNoTime>	 phab being down is expected, yes? iirc there was a reboot or something today..
[14:09:32] <claime>	 Hmm, I can't push to gerrit
[14:09:35] <hashar>	 some switch is being rebooted
[14:09:38] <icinga-wm>	 PROBLEM - Host gerrit.wikimedia.org is DOWN: PING CRITICAL - Packet loss = 100%
[14:09:50] <claime>	 TheresNoTime: Yes, that's expected
[14:09:56] <claime>	 T330165
[14:09:56] <TheresNoTime>	 ack
[14:09:59] <hashar>	 I have lost contint1002 as well (but that is not the primary)
[14:10:10] <icinga-wm>	 PROBLEM - Host ripe-atlas-eqiad IPv6 is DOWN: PING CRITICAL - Packet loss = 100%
[14:10:10] <claime>	 Also expected
[14:10:12] <effie>	 ok there will be some puppet failures, we'll fix them as soon as gerrit is up 
[14:10:25] <XioNoX>	 puppet is disabled fleet wide anyway
[14:10:25] <claime>	 But not being able to push to gerrit, idk why
[14:10:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: (2) Puppet has failed on kubernetes cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kubernetes - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:10:27] <effie>	 I took you all down with me with my systemd vs  systemctl
[14:10:27] <hashar>	 isn't there redundancy between switches? :]
[14:10:37] <claime>	 effie: yes, that was a tricky one
[14:10:48] <Dreamy_Jazz>	 I can't push to gerrit either
[14:10:53] <XioNoX>	 hashar: there is, but are the services redundant?
[14:10:55] <Dreamy_Jazz>	 Or use the gerrit REST API
[14:10:57] <taavi>	 yes..
[14:11:24] <icinga-wm>	 PROBLEM - configured eth on lvs1020 is CRITICAL: ens1f0np0 reporting no carrier. https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[14:11:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: (2) Puppet has failed on kubernetes-staging cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kubernetes-staging - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:11:47] <hashar>	 XioNoX: don't worry :-]
[14:11:48] <icinga-wm>	 PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 212, down: 5, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:11:50] <claime>	 I'll silence the puppet alerts
[14:12:04] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 197, down: 5, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:12:14] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9400 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 5 threshold =0.15 breach: cluster_name: relforge-eqiad-small-alpha, status: yellow, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 5, active_shards: 5, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 5, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:12:17] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:12:17] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:12:23] <jinxer-wm>	 (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:12:26] <icinga-wm>	 PROBLEM - MariaDB Replica IO: analytics-meta-replica on an-coord1002 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@an-coord1001.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on an-coord1001.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:12:32] <icinga-wm>	 PROBLEM - Debmonitor Health Check on debmonitor.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor
[14:12:45] <jinxer-wm>	 (JobUnavailable) firing: (5) Reduced availability for job gerrit in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:12:50] <icinga-wm>	 PROBLEM - ElasticSearch health check for shards on 9200 on relforge1003 is CRITICAL: CRITICAL - elasticsearch inactive shards 170 threshold =0.15 breach: cluster_name: relforge-eqiad, status: yellow, timed_out: False, number_of_nodes: 1, number_of_data_nodes: 1, active_primary_shards: 176, active_shards: 176, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 170, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, num
[14:12:50] <icinga-wm>	 n_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 50.86705202312138 https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:12:56] <icinga-wm>	 ACKNOWLEDGEMENT - MariaDB Replica IO: analytics-meta-replica on an-coord1002 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@an-coord1001.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on an-coord1001.eqiad.wmnet (110 Connection timed out) Btullis T330165 https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_
[14:12:56] <icinga-wm>	 a
[14:12:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) kubernetes1010.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:13:00] <icinga-wm>	 PROBLEM - MariaDB Replica IO: analytics_meta on db1108 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@an-coord1001.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on an-coord1001.eqiad.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:13:00] <icinga-wm>	 PROBLEM - VRRP status on cr1-eqiad is CRITICAL: VRRP CRITICAL - 4 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[14:13:03] <jinxer-wm>	 (KubernetesCalicoDown) firing: ml-serve1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=eqiad%20prometheus%2Fk8s-mlserve&var-instance=ml-serve1002.eqiad.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:13:10] <icinga-wm>	 PROBLEM - haproxy failover on dbproxy1018 is CRITICAL: CRITICAL check_failover servers up 12 down 4: https://wikitech.wikimedia.org/wiki/HAProxy
[14:13:52] <jynus>	 ^ Amir1 not sure if ours, but to review later
[14:13:56] <icinga-wm>	 PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_test_sync.service,netbox_ganeti_drmrs01_sync.service,netbox_ganeti_drmrs02_sync.service,netbox_ganeti_eqsin_sync.service,netbox_ganeti_esams_sync.service,netbox_ganeti_ulsfo_sync.service,netbox_report_coherence_rack_run.service,netbox_report_coherence_run.service,netbox_report_puppetdb_virtual_run.service https://w
[14:13:56] <icinga-wm>	 wikimedia.org/wiki/Monitoring/check_systemd_state
[14:14:11] <Amir1>	 okay thanks
[14:14:34] <icinga-wm>	 PROBLEM - PyBal backends health check on lvs1020 is CRITICAL: PYBAL CRITICAL - CRITICAL - wikireplicas-b-s6_3316: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s5_3315: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s8_3318: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s2_3312: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s
[14:14:34] <icinga-wm>	 Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s1_3311: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s4_3314: Servers dbproxy1019.eqiad.wmnet are marked down but pooled: wikireplicas-b-s7_3317: Servers dbproxy1019.eqiad.wmnet are marked down but pooled https://wikitech.wikimedia.org/wiki/PyBal
[14:14:37] <Amir1>	 I think Manuel told me we need to reload haproxy, I'll do it once the maint is over
[14:14:42] <jinxer-wm>	 (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[14:14:46] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) kubestage1003.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:14:50] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) dse-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:14:57] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on sessionstore cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=sessionstore - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:15:02] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on puppet cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=puppet - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:15:07] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on ganeti cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=ganeti - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:15:44] <marostegui>	 Amir1: correct
[14:15:52] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:16:03] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on wdqs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=wdqs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:16:08] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on prometheus cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=prometheus - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:16:44] <icinga-wm>	 RECOVERY - Host asw2-b-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.84 ms
[14:16:56] <icinga-wm>	 PROBLEM - Host ripe-atlas-eqiad is DOWN: PING CRITICAL - Packet loss = 100%
[14:17:12] <icinga-wm>	 PROBLEM - Host analytics1069 is DOWN: PING CRITICAL - Packet loss = 100%
[14:17:21] <claime>	 taavi: as long as you're opped, can you re-op sirenbot?
[14:17:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on thanos cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=thanos - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:17:31] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:17:36] <jinxer-wm>	 (WidespreadPuppetFailure) firing: (2) Puppet has failed on cache_upload cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:17:38] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_esams_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:17:40] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_ulsfo_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:17:41] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on cache_text cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:17:45] <jinxer-wm>	 (JobUnavailable) firing: (34) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:17:46] <icinga-wm>	 PROBLEM - Host 2620:0:861:2:208:80:154:134 is DOWN: PING CRITICAL - Packet loss = 100%
[14:17:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (5) kubernetes1009.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:17:58] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) ml-serve-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:18:00] <icinga-wm>	 RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:18:16] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:18:20] <icinga-wm>	 RECOVERY - Host 2620:0:861:2:208:80:154:134 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms
[14:18:24] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) firing: (3) Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration  - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:18:26] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9400 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad-small-alpha: cluster_name: relforge-eqiad-small-alpha, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 5, active_shards: 10, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flig
[14:18:26] <icinga-wm>	 : 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:18:36] <icinga-wm>	 RECOVERY - Debmonitor Health Check on debmonitor.wikimedia.org is OK: HTTP OK: Status line output matched HTTP/1.1 302 - 1094 bytes in 0.227 second response time https://wikitech.wikimedia.org/wiki/Debmonitor
[14:18:38] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:18:40] <icinga-wm>	 RECOVERY - MariaDB Replica IO: analytics-meta-replica on an-coord1002 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:19:00] <icinga-wm>	 RECOVERY - Host gerrit.wikimedia.org is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms
[14:19:00] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[14:19:05] <jinxer-wm>	 (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in eqiad (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots
[14:19:06] <icinga-wm>	 RECOVERY - ElasticSearch health check for shards on 9200 on relforge1003 is OK: OK - elasticsearch status relforge-eqiad: cluster_name: relforge-eqiad, status: green, timed_out: False, number_of_nodes: 2, number_of_data_nodes: 2, active_primary_shards: 176, active_shards: 352, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shards: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max
[14:19:06] <icinga-wm>	 _in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration
[14:19:08] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_drmrs02_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:19:09] <jinxer-wm>	 (KubernetesCalicoDown) firing: (2) dse-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:19:16] <icinga-wm>	 RECOVERY - MariaDB Replica IO: analytics_meta on db1108 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[14:19:16] <icinga-wm>	 RECOVERY - VRRP status on cr1-eqiad is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status
[14:19:26] <icinga-wm>	 PROBLEM - Check unit status of netbox_ganeti_drmrs01_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:19:26] <icinga-wm>	 RECOVERY - haproxy failover on dbproxy1018 is OK: OK check_failover servers up 16 down 0: https://wikitech.wikimedia.org/wiki/HAProxy
[14:19:29] <wikibugs>	 (03CR) 10Clément Goubert: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/903652 (owner: 10Clément Goubert)
[14:19:42] <icinga-wm>	 PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.01465 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[14:19:43] <kamila_>	 claime: gerrit is back
[14:19:52] <kamila_>	 (in case the flood is floody)
[14:20:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on swift cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=swift - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:20:28] <wikibugs>	 (03Merged) 10jenkins-bot: MetricsPlatform: Fix ContextAttributesFactoryTest failing on prod branch [extensions/EventLogging] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903562 (https://phabricator.wikimedia.org/T333291) (owner: 10Phuedx)
[14:20:31] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on memcached cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=memcached - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:20:36] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on api_appserver cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:20:41] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on ml_cache cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=ml_cache - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:20:46] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on kafka_test cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kafka_test - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:20:50] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on restbase cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=restbase - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:20:55] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on kafka_jumbo cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=kafka_jumbo - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:21:00] <godog>	 my apologies for the spam
[14:21:00] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on relforge cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=relforge - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:21:05] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on misc cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=misc - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:21:09] <claime>	 kamila_: I know, thanks, I pushed my fix.
[14:21:13] <godog>	 silencing
[14:21:21] <wikibugs>	 (03CR) 10Hnowlan: [C: 03+1] cpufrequtils: fix systemctl call [puppet] - 10https://gerrit.wikimedia.org/r/903652 (owner: 10Clément Goubert)
[14:21:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on ci cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=ci - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:21:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on webperf cluster - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=webperf - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:21:31] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on redis cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=redis - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:21:36] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on etcd cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=etcd - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:21:36] <wikibugs>	 (03CR) 10Clément Goubert: [C: 03+2] cpufrequtils: fix systemctl call [puppet] - 10https://gerrit.wikimedia.org/r/903652 (owner: 10Clément Goubert)
[14:21:41] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on eventschemas cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=eventschemas - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:21:45] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on appserver cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[14:22:08] <icinga-wm>	 RECOVERY - Host ripe-atlas-eqiad is UP: PING OK - Packet loss = 0%, RTA = 0.47 ms
[14:22:17] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:22:18] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:22:23] <jinxer-wm>	 (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at eqiad - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=eqiad%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable
[14:22:34] <icinga-wm>	 RECOVERY - Host ripe-atlas-eqiad IPv6 is UP: PING OK - Packet loss = 0%, RTA = 0.60 ms
[14:22:35] <wikibugs>	 (03PS1) 10Herron: icinga: remove widespread puppet agent alerts [puppet] - 10https://gerrit.wikimedia.org/r/903654 (https://phabricator.wikimedia.org/T288622)
[14:22:45] <jinxer-wm>	 (JobUnavailable) resolved: (34) Reduced availability for job atlas_exporter in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[14:23:03] <jinxer-wm>	 (KubernetesCalicoDown) resolved: (2) ml-serve-ctrl1001.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:23:07] <jinxer-wm>	 (KubernetesCalicoDown) resolved: (5) kubernetes1009.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:23:22] <icinga-wm>	 PROBLEM - Check systemd state on puppetmaster2001 is CRITICAL: CRITICAL - degraded: The following units failed: sync-puppet-volatile.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:23:28] <jinxer-wm>	 (KafkaUnderReplicatedPartitions) resolved: (3) Under replicated partitions for Kafka cluster jumbo-eqiad in eqiad - https://wikitech.wikimedia.org/wiki/Kafka/Administration  - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions
[14:24:41] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[14:24:43] <wikibugs>	 (03PS3) 10JMeybohm: k8s: Remove 1.16 related code (v2) [puppet] - 10https://gerrit.wikimedia.org/r/903560 (https://phabricator.wikimedia.org/T328291)
[14:24:45] <wikibugs>	 (03PS1) 10JMeybohm: k8s: Configure the IPv6 service ip range for apiserver [puppet] - 10https://gerrit.wikimedia.org/r/903655 (https://phabricator.wikimedia.org/T307943)
[14:24:45] <jinxer-wm>	 (KubernetesCalicoDown) resolved: (2) kubestage1003.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:24:49] <jinxer-wm>	 (KubernetesCalicoDown) resolved: (2) dse-k8s-ctrl1002.eqiad.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations  - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown
[14:24:52] <icinga-wm>	 RECOVERY - PyBal backends health check on lvs1020 is OK: PYBAL OK - All pools are healthy https://wikitech.wikimedia.org/wiki/PyBal
[14:25:13] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] k8s: Configure the IPv6 service ip range for apiserver [puppet] - 10https://gerrit.wikimedia.org/r/903655 (https://phabricator.wikimedia.org/T307943) (owner: 10JMeybohm)
[14:25:41] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=THANOS-FE-OLD-FQDN,service=thanos-web
[14:25:53] <logmsgbot>	 !log filippo@cumin1001 conftool action : set/pooled=no; selector: name=thanos-fe1002.eqiad.wmnet,service=thanos-web
[14:26:06] <wikibugs>	 (03PS1) 10Jbond: Revert "idp: failover to codfw for switch upgrade" [dns] - 10https://gerrit.wikimedia.org/r/903564
[14:26:26] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] Revert "idp: failover to codfw for switch upgrade" [dns] - 10https://gerrit.wikimedia.org/r/903564 (owner: 10Jbond)
[14:26:55] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10fgiunchedi)
[14:27:07] <wikibugs>	 (03PS2) 10JMeybohm: k8s: Configure the IPv6 service ip range for apiserver [puppet] - 10https://gerrit.wikimedia.org/r/903655 (https://phabricator.wikimedia.org/T307943)
[14:27:15] <wikibugs>	 (03PS1) 10Btullis: Revert "Disable job submission to YARN queues to faciliatate maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/903565
[14:27:23] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40376/console" [puppet] - 10https://gerrit.wikimedia.org/r/903560 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[14:28:40] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_esams_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_esams_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:28:42] <wikibugs>	 (03PS1) 10Ayounsi: Revert "Depool eqiad frontends for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/903666 (https://phabricator.wikimedia.org/T330165)
[14:28:42] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_ulsfo_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_ulsfo_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:28:44] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "LGTM, i didn't check the script as i assume that has already gone through review but say if not" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/902826 (https://phabricator.wikimedia.org/T266784) (owner: 10CDanis)
[14:28:49] <jinxer-wm>	 (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in eqiad (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable
[14:28:55] <wikibugs>	 (03PS2) 10Ayounsi: Revert "Depool eqiad frontends for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/903666 (https://phabricator.wikimedia.org/T330165)
[14:29:36] <wikibugs>	 (03PS1) 10Filippo Giunchedi: Revert "prometheus1006: depool from alertmanager" [puppet] - 10https://gerrit.wikimedia.org/r/903667
[14:29:36] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_eqsin_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqsin_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:29:52] <wikibugs>	 (03CR) 10Filippo Giunchedi: [C: 03+2] Revert "prometheus1006: depool from alertmanager" [puppet] - 10https://gerrit.wikimedia.org/r/903667 (owner: 10Filippo Giunchedi)
[14:30:06] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_drmrs02_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs02_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:30:24] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_drmrs01_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_drmrs01_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:30:37] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Revert "Disable job submission to YARN queues to faciliatate maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/903565 (owner: 10Btullis)
[14:30:46] <Amir1>	 I did a reload of haproxy on dbproxy1018 and dbproxy1019
[14:30:53] <Amir1>	 let's see
[14:31:03] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Revert "Depool eqiad frontends for network maintenance" [dns] - 10https://gerrit.wikimedia.org/r/903666 (https://phabricator.wikimedia.org/T330165) (owner: 10Ayounsi)
[14:31:52] <sukhe>	 !log run authdns-update to revert eqiad depool
[14:31:59] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:32:21] <wikibugs>	 (03PS1) 10Btullis: Revert "Disable the gobblin timers temporarily for switch maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/903668
[14:32:27] <logmsgbot>	 !log akosiaris@cumin1001 START - Cookbook sre.discovery.datacenter pool all active/active services in eqiad: eqiad row B switches upgrade done - T330165
[14:32:34] <stashbot>	 T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165
[14:32:50] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row B switches upgrad...
[14:33:05] <Amir1>	 hm, 1014 and 1015 need a reload too
[14:33:22] <wikibugs>	 (03PS1) 10Ssingh: Revert "hiera: temporarily removed dns1003 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/903669
[14:34:09] <Lucas_WMDE>	 koi: what’s the maintenance script you needed to run for T332241 anyways? it’s not clear to me from the task
[14:34:10] <stashbot>	 T332241: fix Category namespace on gurwiki - https://phabricator.wikimedia.org/T332241
[14:34:22] <wikibugs>	 (03CR) 10Btullis: [C: 03+2] Revert "Disable the gobblin timers temporarily for switch maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/903668 (owner: 10Btullis)
[14:34:24] <wikibugs>	 (03PS1) 10Andrew Bogott: Revert "clouddumps: make clouddumps1002 the primary during switch maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/903670
[14:34:26] <wikibugs>	 (03PS1) 10Jforrester: Branch commit for wmf/1.41.0-wmf.2 [core] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903657 (https://phabricator.wikimedia.org/T330208)
[14:34:28] <koi>	 Lucas_WMDE, it's "mwscript maintenance/namespaceDupes.php --wiki gurwiki"
[14:34:34] <Lucas_WMDE>	 (it looks like the eqiad row B maintenance is still ongoing, but in principle I could run a maint script after that – I also have a change I’d be interested in backporting)
[14:34:47] <wikibugs>	 (03CR) 10Jforrester: [C: 03+2] Branch commit for wmf/1.41.0-wmf.2 [core] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903657 (https://phabricator.wikimedia.org/T330208) (owner: 10Jforrester)
[14:35:05] <wikibugs>	 (03CR) 10Andrew Bogott: [C: 03+2] Revert "clouddumps: make clouddumps1002 the primary during switch maintenance" [puppet] - 10https://gerrit.wikimedia.org/r/903670 (owner: 10Andrew Bogott)
[14:35:12] <wikibugs>	 (03Abandoned) 10Jforrester: Branch commit for wmf/1.41.0-wmf.2 [core] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/902622 (https://phabricator.wikimedia.org/T330208) (owner: 10TrainBranchBot)
[14:35:16] <icinga-wm>	 RECOVERY - Check systemd state on puppetmaster2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:35:20] <Amir1>	 did we backport the fix Taavi did yesterday? I didn't see the backport
[14:35:24] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on dns1003 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[14:35:34] <Amir1>	 on namespaceDupes
[14:35:49] * Lucas_WMDE doesn’t know anything about that
[14:36:11] <Lucas_WMDE>	 (what I wanted to backport was the SpecialRecentChangesLinked query() fix)
[14:36:48] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10Papaul) @cmooney can we do this on Thursday ? Can we also do the other batches(3-4) on the same day?
[14:36:53] <taavi>	 yes I backported it
[14:37:01] <Lucas_WMDE>	 yeah I can see it on wmf.1
[14:37:06] <Lucas_WMDE>	 (and REL1_40 too)
[14:37:08] <Amir1>	 oh thanks
[14:37:34] <wikibugs>	 (03CR) 10Ssingh: [C: 03+2] Revert "hiera: temporarily removed dns1003 from authdns_servers" [puppet] - 10https://gerrit.wikimedia.org/r/903669 (owner: 10Ssingh)
[14:37:48] <icinga-wm>	 RECOVERY - Check unit status of netbox_ganeti_codfw_test_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_test_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state
[14:37:53] <wikibugs>	 (03PS2) 10Jbond: Revert "idp: failover to codfw for switch upgrade" [dns] - 10https://gerrit.wikimedia.org/r/903564
[14:38:07] <wikibugs>	 (03PS4) 10JMeybohm: k8s: Remove 1.16 related code (v2) [puppet] - 10https://gerrit.wikimedia.org/r/903560 (https://phabricator.wikimedia.org/T328291)
[14:38:08] <wikibugs>	 (03PS3) 10JMeybohm: k8s: Configure the IPv6 service ip range for apiserver [puppet] - 10https://gerrit.wikimedia.org/r/903655 (https://phabricator.wikimedia.org/T307943)
[14:38:29] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=6; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet
[14:38:34] <icinga-wm>	 RECOVERY - Bird Internet Routing Daemon on durum1002 is OK: PROCS OK: 1 process with command name bird https://wikitech.wikimedia.org/wiki/Anycast%23Bird_daemon_not_running
[14:39:08] <icinga-wm>	 RECOVERY - BFD status on cr2-eqiad is OK: UP: 18 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[14:40:25] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] O:cluster::management: add ldap::bitu profile to cluster::management [puppet] - 10https://gerrit.wikimedia.org/r/903628 (owner: 10Jbond)
[14:40:32] <wikibugs>	 (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40377/console" [puppet] - 10https://gerrit.wikimedia.org/r/903560 (https://phabricator.wikimedia.org/T328291) (owner: 10JMeybohm)
[14:40:56] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: name=thumbor100[12].eqiad.wmnet
[14:41:14] <hnowlan>	 (^ restoring ineffective change from during the depool) 
[14:41:15] <wikibugs>	 (03PS1) 10Herron: alertmanager: manage data.retention option [puppet] - 10https://gerrit.wikimedia.org/r/903658
[14:41:39] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] alertmanager: manage data.retention option [puppet] - 10https://gerrit.wikimedia.org/r/903658 (owner: 10Herron)
[14:42:18] <icinga-wm>	 RECOVERY - configured eth on lvs1020 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth
[14:42:56] <wikibugs>	 (03PS2) 10Herron: alertmanager: manage data.retention option [puppet] - 10https://gerrit.wikimedia.org/r/903658
[14:43:12] <sukhe>	 jbond: hi! possible puppet failure: https://puppetboard.wikimedia.org/report/dns1001.wikimedia.org/810719d816acdcfa7d86149dfa2c240d195ab40a ?
[14:46:10] <wikibugs>	 (03PS1) 10Bking: rdf-streaming-updater: raise taskManager mem in dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/903659 (https://phabricator.wikimedia.org/T328675)
[14:46:37] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=8; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet
[14:47:36] <icinga-wm>	 RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[14:48:01] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ops-monitoring-bot) akosiaris@cumin1001 - Cookbook cookbooks.sre.discovery.datacenter pool all active/active services in eqiad: eqiad row B switches upgrad...
[14:48:31] <wikibugs>	 (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add a ConftoolClient class [software/conftool] - 10https://gerrit.wikimedia.org/r/903598 (owner: 10Giuseppe Lavagetto)
[14:48:57] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet
[14:49:47] <wikibugs>	 (03CR) 10DCausse: [C: 03+1] rdf-streaming-updater: raise taskManager mem in dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/903659 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking)
[14:49:57] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi)
[14:50:14] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ayounsi)
[14:50:28] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ayounsi)
[14:50:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi)
[14:50:48] <wikibugs>	 (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: raise taskManager mem in dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/903659 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking)
[14:51:11] <wikibugs>	 (03Merged) 10jenkins-bot: Add a ConftoolClient class [software/conftool] - 10https://gerrit.wikimedia.org/r/903598 (owner: 10Giuseppe Lavagetto)
[14:51:20] <wikibugs>	 (03CR) 10Herron: "Not a ton of documentation I could find about extending silence history, but this looked promising" [puppet] - 10https://gerrit.wikimedia.org/r/903658 (owner: 10Herron)
[14:51:55] <logmsgbot>	 !log akosiaris@cumin1001 END (FAIL) - Cookbook sre.discovery.datacenter (exit_code=93) pool all active/active services in eqiad: eqiad row B switches upgrade done - T330165
[14:52:02] <stashbot>	 T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165
[14:52:53] <wikibugs>	 (03Merged) 10jenkins-bot: Branch commit for wmf/1.41.0-wmf.2 [core] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903657 (https://phabricator.wikimedia.org/T330208) (owner: 10Jforrester)
[14:52:57] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=swift-ro,name=device-analytics
[14:53:26] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=device-analytics,name=eqiad
[14:53:37] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=device-analytics,name=pki
[14:53:53] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=pki,name=eqiad
[14:54:21] <akosiaris>	 interesting that SAL logs show up despite the action being wrong
[14:54:23] <akosiaris>	 anyway
[14:54:52] <logmsgbot>	 !log akosiaris@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=codfw
[14:54:54] <wikibugs>	 (03Merged) 10jenkins-bot: rdf-streaming-updater: raise taskManager mem in dse-k8s [deployment-charts] - 10https://gerrit.wikimedia.org/r/903659 (https://phabricator.wikimedia.org/T328675) (owner: 10Bking)
[14:55:40] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:55:49] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[14:57:11] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes:weight=5; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[14:58:55] <bearloga>	 Kudos to everyone involved in the switches upgrade for minimal downtime of Phab, Gerrit, Hadoop cluster, etc. 👏
[14:59:29] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] deployment_server: ensure Docker is installed [puppet] - 10https://gerrit.wikimedia.org/r/903605 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche)
[14:59:57] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10ayounsi) The switch upgrade itself went smoothly as well, like the other rows.  One issue was that gerrit1001 was missing from the list. This is because th...
[15:01:21] <wikibugs>	 (03CR) 10Jaime Nuche: "Thanks for creating the branch, I'll rerun the train presync." [core] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903657 (https://phabricator.wikimedia.org/T330208) (owner: 10Jforrester)
[15:03:43] <wikibugs>	 (03PS1) 10TrainBranchBot: testwikis wikis to 1.41.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903660 (https://phabricator.wikimedia.org/T330208)
[15:03:45] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.41.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903660 (https://phabricator.wikimedia.org/T330208) (owner: 10TrainBranchBot)
[15:05:11] <wikibugs>	 (03Merged) 10jenkins-bot: testwikis wikis to 1.41.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903660 (https://phabricator.wikimedia.org/T330208) (owner: 10TrainBranchBot)
[15:05:33] <logmsgbot>	 !log jnuche@deploy2002 Started scap: testwikis wikis to 1.41.0-wmf.2  refs T330208
[15:05:37] <logmsgbot>	 !log jnuche@deploy2002 scap failed: CalledProcessError Command '/usr/local/bin/mwscript rebuildLocalisationCache.php --wiki=aawiki --force-version "1.41.0-wmf.2" --no-progress --store-class=LCStoreCDB --threads=30 --lang en  --quiet ' returned non-zero exit status 1. (duration: 00m 03s)
[15:05:39] <stashbot>	 T330208: 1.41.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T330208
[15:07:55] <logmsgbot>	 !log stevemunene@cumin1001 END (PASS) - Cookbook sre.ganeti.reimage (exit_code=0) for host an-test-client1002.eqiad.wmnet with OS bullseye
[15:08:38] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=8; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[15:13:47] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10Ladsgroup) a:03Ladsgroup Thanks. I'm clinic duty this week.
[15:14:23] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/weight=10; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[15:14:36] <icinga-wm>	 RECOVERY - Widespread puppet agent failures on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005868 https://puppetboard.wikimedia.org/nodes?status=failed https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[15:15:16] <logmsgbot>	 !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database
[15:15:19] <logmsgbot>	 !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 14 days, 0:00:00 on vrts2001.codfw.wmnet with reason: installation failed due to read-only database
[15:17:33] <wikibugs>	 (03PS1) 10Jbond: P:ldap::bitu: make the group configurable [puppet] - 10https://gerrit.wikimedia.org/r/903665
[15:18:19] <wikibugs>	 (03PS1) 10Cwhite: logstash: remove envoy deprecated options spamfilter [puppet] - 10https://gerrit.wikimedia.org/r/902625 (https://phabricator.wikimedia.org/T320468)
[15:18:41] <wikibugs>	 (03PS1) 10Ayounsi: Add role_contacts to buster hosts [puppet] - 10https://gerrit.wikimedia.org/r/903686
[15:19:19] <wikibugs>	 (03CR) 10Ayounsi: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/903686 (owner: 10Ayounsi)
[15:19:29] <wikibugs>	 (03CR) 10Jbond: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40378/console" [puppet] - 10https://gerrit.wikimedia.org/r/903665 (owner: 10Jbond)
[15:19:42] <icinga-wm>	 RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.004401 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet
[15:19:58] <wikibugs>	 (03CR) 10Jbond: [C: 03+2] P:ldap::bitu: make the group configurable [puppet] - 10https://gerrit.wikimedia.org/r/903665 (owner: 10Jbond)
[15:20:08] <Lucas_WMDE>	 jouncebot: nowandnext
[15:20:08] <jouncebot>	 No deployments scheduled for the next 0 hour(s) and 39 minute(s)
[15:20:09] <jouncebot>	 In 0 hour(s) and 39 minute(s): Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1600)
[15:20:11] <wikibugs>	 (03PS1) 10Filippo Giunchedi: sre: introduce cluster vs site wide puppet failures [alerts] - 10https://gerrit.wikimedia.org/r/903687 (https://phabricator.wikimedia.org/T294564)
[15:20:20] <logmsgbot>	 !log jnuche@deploy2002 Started scap: testwikis wikis to 1.41.0-wmf.2  refs T330208
[15:20:25] <stashbot>	 T330208: 1.41.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T330208
[15:20:29] <Lucas_WMDE>	 any objections to me running an mw backport and a maintenance script?
[15:20:38] <Lucas_WMDE>	 hm, well, maybe not while jnuche  is scapping wmf.2 w^
[15:20:40] <Lucas_WMDE>	 *^^
[15:21:32] <jnuche>	 Lucas_WMDE: yeah, the presync failed last night so I'm rerunning manually
[15:21:53] <jnuche>	 it can take a bit, sorry for the inconvenience
[15:21:53] <Lucas_WMDE>	 ok
[15:22:07] <Lucas_WMDE>	 no big deal, don’t think either of the things I wanted to do is urgent
[15:22:50] <wikibugs>	 (03CR) 10Dzahn: releases-jenkins: replace Icinga with Prometheus monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[15:22:57] <Lucas_WMDE>	 but it sounds like there might be no time before the puppet window, and I won’t be around after that – koi, if you’re still online for the late backport window, perhaps add your maintenance script run there
[15:23:19] <Lucas_WMDE>	 (I’ll only be around again for tomorrow’s UTC afternoon window, I think)
[15:23:35] <wikibugs>	 (03CR) 10Ayounsi: "https://puppet-compiler.wmflabs.org/output/903686/40379/" [puppet] - 10https://gerrit.wikimedia.org/r/903686 (owner: 10Ayounsi)
[15:24:08] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10Ladsgroup)
[15:24:18] <wikibugs>	 (03CR) 10Jelto: [C: 03+1] "lgtm, however I'm not sure what happens if we run multiple aphlict instances in eqiad at once. Do you have a plan for that? Will aphlict10" [puppet] - 10https://gerrit.wikimedia.org/r/903641 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney)
[15:25:11] <wikibugs>	 (03CR) 10Dzahn: "I really don't think after Andrea did all the work to create doc machines that we should introduce further complication to _avoid_ switchi" [puppet] - 10https://gerrit.wikimedia.org/r/901612 (https://phabricator.wikimedia.org/T322357) (owner: 10Hashar)
[15:25:18] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm thanks" [alerts] - 10https://gerrit.wikimedia.org/r/903687 (https://phabricator.wikimedia.org/T294564) (owner: 10Filippo Giunchedi)
[15:27:29] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] zuul: fix up service enable and ensure (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/901576 (https://phabricator.wikimedia.org/T324659) (owner: 10Hashar)
[15:27:50] <wikibugs>	 (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/903686 (owner: 10Ayounsi)
[15:29:02] <icinga-wm>	 PROBLEM - Check systemd state on db1101 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_prometheus-mysqld-exporter@s7.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:32:22] <wikibugs>	 (03PS5) 10Volans: run_cookbook: fix/improve calls to other cookbooks [cookbooks] - 10https://gerrit.wikimedia.org/r/902449
[15:35:58] <wikibugs>	 (03CR) 10Dzahn: "I can't ssh to deploy-1002.devtools right now for some reason, will try again later." [puppet] - 10https://gerrit.wikimedia.org/r/903605 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche)
[15:36:26] <wikibugs>	 (03CR) 10Raymond Ndibe: maintain-dbusers: only-users match tool users with or without prefix (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902817 (https://phabricator.wikimedia.org/T332789) (owner: 10David Caro)
[15:37:38] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor200*.codfw.wmnet
[15:38:03] <wikibugs>	 (03CR) 10Herron: "adding volans for awareness and in case there are references to alert1001 outside puppet to account for" [puppet] - 10https://gerrit.wikimedia.org/r/899629 (https://phabricator.wikimedia.org/T331882) (owner: 10Herron)
[15:38:16] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=no; selector: service=thumbor,name=thumbor200[3456].codfw.wmnet
[15:40:42] <wikibugs>	 (03PS1) 10DCausse: rdf-streaming-updater: still use PLAINTEXT for kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/903689 (https://phabricator.wikimedia.org/T328675)
[15:43:32] <wikibugs>	 (03CR) 10Bking: [C: 03+2] rdf-streaming-updater: still use PLAINTEXT for kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/903689 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[15:46:46] <icinga-wm>	 PROBLEM - Host cp1082 is DOWN: PING CRITICAL - Packet loss = 100%
[15:47:22] <icinga-wm>	 PROBLEM - Check systemd state on arclamp1001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_apache2-htcacheclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:48:10] <wikibugs>	 (03CR) 10Ahmon Dancy: [C: 03+1] deployment_server: ensure Docker is installed (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903605 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche)
[15:48:16] <icinga-wm>	 PROBLEM - Check systemd state on phab1004 is CRITICAL: CRITICAL - degraded: The following units failed: logrotate.timer https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[15:48:19] <logmsgbot>	 !log btullis@deploy2002 Started deploy [analytics/refinery@6554ec0]: Regular analytics weekly train [analytics/refinery@6554ec0]
[15:48:28] <wikibugs>	 (03Merged) 10jenkins-bot: rdf-streaming-updater: still use PLAINTEXT for kafka [deployment-charts] - 10https://gerrit.wikimedia.org/r/903689 (https://phabricator.wikimedia.org/T328675) (owner: 10DCausse)
[15:49:42] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) firing: (10) Elasticsearch instance elastic1055-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[15:50:17] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=yes; selector: service=thumbor,name=thumbor200[3456].codfw.wmnet
[15:50:26] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[15:50:35] <logmsgbot>	 !log bking@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[15:51:08] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ayounsi)
[15:53:44] <logmsgbot>	 !log btullis@deploy2002 Finished deploy [analytics/refinery@6554ec0]: Regular analytics weekly train [analytics/refinery@6554ec0] (duration: 05m 24s)
[15:54:42] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes101[0123].eqiad.wmnet
[15:54:53] <logmsgbot>	 !log hnowlan@puppetmaster1001 conftool action : set/pooled=inactive; selector: service=thumbor,name=kubernetes201[0123].codfw.wmnet
[15:55:22] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi)
[15:55:38] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops: eqiad/codfw virtual-chassis upgrades - https://phabricator.wikimedia.org/T327248 (10ayounsi)
[15:55:42] <logmsgbot>	 !log btullis@deploy2002 Started deploy [analytics/refinery@6554ec0] (thin): Regular analytics weekly train THIN [analytics/refinery@6554ec0]
[15:55:48] <wikibugs>	 (03CR) 10Volans: "Thanks for the heads up." [puppet] - 10https://gerrit.wikimedia.org/r/899629 (https://phabricator.wikimedia.org/T331882) (owner: 10Herron)
[15:55:51] <logmsgbot>	 !log btullis@deploy2002 Finished deploy [analytics/refinery@6554ec0] (thin): Regular analytics weekly train THIN [analytics/refinery@6554ec0] (duration: 00m 08s)
[15:55:59] <logmsgbot>	 !log btullis@deploy2002 Started deploy [analytics/refinery@6554ec0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@6554ec0]
[15:56:45] <wikibugs>	 (03PS1) 10Ladsgroup: admin: Add Oleksandr Tsyba to ldap [puppet] - 10https://gerrit.wikimedia.org/r/903691 (https://phabricator.wikimedia.org/T333157)
[15:57:31] <logmsgbot>	 !log btullis@deploy2002 Finished deploy [analytics/refinery@6554ec0] (hadoop-test): Regular analytics weekly train TEST [analytics/refinery@6554ec0] (duration: 01m 32s)
[15:57:50] <wikibugs>	 (03PS1) 10Jbond: sre.admin.offboard: (WIP) cookbook to offboard users [cookbooks] - 10https://gerrit.wikimedia.org/r/903692
[15:58:34] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Patch-For-Review: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10Ladsgroup) Hi, I guessed your email address via the WMDE's email pattern, can you please confirm this? https://gerrit.wikimedia.org/r/c/operations...
[15:59:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.admin.offboard: (WIP) cookbook to offboard users [cookbooks] - 10https://gerrit.wikimedia.org/r/903692 (owner: 10Jbond)
[16:00:05] <jouncebot>	 jbond and rzl: (Dis)respected human, time to deploy Puppet request window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1600). Please do the needful.
[16:00:05] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[16:00:32] <inflatador>	 !log bking@cumin1001 unban elastic and cloudelastic nodes post maintenance T330165
[16:00:40] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:00:42] <stashbot>	 T330165: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165
[16:02:03] <jinxer-wm>	 (ProbeDown) firing: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:02:24] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10netops: Bonded interface setup for alert hosts - https://phabricator.wikimedia.org/T333371 (10ayounsi) See guidelines on https://wikitech.wikimedia.org/wiki/Wikimedia_network_guidelines#Servers_uplinks but it's usually not worth it.  We only...
[16:03:30] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor Migration, 10Platform Team Workboards (Platform Engineering Reliability): Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (10hnowlan) After significantly increasing capacity in thumbor-k8s, we serv...
[16:03:37] <wikibugs>	 10SRE, 10SRE-swift-storage, 10Data-Persistence, 10Thumbor Migration, 10Platform Team Workboards (Platform Engineering Reliability): Pooling thumbor-k8s causes spikes in swift 500 errors - https://phabricator.wikimedia.org/T328033 (10hnowlan) 05Open→03Resolved
[16:03:43] <wikibugs>	 (03PS1) 10Jelto: aphlict: pass ensure flags to logrotate timer [puppet] - 10https://gerrit.wikimedia.org/r/903693 (https://phabricator.wikimedia.org/T332869)
[16:03:55] <wikibugs>	 (03PS2) 10Volans: sre.ganeti.makevm: refactor to simplify expansion [cookbooks] - 10https://gerrit.wikimedia.org/r/860080 (https://phabricator.wikimedia.org/T306661)
[16:03:57] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1082.eqiad.wmnet,service=cdn
[16:03:58] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1082.eqiad.wmnet,service=ats-be
[16:04:13] <wikibugs>	 (03CR) 10EoghanGaffney: [V: 03+1] Add aphlict role to new vm host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/903641 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney)
[16:04:46] <wikibugs>	 (03PS9) 10David Caro: maintain-dbusers: add prometheus metrics [puppet] - 10https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955)
[16:04:56] <wikibugs>	 (03CR) 10Herron: alerting_host: failover icinga and alertmanger from eqiad to codfw (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/899629 (https://phabricator.wikimedia.org/T331882) (owner: 10Herron)
[16:05:19] <wikibugs>	 (03PS3) 10Volans: sre.ganeti.makevm: refactor to simplify expansion [cookbooks] - 10https://gerrit.wikimedia.org/r/860080 (https://phabricator.wikimedia.org/T306661)
[16:05:23] <wikibugs>	 (03CR) 10Volans: "addressed comments" [cookbooks] - 10https://gerrit.wikimedia.org/r/860080 (https://phabricator.wikimedia.org/T306661) (owner: 10Volans)
[16:07:03] <jinxer-wm>	 (ProbeDown) resolved: Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[16:07:19] <wikibugs>	 (03PS9) 10Ayounsi: Add validator classes for some objects [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590)
[16:09:15] <bblack>	 !log reboot cp1082 (NIC issues)
[16:09:18] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:10:12] <logmsgbot>	 !log jnuche@deploy2002 Finished scap: testwikis wikis to 1.41.0-wmf.2  refs T330208 (duration: 49m 52s)
[16:10:18] <stashbot>	 T330208: 1.41.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T330208
[16:14:46] <icinga-wm>	 RECOVERY - Host cp1082 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms
[16:18:11] <wikibugs>	 (03PS1) 10Volans: CHANGELOG: add changelogs for release v6.4.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/903697
[16:19:41] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10ayounsi)
[16:19:42] <jinxer-wm>	 (CirrusSearchNodeIndexingNotIncreasing) resolved: (10) Elasticsearch instance elastic1055-production-search-eqiad is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing
[16:20:14] <wikibugs>	 (03PS10) 10Ayounsi: Add validator classes for some objects [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590)
[16:22:08] <wikibugs>	 (03CR) 10Volans: [C: 03+2] CHANGELOG: add changelogs for release v6.4.0 [software/spicerack] - 10https://gerrit.wikimedia.org/r/903697 (owner: 10Volans)
[16:22:26] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on cache_upload cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[16:27:41] <jinxer-wm>	 (WidespreadPuppetFailure) firing: (2) Puppet has failed on cache_upload cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[16:31:54] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10Krd) https://commons.wikimedia.org/wiki/...
[16:32:59] <wikibugs>	 10SRE-swift-storage, 10Commons, 10MediaWiki-File-management, 10MediaWiki-Page-deletion, and 4 others: Some files cannot be deleted "Error deleting file: An unknown error occurred in storage backend "local-multiwrite". " - https://phabricator.wikimedia.org/T244567 (10Krd) https://commons.wikimedia.org/wiki/...
[16:34:56] <icinga-wm>	 PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:36:16] <icinga-wm>	 PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:36:37] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10MatthewVernon)
[16:36:59] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10MatthewVernon)
[16:43:46] <icinga-wm>	 RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.259 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:44:20] <icinga-wm>	 RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.066 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[16:47:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[16:52:16] <volans>	 !log uploaded spicerack_6.4.0 to apt.wikimedia.org bullseye-wikimedia (but I'll deploy it to the cumin hosts tomorrow)
[16:52:20] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:55:17] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1082.eqiad.wmnet,service=cdn
[16:55:17] <logmsgbot>	 !log sukhe@puppetmaster1001 conftool action : set/pooled=yes; selector: name=cp1082.eqiad.wmnet,service=ats-be
[17:00:05] <jouncebot>	 Deploy window MediaWiki infrastructure (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1700)
[17:02:21] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[17:02:41] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on cache_upload cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=cache_upload - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:02:42] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[17:05:14] <icinga-wm>	 PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:07:08] <icinga-wm>	 RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[17:10:37] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10netops: Bonded interface setup for alert hosts - https://phabricator.wikimedia.org/T333371 (10herron) 05Open→03Declined Thanks, fwiw I added a talk topic on wiki in hopes that link redundancy can be explored the next time switch upgrades/...
[17:12:26] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on lvs cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=lvs - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:16:44] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[17:16:54] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[17:17:09] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10Papaul) We will have to first upgrade the firmware on this server . Most of the time the firmware upgrade might help on  1 - resolving this issue  2- providing also in the idrac l...
[17:17:26] <jinxer-wm>	 (WidespreadPuppetFailure) firing: Puppet has failed on prometheus cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=prometheus - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:19:27] <wikibugs>	 10SRE, 10ops-eqiad, 10DC-Ops, 10serviceops-collab: Q3:rack/setup/install gerrit1003 - https://phabricator.wikimedia.org/T326366 (10Papaul) a:05Cmjohnson→03Papaul
[17:19:29] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) a:05Cmjohnson→03Papaul
[17:20:54] <icinga-wm>	 PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The following units failed: produce_canary_events.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:30:20] <icinga-wm>	 RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[17:52:26] <jinxer-wm>	 (WidespreadPuppetFailure) resolved: Puppet has failed on prometheus cluster no resources reported - https://puppetboard.wikimedia.org/nodes?status=failed - https://grafana.wikimedia.org/d/yOxVDGvWk/puppet?orgId=1&viewPanel=3&var-cluster=prometheus - https://alerts.wikimedia.org/?q=alertname%3DWidespreadPuppetFailure
[17:57:48] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] START helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[17:57:55] <logmsgbot>	 !log dcausse@deploy2002 helmfile [dse-k8s-eqiad] DONE helmfile.d/dse-k8s-services/rdf-streaming-updater: apply
[18:00:05] <jouncebot>	 dduvall and dancy: Dear deployers, time to do the MediaWiki train - Utc-7 Version deploy. Don't look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1800).
[18:00:24] <dancy>	 o/
[18:06:52] <icinga-wm>	 PROBLEM - Postgres Replication Lag on puppetdb2002 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB puppetdb (host:localhost) 186846552 and 3 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:08:36] <wikibugs>	 10SRE, 10SRE-Access-Requests: Grant Access to analytics_privatedata_users for FNavas-foundation - https://phabricator.wikimedia.org/T331482 (10FNavas-foundation) hi @Ottomata - yes they are two supersets i need to get into [[ https://superset.wikimedia.org/superset/dashboard/riskobservatory |1 ]] & [[ https://...
[18:08:48] <icinga-wm>	 RECOVERY - Postgres Replication Lag on puppetdb2002 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB puppetdb (host:localhost) 797408 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[18:21:59] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[18:23:54] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new db nodes - pt1979@cumin2002"
[18:25:06] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for new db nodes - pt1979@cumin2002"
[18:25:06] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[18:28:03] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED
[18:28:36] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1208.mgmt.eqiad.wmnet with reboot policy FORCED
[18:32:04] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED
[18:32:08] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1208.mgmt.eqiad.wmnet with reboot policy FORCED
[18:33:55] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED
[18:36:47] <logmsgbot>	 !log xcollazo@deploy2002 Started deploy [airflow-dags/platform_eng@0f1c9e8]: Deploy latest image_suggestions on platform_eng Airflow instance
[18:37:07] <logmsgbot>	 !log xcollazo@deploy2002 Finished deploy [airflow-dags/platform_eng@0f1c9e8]: Deploy latest image_suggestions on platform_eng Airflow instance (duration: 00m 20s)
[18:37:09] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED
[18:38:54] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10netops: Bonded interface setup for alert hosts - https://phabricator.wikimedia.org/T333371 (10cmooney) Yeah I tend to agree, with one top-of-rack switch two connections only protects against link failure (as they both land on the same switch)...
[18:39:17] <wikibugs>	 (03CR) 10Raymond Ndibe: maintain-dbusers: refactor (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902816 (https://phabricator.wikimedia.org/T303663) (owner: 10David Caro)
[18:40:57] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED
[18:41:32] <wikibugs>	 (03PS5) 10Ryan Kemper: [WIP] wdqs: test new metric option [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/900430 (https://phabricator.wikimedia.org/T328306)
[18:42:18] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED
[18:43:10] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED
[18:45:34] <wikibugs>	 10SRE, 10Domains, 10Traffic: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (10BCornwall)
[18:46:21] <wikibugs>	 10SRE, 10Domains, 10Traffic: Register wiki(m|p)edia.ro - https://phabricator.wikimedia.org/T222080 (10BCornwall) 05Open→03Stalled p:05Medium→03Low
[18:57:51] <wikibugs>	 (03PS3) 10Jbond: sre.admin.offboard: (WIP) cookbook to offboard users [cookbooks] - 10https://gerrit.wikimedia.org/r/903692
[18:59:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.admin.offboard: (WIP) cookbook to offboard users [cookbooks] - 10https://gerrit.wikimedia.org/r/903692 (owner: 10Jbond)
[19:13:49] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Observability-Alerting, 10netops: Bonded interface setup for alert hosts - https://phabricator.wikimedia.org/T333371 (10herron) >>! In T333371#8736041, @cmooney wrote: > In the case of a server failure do the alert hosts fail over?    Not automatically at the present...
[19:15:26] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+1] "LGTM!  All the ops I was trying on netbox-next are working with the latest patchset." [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/889958 (https://phabricator.wikimedia.org/T310590) (owner: 10Ayounsi)
[19:16:36] <wikibugs>	 10SRE, 10LDAP-Access-Requests, 10WMF-NDA-Requests, 10Patch-For-Review: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T333157 (10WMDE-leszek) It is correct email address
[19:19:27] <logmsgbot>	 !log dduvall@deploy2002 Started scap: testwikis wikis to 1.41.0-wmf.2  refs T330208
[19:19:34] <stashbot>	 T330208: 1.41.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T330208
[19:23:16] <wikibugs>	 (03CR) 10Cathal Mooney: "In general the approach here looks ok to me.  I'm not overly familiar with the existing puppet profile for the Bird config, but as it's ba" [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[19:24:40] <wikibugs>	 (03CR) 10Cathal Mooney: "I'll leave it to Arzhel to +1 as he's the most knowledgeable on the Bird Anycast vars.  But for my part happy for this to be merged and pr" [puppet] - 10https://gerrit.wikimedia.org/r/868731 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[19:26:51] <logmsgbot>	 !log dduvall@deploy2002 Finished scap: testwikis wikis to 1.41.0-wmf.2  refs T330208 (duration: 07m 24s)
[19:26:57] <stashbot>	 T330208: 1.41.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T330208
[19:29:16] <logmsgbot>	 !log dduvall@deploy2002 Pruned MediaWiki: 1.40.0-wmf.27 (duration: 02m 11s)
[19:29:23] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10netops, 10cloud-services-team (FY2022/2023-Q3): Configure cloudsw1-b1-codfw and migrate cloud hosts in codfw B1 to it - https://phabricator.wikimedia.org/T327919 (10cmooney) >>! In T327919#8735024, @Papaul wrote: > @cmooney can we do this on Thursday ? Can we also do...
[19:39:06] <wikibugs>	 (03PS4) 10Jbond: sre.admin.offboard: (WIP) cookbook to offboard users [cookbooks] - 10https://gerrit.wikimedia.org/r/903692
[19:41:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] sre.admin.offboard: (WIP) cookbook to offboard users [cookbooks] - 10https://gerrit.wikimedia.org/r/903692 (owner: 10Jbond)
[19:44:18] <hashar>	 jouncebot: now
[19:44:19] <jouncebot>	 For the next 0 hour(s) and 15 minute(s): MediaWiki train - Utc-7 Version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T1800)
[19:49:49] <wikibugs>	 10SRE, 10DBA, 10Data-Engineering, 10Infrastructure-Foundations, and 9 others: eqiad row C switches upgrade - https://phabricator.wikimedia.org/T331882 (10Eevans)
[19:50:49] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Only run edit check on main namespace [extensions/VisualEditor] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903684
[19:51:39] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10Eevans)
[19:52:12] <wikibugs>	 10SRE, 10DNS, 10Traffic-Icebox, 10Wikimedia-Language-setup, 10Patch-For-Review: nan and minnan subdomain redirects are a mess - https://phabricator.wikimedia.org/T86915 (10BCornwall)
[19:52:48] <wikibugs>	 10SRE, 10DNS, 10Traffic-Icebox, 10Wikimedia-Language-setup, 10Patch-For-Review: Chinese subdomain redirect improvements - https://phabricator.wikimedia.org/T86915 (10BCornwall)
[19:54:21] <wikibugs>	 10SRE, 10DNS, 10Traffic, 10Wikimedia-Language-setup, 10Patch-For-Review: Chinese subdomain redirect improvements - https://phabricator.wikimedia.org/T86915 (10BCornwall)
[19:54:42] <wikibugs>	 10SRE, 10DNS, 10Traffic, 10Wikimedia-Language-setup, 10Patch-For-Review: Chinese subdomain redirect improvements - https://phabricator.wikimedia.org/T86915 (10BCornwall) I've updated the description to accurately reflect the current issues. Note that per T230382 there are no longer minnan/zh-cfr aliases.
[19:56:50] <wikibugs>	 (03PS5) 10BCornwall: Add redirects for https://nan.wik{tionary,iquote,ibooks,isource}.org [puppet] - 10https://gerrit.wikimedia.org/r/529418 (https://phabricator.wikimedia.org/T86915) (owner: 10Fomafix)
[19:58:42] <wikibugs>	 (03CR) 10Volans: setup.py: update dnspython requierments to match spicerack (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/903734 (owner: 10Jbond)
[19:59:14] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Enable hidden tag for "Edit Check" project on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903759 (https://phabricator.wikimedia.org/T324733)
[19:59:26] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1] "PCC SUCCESS (NOOP 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40383/console" [puppet] - 10https://gerrit.wikimedia.org/r/529418 (https://phabricator.wikimedia.org/T86915) (owner: 10Fomafix)
[19:59:32] <MatmaRex>	 jouncebot: next
[19:59:32] <jouncebot>	 In 0 hour(s) and 0 minute(s): UTC late backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T2000)
[19:59:39] <MatmaRex>	 i have some patches, one sec :)
[20:00:04] <jouncebot>	 RoanKattouw, Urbanecm, cjming, TheresNoTime, kindrobot, and taavi: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T2000).
[20:00:04] <jouncebot>	 No Gerrit patches in the queue for this window AFAICS.
[20:01:08] <TheresNoTime>	 o/
[20:01:08] <MatmaRex>	 updated: https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230328T2000
[20:01:12] <urbanecm>	 MatmaRex: if you've patches, i can deploy for you tonight :)
[20:01:44] <TheresNoTime>	 (go ahead)
[20:01:49] <MatmaRex>	 thanks
[20:01:53] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Only run edit check on main namespace [extensions/VisualEditor] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903684 (owner: 10Bartosz Dziewoński)
[20:01:59] <wikibugs>	 (03CR) 10Urbanecm: [C: 03+2] Change name of the editcheck-needreference tag to editcheck-references [extensions/VisualEditor] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903685 (owner: 10Bartosz Dziewoński)
[20:02:22] <urbanecm>	 MatmaRex: the config patch seems to depend on the backport(s). is that right?
[20:02:38] <MatmaRex>	 yes. i can't really test, since the feature isn't deployed anywhere yet
[20:02:46] <urbanecm>	 okay
[20:02:47] <MatmaRex>	 but we wanted it to roll out with the train this week
[20:03:04] <urbanecm>	 so then it'd be at testwiki by now (that has wmf.2 now)?
[20:03:15] <MatmaRex>	 yes
[20:03:21] <urbanecm>	 ok
[20:03:29] <MatmaRex>	 i guess if both the backport and the config are deployed, i could test it there
[20:03:33] <wikibugs>	 (03CR) 10Cathal Mooney: "Ought to work well.  In terms of naming I think we should make it clear that 185.15.57.24/29 is for public vips.  We can assign private VI" [puppet] - 10https://gerrit.wikimedia.org/r/903622 (https://phabricator.wikimedia.org/T324992) (owner: 10Arturo Borrero Gonzalez)
[20:04:29] <urbanecm>	 okay, i can do both at once, no problem
[20:07:50] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[20:08:42] <wikibugs>	 10SRE, 10DNS, 10Traffic, 10Wikimedia-Language-setup, 10Patch-For-Review: zh-min-nan.wikinews.org redirects to unprefixed incubator - https://phabricator.wikimedia.org/T86915 (10BCornwall)
[20:09:03] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:09:08] <wikibugs>	 10SRE, 10DNS, 10Traffic, 10Wikimedia-Language-setup, 10Patch-For-Review: zh-min-nan.wikinews.org redirects to unprefixed incubator - https://phabricator.wikimedia.org/T86915 (10BCornwall) Further trimmed some stuff as T173966 is tracking the redirects.
[20:16:22] <logmsgbot>	 !log cmooney@cumin1001 START - Cookbook sre.dns.netbox
[20:17:32] <logmsgbot>	 !log cmooney@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[20:18:36] <wikibugs>	 (03Merged) 10jenkins-bot: Only run edit check on main namespace [extensions/VisualEditor] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903684 (owner: 10Bartosz Dziewoński)
[20:18:42] <wikibugs>	 (03Merged) 10jenkins-bot: Change name of the editcheck-needreference tag to editcheck-references [extensions/VisualEditor] (wmf/1.41.0-wmf.2) - 10https://gerrit.wikimedia.org/r/903685 (owner: 10Bartosz Dziewoński)
[20:18:46] <wikibugs>	 (03PS6) 10BCornwall: Add nan to zh-min-nan redirects [puppet] - 10https://gerrit.wikimedia.org/r/529418 (https://phabricator.wikimedia.org/T173966) (owner: 10Fomafix)
[20:19:52] <wikibugs>	 (03PS7) 10BCornwall: Add nan to zh-min-nan redirects [puppet] - 10https://gerrit.wikimedia.org/r/529418 (https://phabricator.wikimedia.org/T173966) (owner: 10Fomafix)
[20:21:58] <icinga-wm>	 RECOVERY - Router interfaces on cr1-drmrs is OK: OK: host 185.15.58.128, interfaces up: 58, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[20:22:50] <icinga-wm>	 RECOVERY - BFD status on cr1-eqiad is OK: UP: 24 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[20:23:10] <icinga-wm>	 RECOVERY - OSPF status on cr1-eqiad is OK: OSPFv2: 7/7 UP : OSPFv3: 7/7 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[20:23:28] <wikibugs>	 10SRE, 10ops-eqiad, 10SRE Observability (FY2022/2023-Q3): Decommission centrallog1001 - https://phabricator.wikimedia.org/T328803 (10lmata) @cmooney thank you!
[20:24:59] <wikibugs>	 (03CR) 10BCornwall: [V: 03+1 C: 03+2] "Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/529418 (https://phabricator.wikimedia.org/T173966) (owner: 10Fomafix)
[20:27:08] <logmsgbot>	 !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@e6febfd]: increase dynamic partition limit when importing cirrus indexes
[20:27:22] <logmsgbot>	 !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@e6febfd]: increase dynamic partition limit when importing cirrus indexes (duration: 00m 13s)
[20:31:17] <MatmaRex>	 urbanecm: the backports merged btw
[20:31:35] <urbanecm>	 MatmaRex: thanks for the ping & apologies, i somewhat totally missed that.
[20:31:43] <wikibugs>	 10SRE, 10DNS, 10Traffic-Icebox, 10Wikimedia-Apache-configuration: Like nan.wikipedia.org, redirect other nan.*.org to the proper zh-min-nan.*.org domains - https://phabricator.wikimedia.org/T173966 (10BCornwall) 05Open→03Resolved a:03BCornwall Thank you for the patch and for your patience, @Fomafix!...
[20:32:05] <MatmaRex>	 :D easy thing to do when it takes half an hour
[20:32:19] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by urbanecm@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903759 (https://phabricator.wikimedia.org/T324733) (owner: 10Bartosz Dziewoński)
[20:32:19] <MatmaRex>	 i only checked just now myself
[20:32:35] <urbanecm>	 yup yup. scap'll ping once config+backports are at mwdebug.
[20:32:54] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Data-Persistence, 10SRE Observability, 10observability, and 2 others: Database alerting - https://phabricator.wikimedia.org/T172492 (10lmata)
[20:33:13] <wikibugs>	 (03Merged) 10jenkins-bot: Enable hidden tag for "Edit Check" project on Wikipedias [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903759 (https://phabricator.wikimedia.org/T324733) (owner: 10Bartosz Dziewoński)
[20:33:31] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Data-Persistence, 10Observability-Alerting, 10observability, and 2 others: Database alerting - https://phabricator.wikimedia.org/T172492 (10lmata)
[20:34:22] <logmsgbot>	 !log urbanecm@deploy2002 Started scap: Backport for [[gerrit:903684|Only run edit check on main namespace]], [[gerrit:903685|Change name of the editcheck-needreference tag to editcheck-references]], [[gerrit:903759|Enable hidden tag for "Edit Check" project on Wikipedias (T324733)]]
[20:34:28] <stashbot>	 T324733: Introduce a tag to identify edits that meet the Edit Check heuristic   - https://phabricator.wikimedia.org/T324733
[20:34:37] <urbanecm>	 MatmaRex: can you try to test now? :)
[20:34:57] <MatmaRex>	 yeah
[20:37:01] <MatmaRex>	 ughhhh testwiki has some edit filters that are preventing me from editing. need a minute
[20:37:42] <wikibugs>	 (03CR) 10Cathal Mooney: Remove EventGate Icinga checks that have been moved to alertmanager (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902703 (https://phabricator.wikimedia.org/T309009) (owner: 10Cathal Mooney)
[20:37:56] <wikibugs>	 (03CR) 10Cathal Mooney: [C: 03+2] Remove Eventlogging prometheus-based Icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/902454 (https://phabricator.wikimedia.org/T309007) (owner: 10Cathal Mooney)
[20:38:05] <wikibugs>	 10SRE, 10Observability-Logging, 10Release-Engineering-Team, 10Wikimedia-Logstash, 10SRE Observability (FY2022/2023-Q3): mediawiki-errors logstash dashboard's "errors over time" panel broken - https://phabricator.wikimedia.org/T332273 (10lmata) thanks @colewhite!
[20:38:26] <urbanecm>	 MatmaRex: i disabled the filter you were hitting. it was marked as "testing" and untouched since '21, so should be fine.
[20:39:42] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Mail, 10Observability-Logging, and 2 others: Graph outbound mail volume on per-service or hostgroup level - https://phabricator.wikimedia.org/T197171 (10lmata)
[20:41:16] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10Infrastructure-Foundations, 10Mail, 10Observability-Alerting, and 2 others: Improve outbound mail service alerting - https://phabricator.wikimedia.org/T197172 (10lmata)
[20:41:32] <urbanecm>	 MatmaRex: and apologies, i pinged too early... seems it's not ready yet, it only started pulling it to mwdebug :-/
[20:42:05] <MatmaRex>	 thanks, i was just trying to figure out why it didn't work
[20:42:08] <MatmaRex>	 no problem
[20:46:55] <wikibugs>	 10SRE, 10Commons, 10Traffic: Specific PNG thumbnail of SVG file is outdated / stuck (European caching cluster) - https://phabricator.wikimedia.org/T333042 (10Lionel_Scheepmans) Hi folks.  I'm in front of a very strange phenomenon probably linked to this bug, and this time it concerns a PDF File.  So.  Go to...
[20:49:22] <urbanecm>	 ...the new scap backport sometimes does take a while
[20:49:40] <wikibugs>	 10SRE, 10DNS, 10Wikimedia-Language-setup, 10Patch-For-Review: zh-min-nan.wikinews.org redirects to unprefixed incubator - https://phabricator.wikimedia.org/T86915 (10BCornwall)
[20:51:10] <logmsgbot>	 !log urbanecm@deploy2002 urbanecm and matmarex: Backport for [[gerrit:903684|Only run edit check on main namespace]], [[gerrit:903685|Change name of the editcheck-needreference tag to editcheck-references]], [[gerrit:903759|Enable hidden tag for "Edit Check" project on Wikipedias (T324733)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet
[20:51:16] <stashbot>	 T324733: Introduce a tag to identify edits that meet the Edit Check heuristic   - https://phabricator.wikimedia.org/T324733
[20:51:17] <urbanecm>	 finally!
[20:51:20] <urbanecm>	 MatmaRex: now it should work
[20:52:37] <MatmaRex>	 heh
[20:53:35] <wikibugs>	 10SRE, 10DBA, 10Data Pipelines, 10Data-Engineering-Planning, and 10 others: eqiad row B switches upgrade - https://phabricator.wikimedia.org/T330165 (10colewhite)
[20:53:55] <MatmaRex>	 and it does! thanks urbanecm
[20:54:32] <urbanecm>	 awesome! syncing
[20:56:18] <wikibugs>	 10SRE, 10Data-Engineering, 10Data-Persistence, 10Discovery-Search, and 7 others: eqiad row D switches upgrade - https://phabricator.wikimedia.org/T333377 (10colewhite)
[20:56:35] <wikibugs>	 (03PS1) 10Herron: grizzly: adapt slo dashboards to 0.2 metadata approach [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/903776 (https://phabricator.wikimedia.org/T332895)
[20:57:22] <wikibugs>	 (03PS1) 10BCornwall: pybal: Add runbook link to alert [alerts] - 10https://gerrit.wikimedia.org/r/903777 (https://phabricator.wikimedia.org/T310933)
[21:03:15] <logmsgbot>	 !log urbanecm@deploy2002 Finished scap: Backport for [[gerrit:903684|Only run edit check on main namespace]], [[gerrit:903685|Change name of the editcheck-needreference tag to editcheck-references]], [[gerrit:903759|Enable hidden tag for "Edit Check" project on Wikipedias (T324733)]] (duration: 28m 53s)
[21:03:21] <stashbot>	 T324733: Introduce a tag to identify edits that meet the Edit Check heuristic   - https://phabricator.wikimedia.org/T324733
[21:03:22] <urbanecm>	 MatmaRex: finally live. thanks for your patience
[21:03:27] <urbanecm>	 anything else?
[21:03:38] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED
[21:03:46] <MatmaRex>	 thanks urbanecm
[21:03:51] <urbanecm>	 any time
[21:04:44] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED
[21:05:12] <logmsgbot>	 !log phedenskog@deploy2002 Started deploy [performance/navtiming@4d22874]: (no justification provided)
[21:05:18] <logmsgbot>	 !log phedenskog@deploy2002 Finished deploy [performance/navtiming@4d22874]: (no justification provided) (duration: 00m 06s)
[21:05:33] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: Create and deploy per-CDN-site DNS domains - https://phabricator.wikimedia.org/T332025 (10BCornwall) 05Open→03Resolved a:03BCornwall Thanks @JameelKaisar for the patch! Looks like this is resolved. If this was in error, please feel f...
[21:05:35] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10BCornwall)
[21:06:07] <urandom>	 !log updating image_suggestions default table TTL(s) from 1209600 to 1814400 (seconds) — T333319
[21:06:11] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:06:12] <stashbot>	 T333319: Increase TTL in Cassandra image_suggestions keyspace to 3 weeks - https://phabricator.wikimedia.org/T333319
[21:07:07] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic, 10Patch-For-Review: Create and deploy per-CDN-site DNS domains - https://phabricator.wikimedia.org/T332025 (10BCornwall) a:05BCornwall→03JameelKaisar
[21:07:59] <wikibugs>	 10SRE, 10Infrastructure-Foundations, 10Traffic: GeoIP mapping experiments - https://phabricator.wikimedia.org/T332024 (10BCornwall) Hi, @CDanis. Thanks for creating this ticket. Would you mind expanding on the nature of the report? Thanks!
[21:10:03] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Enable history page visual diffs on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903780 (https://phabricator.wikimedia.org/T314588)
[21:10:05] <wikibugs>	 (03PS1) 10Bartosz Dziewoński: Clean up history page visual diffs beta feature config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903781
[21:13:22] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED
[21:15:07] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: remove envoy deprecated options spamfilter [puppet] - 10https://gerrit.wikimedia.org/r/902625 (https://phabricator.wikimedia.org/T320468) (owner: 10Cwhite)
[21:16:18] <wikibugs>	 (03CR) 10Cwhite: [C: 03+1] "Looks good, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/903265 (owner: 10Slyngshede)
[21:20:04] <icinga-wm>	 RECOVERY - Check systemd state on idm2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:22:55] <logmsgbot>	 !log ebernhardson@deploy2002 Started deploy [airflow-dags/search@9b31c6b]: correct mw_sql_to_hive.py cli arguments
[21:23:09] <logmsgbot>	 !log ebernhardson@deploy2002 Finished deploy [airflow-dags/search@9b31c6b]: correct mw_sql_to_hive.py cli arguments (duration: 00m 13s)
[21:25:48] <icinga-wm>	 PROBLEM - Check systemd state on idm2001 is CRITICAL: CRITICAL - degraded: The following units failed: wmf_auto_restart_apache2-htcacheclean.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:46:05] <wikibugs>	 (03CR) 10Subramanya Sastry: [C: 03+1] Enabled native gallery editing in Parsoid [mediawiki-config] - 10https://gerrit.wikimedia.org/r/889257 (https://phabricator.wikimedia.org/T329662) (owner: 10Arlolra)
[22:15:49] <wikibugs>	 10SRE, 10SRE-swift-storage, 10ops-codfw, 10DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (10wiki_willy) Hi @MatthewVernon - for additional context, in the past we've seen drive failure issues being resolved after upgrading the firmware.  Sometimes, old firmware causes is...
[22:17:48] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED
[22:23:25] <wikibugs>	 10SRE-Sprint-Week-Sustainability-March2023, 10MediaWiki-General, 10Observability-Logging, 10Wikimedia-Logstash, and 2 others: MediaWiki log spam during row D blip / rack D2 unavailable - https://phabricator.wikimedia.org/T233739 (10lmata) Adding back #observability-logging which is a component tag within #...
[22:29:30] <wikibugs>	 (03CR) 10Cwhite: [C: 03+2] logstash: add grafana-server ECS filters and tests [puppet] - 10https://gerrit.wikimedia.org/r/901642 (https://phabricator.wikimedia.org/T234565) (owner: 10Cwhite)
[22:32:06] <icinga-wm>	 PROBLEM - Persistent high iowait on clouddumps1001 is CRITICAL: 56.91 ge 10 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[22:33:28] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.dns.netbox
[22:33:41] <wikibugs>	 (03CR) 10EoghanGaffney: [C: 03+1] aphlict: pass ensure flags to logrotate timer [puppet] - 10https://gerrit.wikimedia.org/r/903693 (https://phabricator.wikimedia.org/T332869) (owner: 10Jelto)
[22:36:24] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.puppet.sync-netbox-hiera generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for gerrit1003 - pt1979@cumin2002"
[22:39:44] <icinga-wm>	 RECOVERY - Persistent high iowait on clouddumps1001 is OK: (C)10 ge (W)5 ge 2.656 https://wikitech.wikimedia.org/wiki/Portal:Data_Services/Admin/Labstore https://grafana.wikimedia.org/d/000000568/labstore1004-1005
[22:42:26] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.puppet.sync-netbox-hiera (exit_code=0) generate netbox hiera data: "Triggered by cookbooks.sre.dns.netbox: Add DNS for gerrit1003 - pt1979@cumin2002"
[22:42:26] <logmsgbot>	 !log pt1979@cumin2002 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[22:43:39] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host db1207.mgmt.eqiad.wmnet with reboot policy FORCED
[22:43:44] <wikibugs>	 10SRE, 10Observability-Logging, 10Wikimedia-Logstash: Logstash SLO excursion on 2023-02-11 - https://phabricator.wikimedia.org/T331461 (10lmata) Hi @RLazarus   Apologies for the radio silence, I'm now circling back to this, and as I review, I have one or two questions :D.  Will we file a retroactive incident...
[22:44:07] <logmsgbot>	 !log pt1979@cumin2002 START - Cookbook sre.hosts.provision for host gerrit1003.mgmt.eqiad.wmnet with reboot policy FORCED
[22:48:42] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul) While running the provision cookbook on 2 of the db nodes (db1206 and db1207) and gerrit1003 i am getting the error . ` Raised while handling: The `choices` argument is empty and...
[22:51:02] <logmsgbot>	 !log pt1979@cumin2002 END (FAIL) - Cookbook sre.hosts.provision (exit_code=99) for host gerrit1003.mgmt.eqiad.wmnet with reboot policy FORCED
[22:53:40] <wikibugs>	 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: Q3:rack/setup/install db1207-db1225 - https://phabricator.wikimedia.org/T326661 (10Papaul)
[22:57:58] <wikibugs>	 (03CR) 10Dzahn: [C: 03+1] "I also think the risk is low (since the discovery name is used that just points to the current host). If you wanted to be even more carefu" [puppet] - 10https://gerrit.wikimedia.org/r/903641 (https://phabricator.wikimedia.org/T322369) (owner: 10EoghanGaffney)
[22:59:17] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] "confirmed noop on production deploy servers and on deploy-1004.devtools there is now docker installed." [puppet] - 10https://gerrit.wikimedia.org/r/903605 (https://phabricator.wikimedia.org/T329622) (owner: 10Jaime Nuche)
[23:00:09] <zabe>	 jouncebot: nowandnext
[23:00:09] <jouncebot>	 No deployments scheduled for the next 6 hour(s) and 59 minute(s)
[23:00:09] <jouncebot>	 In 6 hour(s) and 59 minute(s): MediaWiki infrastructure (UTC early) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230329T0600)
[23:10:08] <icinga-wm>	 RECOVERY - Backup freshness on backup1001 is OK: Fresh: 125 jobs https://wikitech.wikimedia.org/wiki/Bacula%23Monitoring
[23:15:08] <wikibugs>	 (03PS1) 10Dzahn: alertmanager: delete unused serviceops-collab receivers [puppet] - 10https://gerrit.wikimedia.org/r/903792 (https://phabricator.wikimedia.org/T329587)
[23:19:16] <wikibugs>	 (03PS1) 10Zabe: Initial configuration for ckbwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903794 (https://phabricator.wikimedia.org/T331831)
[23:20:08] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Initial configuration for ckbwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903794 (https://phabricator.wikimedia.org/T331831) (owner: 10Zabe)
[23:24:03] <wikibugs>	 (03PS1) 10Dzahn: alertmanager: create receiver for both sre-collab and releng combined [puppet] - 10https://gerrit.wikimedia.org/r/903796 (https://phabricator.wikimedia.org/T329587)
[23:24:24] <wikibugs>	 (03PS2) 10Zabe: Initial configuration for ckbwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903794 (https://phabricator.wikimedia.org/T331831)
[23:24:30] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Initial configuration for ckbwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903794 (https://phabricator.wikimedia.org/T331831) (owner: 10Zabe)
[23:25:18] <wikibugs>	 (03Merged) 10jenkins-bot: Initial configuration for ckbwiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903794 (https://phabricator.wikimedia.org/T331831) (owner: 10Zabe)
[23:25:30] <wikibugs>	 (03CR) 10Dzahn: releases-jenkins: replace Icinga with Prometheus monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[23:25:48] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] releases-jenkins: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/902788 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[23:27:06] <zabe>	 !log central Kurdish Wiktionary (ckbwiktionary)
[23:27:09] <stashbot>	 Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:28] <wikibugs>	 (03CR) 10Dzahn: [C: 03+2] alertmanager: delete unused serviceops-collab receivers [puppet] - 10https://gerrit.wikimedia.org/r/903792 (https://phabricator.wikimedia.org/T329587) (owner: 10Dzahn)
[23:27:53] <logmsgbot>	 !log zabe@deploy2002 Started scap: T331831
[23:28:01] <stashbot>	 T331831: Create Central Kurdish Wiktionary - https://phabricator.wikimedia.org/T331831
[23:29:04] <Jhs>	 zabe, is ckbwiktionary happening now?
[23:29:33] <wikibugs>	 (03PS1) 10Zabe: Add ckbwiktionary to rtl dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903798 (https://phabricator.wikimedia.org/T331831)
[23:29:45] <zabe>	 Jhs: yes
[23:29:53] <Jhs>	 niice
[23:30:15] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] Add ckbwiktionary to rtl dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903798 (https://phabricator.wikimedia.org/T331831) (owner: 10Zabe)
[23:34:55] <logmsgbot>	 !log zabe@deploy2002 Finished scap: T331831 (duration: 07m 01s)
[23:34:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (POST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:35:01] <stashbot>	 T331831: Create Central Kurdish Wiktionary - https://phabricator.wikimedia.org/T331831
[23:36:01] <wikibugs>	 (03PS2) 10Zabe: Add ckbwiktionary to rtl dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903798 (https://phabricator.wikimedia.org/T331831)
[23:36:12] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903798 (https://phabricator.wikimedia.org/T331831) (owner: 10Zabe)
[23:37:16] <wikibugs>	 (03Merged) 10jenkins-bot: Add ckbwiktionary to rtl dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903798 (https://phabricator.wikimedia.org/T331831) (owner: 10Zabe)
[23:38:14] <wikibugs>	 (03PS1) 10Zabe: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903714
[23:38:16] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903714 (owner: 10Zabe)
[23:39:00] <wikibugs>	 (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903714 (owner: 10Zabe)
[23:39:29] <logmsgbot>	 !log zabe@deploy2002 Started scap: T331831
[23:39:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: High Kubernetes API latency (POST secrets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:44:13] <Jhs>	 zabe, would you be able to fix T332380 for anpwiki (and preferably ckbwiktionary too if possible) btw? The lack of RESTBase is causing a lot of problems
[23:44:14] <stashbot>	 T332380: Add anpwiki to RESTBase - https://phabricator.wikimedia.org/T332380
[23:45:03] <zabe>	 I can write the necessary patch, but I don't have the necessary powers to deploy it
[23:45:46] <icinga-wm>	 PROBLEM - PyBal IPVS diff check on lvs1020 is CRITICAL: CRITICAL: Hosts known to PyBal but not to IPVS: set([mw1351.eqiad.wmnet]) https://wikitech.wikimedia.org/wiki/PyBal
[23:45:53] <Jhs>	 could you ping whoever does?
[23:46:20] <logmsgbot>	 !log zabe@deploy2002 Finished scap: T331831 (duration: 06m 50s)
[23:46:26] <stashbot>	 T331831: Create Central Kurdish Wiktionary - https://phabricator.wikimedia.org/T331831
[23:46:58] <jinxer-wm>	 (KubernetesAPILatency) firing: High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:48:36] <wikibugs>	 (03PS1) 10Dzahn: noc: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/903801 (https://phabricator.wikimedia.org/T331901)
[23:48:57] <wikibugs>	 (03CR) 10CI reject: [V: 04-1] noc: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/903801 (https://phabricator.wikimedia.org/T331901) (owner: 10Dzahn)
[23:49:50] <zabe>	 I added hno_wlan to the patch, they are usually quite fast at getting those deployed
[23:51:14] <Jhs>	 zabe, great, thanks
[23:51:25] <wikibugs>	 (03PS2) 10Dzahn: noc: replace Icinga with Prometheus monitoring [puppet] - 10https://gerrit.wikimedia.org/r/903801 (https://phabricator.wikimedia.org/T331901)
[23:51:40] <icinga-wm>	 RECOVERY - PyBal IPVS diff check on lvs1020 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal
[23:51:58] <jinxer-wm>	 (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[23:57:36] <wikibugs>	 10SRE, 10ops-codfw, 10Traffic: cp2035 IPMI and management console issues - https://phabricator.wikimedia.org/T333312 (10Jhancock.wm) 05Open→03Resolved a:03Jhancock.wm confirmed with Sukhe that it was depooled. worked remotely with Papaul to update the idrac and the bios.
[23:57:44] <wikibugs>	 (03PS1) 10Zabe: throttle: Remove expired throttle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903803
[23:58:14] <wikibugs>	 (03CR) 10Zabe: [C: 03+2] throttle: Remove expired throttle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903803 (owner: 10Zabe)
[23:58:52] <wikibugs>	 (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy2002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903803 (owner: 10Zabe)
[23:58:58] <wikibugs>	 (03Merged) 10jenkins-bot: throttle: Remove expired throttle [mediawiki-config] - 10https://gerrit.wikimedia.org/r/903803 (owner: 10Zabe)
[23:59:23] <logmsgbot>	 !log zabe@deploy2002 Started scap: Backport for [[gerrit:903803|throttle: Remove expired throttle]]