[00:05:20] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:07:18] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:09:04] PROBLEM - Postgres Replication Lag on maps2006 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 1793243456 and 13853 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:09:14] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 2804687016 and 13862 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:09:54] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 13014235664 and 13901 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:10:18] PROBLEM - Postgres Replication Lag on maps2007 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 13015911488 and 13925 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:16:20] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.226 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:16:42] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49419 bytes in 0.068 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [00:22:02] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 20351120080 and 14630 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:34:08] PROBLEM - OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7f2025b50280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [00:34:08] org/wiki/Search%23Administration [00:34:14] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:35:44] RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 667, active_shards: 1509, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [00:35:44] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [00:35:50] RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:40:25] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10conftool, and 2 others: Scap deploy failed to depool codfw servers - 
https://phabricator.wikimedia.org/T327041 (10Papaul) @Joe will do [00:42:46] RECOVERY - Postgres Replication Lag on maps2006 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1200 and 928 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [00:45:30] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [00:50:54] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 265384 and 1416 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:02:56] (03CR) 10Ssingh: "recheck" [debs/cadvisor] - 10https://gerrit.wikimedia.org/r/880530 (https://phabricator.wikimedia.org/T325557) (owner: 10Ssingh) [01:06:18] RECOVERY - Postgres Replication Lag on maps2007 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 2341 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:12:16] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 2697 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:14:28] (03PS1) 10Krinkle: mediawiki: Add auto_prepend_file to PHP config_cli [puppet] - 10https://gerrit.wikimedia.org/r/880561 (https://phabricator.wikimedia.org/T253547) [01:14:48] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 0 and 88 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:15:09] (03PS2) 10Krinkle: mediawiki: Add auto_prepend_file to PHP config_cli [puppet] - 10https://gerrit.wikimedia.org/r/880561 (https://phabricator.wikimedia.org/T253547) [01:16:54] (03CR) 10CI reject: [V: 04-1] mediawiki: Add auto_prepend_file to PHP config_cli [puppet] - 10https://gerrit.wikimedia.org/r/880561 (https://phabricator.wikimedia.org/T253547) (owner: 10Krinkle) [01:20:32] (03PS3) 10Krinkle: mediawiki: Add auto_prepend_file to PHP config_cli [puppet] - 10https://gerrit.wikimedia.org/r/880561 (https://phabricator.wikimedia.org/T253547) [01:41:54] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 15465799712 and 1714 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:41:54] PROBLEM - Postgres Replication Lag on maps2010 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 11032027200 and 1714 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:54:42] RECOVERY - Postgres Replication Lag on maps2010 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 2043176 and 2481 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:56:18] PROBLEM - Postgres Replication Lag on maps2008 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 42532328 and 2577 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [01:59:32] RECOVERY - Postgres Replication Lag on maps2008 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1705720 and 2771 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [02:04:23] (03CR) 10Gergő Tisza: Enable the topic match mode in all wikis (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/830202 (https://phabricator.wikimedia.org/T305408) (owner: 
10Sergio Gimeno) [02:07:46] (JobUnavailable) firing: (5) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:12:46] (JobUnavailable) firing: (12) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:26:46] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [02:27:46] (JobUnavailable) firing: (14) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:37:46] (JobUnavailable) firing: (14) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:47:46] (JobUnavailable) firing: (14) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:52:46] (JobUnavailable) firing: (14) Reduced availability for job gitaly in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:58:34] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [03:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230116T0800) [03:00:05] Deploy window Automatic branching of MediaWiki, extensions, skins, and vendor – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230117T0300) [03:07:57] (03PS1) 10TrainBranchBot: Branch commit for wmf/1.40.0-wmf.19 [core] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880511 (https://phabricator.wikimedia.org/T325582) [03:08:01] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/1.40.0-wmf.19 [core] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880511 (https://phabricator.wikimedia.org/T325582) (owner: 10TrainBranchBot) [03:23:51] (03Merged) 10jenkins-bot: Branch commit for wmf/1.40.0-wmf.19 [core] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880511 (https://phabricator.wikimedia.org/T325582) (owner: 10TrainBranchBot) [03:28:24] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 201 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:29:58] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:36:59] (03PS1) 10Andrew Bogott: Horizon: put into maintenance mode for Zed upgrade [puppet] - 10https://gerrit.wikimedia.org/r/880564 (https://phabricator.wikimedia.org/T323086) [03:40:17] (03PS1) 10Andrew Bogott: Move eqiad1 OpenStack control plane to version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/880565 (https://phabricator.wikimedia.org/T323086) [03:40:19] (03PS1) 10Andrew Bogott: Revert "Horizon: put into maintenance mode for Zed upgrade" [puppet] - 10https://gerrit.wikimedia.org/r/880566 [03:40:21] (03PS1) 10Andrew Bogott: Move cloud-vps client manifests to OpenStack verison 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/880567 (https://phabricator.wikimedia.org/T323086) [03:42:40] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 185 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:45:52] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 30 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:56:54] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 5377872544 and 9813 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:00:05] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230116T0800) [04:00:05] Deploy window Automatic deployment of of MediaWiki, extensions, skins, and vendor to testwikis only – see Heterogeneous_deployment/Train_deploys (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230117T0400) [04:02:53] (03PS1) 10HMonroy: Enable Phonos on afwiktionary and arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880568 (https://phabricator.wikimedia.org/T324561) [04:03:30] (03CR) 10CI reject: [V: 04-1] Enable Phonos on afwiktionary and arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880568 (https://phabricator.wikimedia.org/T324561) (owner: 10HMonroy) [04:04:54] PROBLEM - Postgres Replication Lag on maps2005 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 276889600 and 21 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:05:24] PROBLEM - Check systemd state on deploy1002 is CRITICAL: CRITICAL - degraded: The following units failed: train-presync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:08:18] (03PS2) 10HMonroy: Enable Phonos on afwiktionary and arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880568 (https://phabricator.wikimedia.org/T324561) [04:09:42] RECOVERY - Postgres Replication Lag on maps2005 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 1816 and 131 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring [04:10:18] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:23:06] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:45:30] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [05:04:37] (03CR) 10Samwilson: [C: 03+1] Enable Phonos on afwiktionary and arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880568 (https://phabricator.wikimedia.org/T324561) (owner: 10HMonroy) [05:10:36] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 142, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:10:54] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 226, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:25:44] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers racked in U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10Papaul) @ayounsi since A1 and A8 are supposed to be our network racks I will prefer possible to put one spine in A1 and the other s... 
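
The POSTGRES_HOT_STANDBY_DELAY alerts above report two numbers per maps standby: a replication byte lag and the age in seconds of the last replayed transaction (e.g. "276889600 and 21 seconds"). Below is a minimal sketch of how such a probe can be computed on a PostgreSQL 10+ hot standby; it is not the Icinga plugin used here, and the DSN and thresholds are illustrative assumptions.

```python
# Minimal sketch of a hot-standby delay probe (NOT the production Icinga plugin).
# Assumes PostgreSQL >= 10 and the psycopg2 driver; DSN and thresholds are examples.
import sys
import psycopg2

WARN_BYTES, CRIT_BYTES = 1024**3, 10 * 1024**3   # example thresholds (bytes)
CRIT_SECONDS = 3600                               # example threshold (seconds)

def check(dsn: str = "dbname=template1 host=localhost") -> int:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # Byte lag between WAL received and WAL replayed, plus replay age in seconds.
        cur.execute("""
            SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                                   pg_last_wal_replay_lsn())          AS byte_lag,
                   EXTRACT(EPOCH FROM now()
                           - pg_last_xact_replay_timestamp())          AS seconds_lag
        """)
        byte_lag, seconds_lag = cur.fetchone()
    byte_lag, seconds_lag = byte_lag or 0, seconds_lag or 0
    status = "OK"
    if byte_lag >= CRIT_BYTES and seconds_lag >= CRIT_SECONDS:
        status = "CRITICAL"
    elif byte_lag >= WARN_BYTES:
        status = "WARNING"
    print(f"POSTGRES_HOT_STANDBY_DELAY {status}: {int(byte_lag)} and {int(seconds_lag)} seconds")
    return 0 if status == "OK" else 2

if __name__ == "__main__":
    sys.exit(check())
```
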
[05:36:24] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [05:49:06] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:06:38] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on 27 hosts with reason: Primary switchover s6 T326134 [06:06:53] T326134: Switchover s6 master (db1173 -> db1131) - https://phabricator.wikimedia.org/T326134 [06:06:57] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on 27 hosts with reason: Primary switchover s6 T326134 [06:07:11] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set db1131 with weight 0 T326134', diff saved to https://phabricator.wikimedia.org/P43160 and previous config saved to /var/cache/conftool/dbconfig/20230117-060710-ladsgroup.json [06:07:14] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:07:24] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:12:06] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:14:40] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_ring_manager.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:18:15] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1180 maint', diff saved to https://phabricator.wikimedia.org/P43161 and previous config saved to /var/cache/conftool/dbconfig/20230117-061815-ladsgroup.json [06:18:16] ladsgroup@cumin1001: Failed to log message to wiki. Somebody should check the error logs. [06:24:46] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Mon 20 Feb 2023 05:31:14 AM GMT +0000. 
https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:24:52] <_joe_> here [06:25:00] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 6.096 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:25:05] ugh, that's me [06:25:10] db1198 [06:25:17] that's the ack being expired [06:25:36] <_joe_> ok lemme re-ack it [06:25:40] it should have been resolved once we got back [06:26:19] I think it should be resolved, we have a ticket for the memory issue [06:26:22] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49420 bytes in 0.113 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [06:26:48] (03PS1) 10KartikMistry: "testwiki: Use Parsoid in Mediawiki Core for Content Translation"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879998 (https://phabricator.wikimedia.org/T323667) [06:27:13] I'm going to resolve it [06:28:02] (03PS2) 10KartikMistry: testwiki: Use Parsoid in Mediawiki Core for Content Translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879998 (https://phabricator.wikimedia.org/T323667) [06:32:17] _joe_: the host has been accessible for most of the day, I wonder why it didn't get automatically resolved :/ [06:45:38] (03PS2) 10Ladsgroup: mariadb: Promote db1131 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/874828 (https://phabricator.wikimedia.org/T326134) (owner: 10Gerrit maintenance bot) [06:45:43] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] mariadb: Promote db1131 to s6 master [puppet] - 10https://gerrit.wikimedia.org/r/874828 (https://phabricator.wikimedia.org/T326134) (owner: 10Gerrit maintenance bot) [06:52:46] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:55:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:00:19] !log Starting s6 eqiad failover from db1173 to db1131 - T326134 [07:00:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:00:35] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Set s6 eqiad as read-only for maintenance - T326134', diff saved to https://phabricator.wikimedia.org/P43162 and previous config saved to /var/cache/conftool/dbconfig/20230117-070035-ladsgroup.json [07:01:02] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'Promote db1131 to s6 primary and set section read-write T326134', diff saved to https://phabricator.wikimedia.org/P43163 and previous config saved to /var/cache/conftool/dbconfig/20230117-070102-ladsgroup.json [07:01:23] T326134: Switchover s6 master (db1173 -> db1131) - https://phabricator.wikimedia.org/T326134 [07:03:29] (03PS2) 10Ladsgroup: wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/874829 (https://phabricator.wikimedia.org/T326134) (owner: 10Gerrit maintenance bot) [07:04:00] (03CR) 10Ladsgroup: [V: 03+2 C: 03+2] wmnet: Update s6-master alias [dns] - 10https://gerrit.wikimedia.org/r/874829 (https://phabricator.wikimedia.org/T326134) (owner: 10Gerrit maintenance bot) [07:05:32] !log 
ladsgroup@cumin1001 dbctl commit (dc=all): 'Depool db1173 T326134', diff saved to https://phabricator.wikimedia.org/P43164 and previous config saved to /var/cache/conftool/dbconfig/20230117-070532-ladsgroup.json [07:07:07] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P43165 and previous config saved to /var/cache/conftool/dbconfig/20230117-070707-ladsgroup.json [07:10:50] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [07:11:15] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [07:16:10] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance [07:16:12] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 12:00:00 on db1173.eqiad.wmnet with reason: Maintenance [07:22:12] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P43166 and previous config saved to /var/cache/conftool/dbconfig/20230117-072212-ladsgroup.json [07:23:10] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:26:50] 10SRE, 10ops-codfw, 10Infrastructure-Foundations, 10netops: codfw: Relocate servers racked in U27 in all racks in rowA and rowB - https://phabricator.wikimedia.org/T326564 (10ayounsi) I see, what would be the best later on for the rows C and D spines? C1/C8 or C1/`D1` ? Is using A1/A8 better for eqiad as w... [07:37:17] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P43167 and previous config saved to /var/cache/conftool/dbconfig/20230117-073717-ladsgroup.json [07:50:01] (CirrusSearchHighOldGCFrequency) resolved: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [07:52:22] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1180 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P43168 and previous config saved to /var/cache/conftool/dbconfig/20230117-075222-ladsgroup.json [07:57:20] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:59:14] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 143, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [07:59:24] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 227, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:00:05] Amir1 and Urbanecm: How many deployers does it take to do UTC morning backport window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230117T0800). 
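
The SAL entries above show the tail end of the s6 switchover: after the promotion, the previously depooled replica db1180 is brought back gradually ("(re)pooling @ 10% ... 25% ... 75% ... 100%"), committing the config at each step. Below is a sketch of that ramp; the dbctl command line inside `pool_at()` is an assumption reconstructed from these log lines rather than a verified invocation, so check `dbctl --help` before relying on it.

```python
# Sketch of the staged repool ramp seen in the SAL entries above.
# The dbctl arguments are ASSUMED from the log, not verified.
import subprocess
import time

RAMP = [10, 25, 75, 100]     # pooled percentage at each step, as in the log
STEP_DELAY = 15 * 60         # the log shows roughly 15 minutes between steps

def pool_at(host: str, pct: int, reason: str) -> None:
    # Assumed syntax: set the pooled percentage, then commit the config change.
    subprocess.run(["dbctl", "instance", host, "pool", "-p", str(pct)], check=True)
    subprocess.run(["dbctl", "config", "commit",
                    "-m", f"{host} (re)pooling @ {pct}%: {reason}"], check=True)

def ramp_up(host: str, reason: str = "Maint over") -> None:
    for pct in RAMP:
        pool_at(host, pct, reason)
        if pct != RAMP[-1]:
            time.sleep(STEP_DELAY)

if __name__ == "__main__":
    ramp_up("db1180")
```
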
[08:00:05] Dreamy_Jazz and kart_: A patch you scheduled for UTC morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [08:00:12] * kart_ is here [08:01:23] (03CR) 10Phedenskog: [C: 03+1] Remove former EventLogging streams for navtiming [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879926 (https://phabricator.wikimedia.org/T281103) (owner: 10Krinkle) [08:02:51] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/880496 (owner: 10Filippo Giunchedi) [08:03:04] Hello [08:05:27] kart_: you can self serve I assume [08:06:17] Amir1: yes. [08:06:18] once done, we probably can switch to Dreamy's patch [08:06:21] ping me [08:06:23] OK! [08:06:51] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by kartik@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879998 (https://phabricator.wikimedia.org/T323667) (owner: 10KartikMistry) [08:07:37] (03Merged) 10jenkins-bot: testwiki: Use Parsoid in Mediawiki Core for Content Translation [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879998 (https://phabricator.wikimedia.org/T323667) (owner: 10KartikMistry) [08:08:49] !log kartik@deploy1002 Started scap: Backport for [[gerrit:879998|testwiki: Use Parsoid in Mediawiki Core for Content Translation (T323667)]] [08:09:24] T323667: Use Parsoid in Mediawiki Core for Content Translation - https://phabricator.wikimedia.org/T323667 [08:13:35] !log kartik@deploy1002 kartik and kartik: Backport for [[gerrit:879998|testwiki: Use Parsoid in Mediawiki Core for Content Translation (T323667)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [08:18:06] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:19:48] Still testing.. [08:26:31] !log zabe@mwmaint1002:~$ mwscript extensions/Flow/maintenance/FlowFixInconsistentBoards.php --wiki=zhwiki --namespaceName='USER_TALK' # T327146 [08:26:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:35] T327146: Structured Discussions workflow is not associated with this page - https://phabricator.wikimedia.org/T327146 [08:29:14] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:29:45] !log kartik@deploy1002 Finished scap: Backport for [[gerrit:879998|testwiki: Use Parsoid in Mediawiki Core for Content Translation (T323667)]] (duration: 20m 56s) [08:29:49] T323667: Use Parsoid in Mediawiki Core for Content Translation - https://phabricator.wikimedia.org/T323667 [08:29:55] Amir1: done. [08:30:00] cool [08:30:05] Dreamy_Jazz: shall we go? [08:30:11] Sure [08:30:30] It's my first backport, so let me know if the patch needs changes. 
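
The recurring "Check systemd state" alerts (swift_dispersion_stats.service on thanos-fe1001 just above, train-presync.service earlier) distinguish a fully running system from a degraded one with failed units. A minimal sketch of that style of probe using plain systemctl follows; it is not the production check_systemd_state plugin, and the output format only approximates the alert text.

```python
# Minimal sketch in the spirit of the "Check systemd state" alerts above;
# not the production plugin.
import subprocess
import sys

def check_systemd() -> int:
    state = subprocess.run(["systemctl", "is-system-running"],
                           capture_output=True, text=True).stdout.strip()
    if state == "running":
        print("OK - running: The system is fully operational")
        return 0
    # List the failed units so the alert says which service broke.
    failed = subprocess.run(["systemctl", "--failed", "--no-legend", "--plain"],
                            capture_output=True, text=True).stdout
    units = [line.split()[0] for line in failed.splitlines() if line.strip()]
    print(f"CRITICAL - {state}: The following units failed: {', '.join(units)}")
    return 2

if __name__ == "__main__":
    sys.exit(check_systemd())
```
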
[08:31:04] (03PS6) 10Dreamy Jazz: Start writing to cul_reason_id and cul_reason_plaintext_id on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879652 (https://phabricator.wikimedia.org/T233004) [08:31:17] Rebased it as it said it had a merge conflict [08:32:38] (03CR) 10Ladsgroup: [C: 03+2] Start writing to cul_reason_id and cul_reason_plaintext_id on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879652 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [08:32:40] (03CR) 10Muehlenhoff: [C: 03+2] "Looks good, merging!" [puppet] - 10https://gerrit.wikimedia.org/r/880505 (owner: 10Zabe) [08:33:20] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ladsgroup@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879652 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [08:33:29] (03Merged) 10jenkins-bot: Start writing to cul_reason_id and cul_reason_plaintext_id on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879652 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [08:33:42] !log ladsgroup@deploy1002 Started scap: Backport for [[gerrit:879652|Start writing to cul_reason_id and cul_reason_plaintext_id on testwiki (T233004)]] [08:33:46] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [08:33:56] (03PS1) 10Muehlenhoff: Move ssh-key-ldap-lookup to profile::base::labs [puppet] - 10https://gerrit.wikimedia.org/r/880883 [08:33:58] (03PS1) 10Muehlenhoff: Remove ldap::client::utils [puppet] - 10https://gerrit.wikimedia.org/r/880884 [08:34:37] (03CR) 10CI reject: [V: 04-1] Move ssh-key-ldap-lookup to profile::base::labs [puppet] - 10https://gerrit.wikimedia.org/r/880883 (owner: 10Muehlenhoff) [08:35:22] (03PS2) 10Ayounsi: WIP: add rt_flow grokking [puppet] - 10https://gerrit.wikimedia.org/r/880500 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo Giunchedi) [08:35:35] !log ladsgroup@deploy1002 ladsgroup and dreamyjazz: Backport for [[gerrit:879652|Start writing to cul_reason_id and cul_reason_plaintext_id on testwiki (T233004)]] synced to the testservers: mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet [08:35:49] Dreamy_Jazz: it's in mwdebug now, can you test it? [08:36:05] probably doing a CU on a test account [08:36:19] I don't have perms on test wiki to do a CU [08:36:31] (03PS2) 10Muehlenhoff: Move ssh-key-ldap-lookup to profile::base::labs [puppet] - 10https://gerrit.wikimedia.org/r/880883 [08:37:07] (03PS20) 10Elukey: sre.k8s.pool-depool-cluster: handle active/passive services [cookbooks] - 10https://gerrit.wikimedia.org/r/869771 (https://phabricator.wikimedia.org/T277677) [08:37:10] If I was granted them again I could test it's working [08:37:18] (03CR) 10CI reject: [V: 04-1] WIP: add rt_flow grokking [puppet] - 10https://gerrit.wikimedia.org/r/880500 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo Giunchedi) [08:37:30] hmm, I don't have it either [08:37:31] Was granted them for testing a security bug [08:37:43] I can test [08:37:54] Thanks [08:40:23] (03CR) 10Ayounsi: "I got it very close to working, hopefully you can figure out the last piece of the puzzle: why the "juniper" stuff doesn't get parsed?" 
[puppet] - 10https://gerrit.wikimedia.org/r/880500 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo Giunchedi) [08:40:26] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:40:39] (03PS16) 10Elukey: Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) [08:40:58] Dreamy_Jazz, Amir1 https://phabricator.wikimedia.org/P43169 [08:41:27] cool [08:41:35] Thank! [08:42:00] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:42:00] moving forward [08:42:02] thanks [08:42:04] That looks to be working as expected [08:42:58] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [08:43:04] and replication hasn't broken in s3 [08:43:21] sigh, has someone again started scarping mailman? [08:45:30] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [08:45:39] (03CR) 10Jelto: [C: 03+1] aptrepo: update Grafana url and key [puppet] - 10https://gerrit.wikimedia.org/r/880496 (owner: 10Filippo Giunchedi) [08:47:23] (03CR) 10Elukey: [C: 03+2] Upgrade knative to 1.7.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/861395 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey) [08:47:32] !log ladsgroup@deploy1002 Finished scap: Backport for [[gerrit:879652|Start writing to cul_reason_id and cul_reason_plaintext_id on testwiki (T233004)]] (duration: 13m 50s) [08:47:36] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [08:47:53] we are done [08:48:06] Nice. Thanks both. [08:49:19] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [08:49:21] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [08:54:47] (03CR) 10Muehlenhoff: peopleweb: ensure rsync service is stopped on passive host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879878 (https://phabricator.wikimedia.org/T326888) (owner: 10Dzahn) [09:00:04] jnuche and jeena: It is that lovely time of the day again! You are hereby commanded to deploy MediaWiki train - Utc-0+Utc-7 Version. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230117T0900). 
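
During the backport above, the change is first synced only to the mwdebug testservers and the deployer asks for it to be tested there before the full sync. A small sketch of that kind of spot-check is below; pinning a request to a debug backend via the X-Wikimedia-Debug header follows the Wikitech debugging docs, and the exact header value format and the URL chosen here are assumptions, not taken from this log.

```python
# Sketch of spot-checking a staged backport on an mwdebug host before the
# full sync.  Header value format and target URL are assumptions.
import requests

MWDEBUG = "mwdebug1001.eqiad.wmnet"   # one of the testservers named above

def spot_check(url: str = "https://test.wikipedia.org/wiki/Main_Page") -> None:
    resp = requests.get(url,
                        headers={"X-Wikimedia-Debug": f"backend={MWDEBUG}"},
                        timeout=10)
    # If the staged change broke page rendering, fail loudly on a 5xx here.
    resp.raise_for_status()
    print(f"{resp.status_code} OK via {MWDEBUG}, {len(resp.text)} bytes")

if __name__ == "__main__":
    spot_check()
```
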
[09:00:47] (03CR) 10Filippo Giunchedi: [C: 03+2] aptrepo: update Grafana url and key [puppet] - 10https://gerrit.wikimedia.org/r/880496 (owner: 10Filippo Giunchedi) [09:05:59] (03PS5) 10Dreamy Jazz: Start writing to cul_reason[_plaintext]_id on group0 and group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879653 (https://phabricator.wikimedia.org/T233004) [09:11:33] (03PS1) 10Ayounsi: Add PTR resolution to firewall logs [puppet] - 10https://gerrit.wikimedia.org/r/880889 (https://phabricator.wikimedia.org/T327095) [09:13:20] (03CR) 10CI reject: [V: 04-1] Add PTR resolution to firewall logs [puppet] - 10https://gerrit.wikimedia.org/r/880889 (https://phabricator.wikimedia.org/T327095) (owner: 10Ayounsi) [09:15:44] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:24:58] (03PS1) 10TrainBranchBot: testwikis wikis to 1.40.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880891 (https://phabricator.wikimedia.org/T325582) [09:25:00] (03CR) 10TrainBranchBot: [C: 03+2] testwikis wikis to 1.40.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880891 (https://phabricator.wikimedia.org/T325582) (owner: 10TrainBranchBot) [09:25:37] (03Merged) 10jenkins-bot: testwikis wikis to 1.40.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880891 (https://phabricator.wikimedia.org/T325582) (owner: 10TrainBranchBot) [09:26:00] !log jnuche@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.19 refs T325582 [09:26:04] T325582: 1.40.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T325582 [09:26:51] !log jnuche@deploy1002 scap failed: PermissionError [Errno 13] Permission denied: '/home/jnuche/scap-image-build-and-push-log' (duration: 00m 50s) [09:26:54] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:28:53] !log jnuche@deploy1002 Started scap: testwikis wikis to 1.40.0-wmf.19 refs T325582 [09:42:17] !log aqu@deploy1002 Started deploy [airflow-dags/analytics_test@9568478]: (no justification provided) [09:42:29] !log aqu@deploy1002 Finished deploy [airflow-dags/analytics_test@9568478]: (no justification provided) (duration: 00m 12s) [09:42:34] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49421 bytes in 5.881 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:43:18] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 6.580 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [09:46:14] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:46:34] (03CR) 10Klausman: [V: 03+2 C: 03+2] knative: import new upstream version 1.7.2 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/861349 (https://phabricator.wikimedia.org/T323793) (owner: 10Elukey) [09:48:34] (03CR) 10Muehlenhoff: [C: 03+2] Add new bastions [puppet] - 10https://gerrit.wikimedia.org/r/880433 (https://phabricator.wikimedia.org/T324974) (owner: 10Muehlenhoff) [09:50:22] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy 
https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [09:56:16] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [09:56:34] (03CR) 10Clément Goubert: [V: 03+1 C: 03+2] P:logstash::production: mediawiki-php-fpm-slowlog [puppet] - 10https://gerrit.wikimedia.org/r/879417 (https://phabricator.wikimedia.org/T326794) (owner: 10Clément Goubert) [09:58:00] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [09:59:08] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:07:34] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:02] PROBLEM - Check systemd state on logstash1023 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:08:28] PROBLEM - OpenSearch health check for shards on 9200 on logstash1023 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fb16ae6d280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [10:08:28] org/wiki/Search%23Administration [10:08:36] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:10:26] PROBLEM - OpenSearch health check for shards on 9200 on logstash1024 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(urllib3.connection.HTTPConnection object at 0x7fa688a4a280: Failed to establish a new connection: [Errno 111] Connection refused)) https://wikitech.wi [10:10:26] org/wiki/Search%23Administration [10:10:30] PROBLEM - Check systemd state on logstash1024 is CRITICAL: CRITICAL - degraded: The following units failed: opensearch_2@production-elk7-eqiad.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:11:20] !log jnuche@deploy1002 Finished scap: testwikis wikis to 1.40.0-wmf.19 refs T325582 (duration: 42m 26s) [10:11:24] T325582: 1.40.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T325582 [10:12:44] !log jnuche@deploy1002 scap failed: average error rate on 9/9 canaries increased by 10x (rerun with --force to override this check, see https://logstash.wikimedia.org for details) [10:12:55] (LogstashNoLogsIndexed) firing: Logstash logs are not being indexed by Elasticsearch - https://wikitech.wikimedia.org/wiki/Logstash#No_logs_indexed - https://grafana.wikimedia.org/d/000000561/logstash?var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashNoLogsIndexed [10:13:52] mmhh not sure what happened yet with logstash there, cc jnuche as it might be 
related to the canaries check [10:14:32] PROBLEM - Check systemd state on mwmaint1002 is CRITICAL: CRITICAL - degraded: The following units failed: mediawiki_job_startupregistrystats-testwiki.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:16:39] !log restart opensearch_2@production-elk7-eqiad.service on logstash102[34] [10:16:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:16:42] (OOM) [10:16:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:17:42] RECOVERY - Check systemd state on logstash1023 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:18:08] RECOVERY - OpenSearch health check for shards on 9200 on logstash1023 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 660, active_shards: 1489, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [10:18:08] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:18:27] godog: do you think that it needs a bigger heap size or maybe something related to sudden pressure etc..? [10:18:30] RECOVERY - OpenSearch health check for shards on 9200 on logstash1024 is OK: OK - elasticsearch status production-elk7-eqiad: cluster_name: production-elk7-eqiad, status: green, timed_out: False, number_of_nodes: 16, number_of_data_nodes: 10, discovered_master: True, discovered_cluster_manager: True, active_primary_shards: 660, active_shards: 1489, relocating_shards: 0, initializing_shards: 0, unassigned_shards: 0, delayed_unassigned_shar [10:18:30] umber_of_pending_tasks: 0, number_of_in_flight_fetch: 0, task_max_waiting_in_queue_millis: 0, active_shards_percent_as_number: 100.0 https://wikitech.wikimedia.org/wiki/Search%23Administration [10:18:34] RECOVERY - Check systemd state on logstash1024 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:18:57] godog: the failed canaries checks happened right at the end during the branch cleaning, the previous canary chackes passed normally, no idea what happened there [10:19:01] elukey: don't know yet! [10:19:48] jnuche: ack, thank you, yeah not sure yet either what happened on the opensearch side [10:20:20] ack lemme know if you need a hand! 
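
The scap failure above ("average error rate on 9/9 canaries increased by 10x") comes from a deploy-time comparison of error volume before and after the sync, which in this case tripped because the OpenSearch backend on logstash1023:9200 was refusing connections rather than because of real errors. The sketch below only illustrates the general before/after ratio idea; the index pattern, field names, window and threshold are all assumptions, and the real logic lives in logstash_checker.py on the deployment server.

```python
# Illustrative before/after error-rate comparison; NOT logstash_checker.py.
# Index pattern, field names and threshold are assumptions.
import time
import requests

OPENSEARCH = "http://logstash1023.eqiad.wmnet:9200"   # host/port from the error above
INDEX = "logstash-*"                                   # assumed index pattern
THRESHOLD = 10.0                                       # a 10x increase fails the deploy

def error_count(host: str, start: int, end: int) -> int:
    query = {"query": {"bool": {"filter": [
        {"term": {"level": "ERROR"}},                  # assumed field/value
        {"term": {"host": host}},                      # assumed field
        {"range": {"@timestamp": {"gte": start, "lte": end,
                                  "format": "epoch_second"}}},
    ]}}}
    r = requests.get(f"{OPENSEARCH}/{INDEX}/_count", json=query, timeout=10)
    r.raise_for_status()
    return r.json()["count"]

def canary_ok(canary: str, deploy_ts: int, window: int = 600) -> bool:
    before = error_count(canary, deploy_ts - window, deploy_ts)
    after = error_count(canary, deploy_ts, deploy_ts + window)
    return after <= max(before, 1) * THRESHOLD

if __name__ == "__main__":
    # "some-canary.eqiad.wmnet" is a placeholder hostname, not a real canary.
    print(canary_ok("some-canary.eqiad.wmnet", int(time.time()) - 600))
```
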
[10:20:36] (03PS1) 10Muehlenhoff: cumin: Update docs for Debian package [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/880894 [10:20:54] thank you [10:21:55] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:22:10] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [10:22:53] (03CR) 10Volans: [C: 03+1] "LGTM, thanks!" [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/880894 (owner: 10Muehlenhoff) [10:22:55] (LogstashNoLogsIndexed) resolved: Logstash logs are not being indexed by Elasticsearch - https://wikitech.wikimedia.org/wiki/Logstash#No_logs_indexed - https://grafana.wikimedia.org/d/000000561/logstash?var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashNoLogsIndexed [10:23:55] godog: (just FYI, looked into the big scap error stack, they are connection failure errors) [10:25:03] jnuche: ok thank you, connection failure to which service/url ? [10:26:01] host='logstash1023.eqiad.wmnet', port=9200 [10:27:42] (03CR) 10Muehlenhoff: [C: 03+2] cumin: Update docs for Debian package [software/cumin] (debian) - 10https://gerrit.wikimedia.org/r/880894 (owner: 10Muehlenhoff) [10:30:16] jnuche: cheers, could you remind me what is the script/path that does those checks ? [10:32:44] godog: seems to tbe `/usr/local/bin/logstash_checker.py` [10:32:47] on the deployment server [10:34:56] cheers [10:35:08] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service,httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:36:09] (03PS4) 10Btullis: Add a third-party apt repo for ceph-quincy packages [puppet] - 10https://gerrit.wikimedia.org/r/880461 (https://phabricator.wikimedia.org/T326945) [10:41:44] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:43:22] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:47:07] 10SRE, 10MW-on-K8s, 10observability, 10serviceops, 10Patch-For-Review: New mediawiki.httpd.accesslog topic on kafka-logging + logstash and dashboard - https://phabricator.wikimedia.org/T324439 (10Clement_Goubert) There was a typo made when creating the topics (`mediawiki.http.accesslog` instead of `media... [10:48:37] (03PS1) 10Jcrespo: dbbackups: Setup dbprov1004, dbprov2004 as empty dbprov [puppet] - 10https://gerrit.wikimedia.org/r/880896 (https://phabricator.wikimedia.org/T327155) [10:48:55] godog: Think it's related to me merging https://gerrit.wikimedia.org/r/879417 ? 
(the logstash issues) [10:51:49] jnuche: testwiki main page seems to be erroring [10:51:56] (03CR) 10Jcrespo: [C: 03+2] "Looks good: https://puppet-compiler.wmflabs.org/output/880896/39146/" [puppet] - 10https://gerrit.wikimedia.org/r/880896 (https://phabricator.wikimedia.org/T327155) (owner: 10Jcrespo) [10:51:58] (03PS1) 10Elukey: kserve: upgrade to upstream version 0.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/880897 (https://phabricator.wikimedia.org/T325528) [10:52:14] (03CR) 10CI reject: [V: 04-1] kserve: upgrade to upstream version 0.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/880897 (https://phabricator.wikimedia.org/T325528) (owner: 10Elukey) [10:52:46] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:53:10] (03PS1) 10Hnowlan: thumbor: add and use haproxy healthz lvs check [puppet] - 10https://gerrit.wikimedia.org/r/880898 (https://phabricator.wikimedia.org/T233196) [10:53:14] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [10:53:35] claime: ah yeah totally, that could be it [10:53:40] taavi: my guess is it's related to this https://phabricator.wikimedia.org/T327158 [10:54:07] godog: In that case I'm very sorry :( [10:54:16] PROBLEM - Check systemd state on cumin2002 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_hourly_appserver.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:54:20] httpbb_hourly_appserver - is that related to the ongoing codfw app server issues, _joe_ ? [10:54:35] jynus: No it's wikidata serving 500s [10:54:44] I see [10:54:49] need help? [10:54:50] Jan 17 10:31:51 cumin1001 sh[2126400]: https://test.wikidata.org/wiki/Wikidata:Main_Page (/srv/deployment/httpbb-tests/appserver/test_main.yaml:124) [10:54:50] <_joe_> jynus: codfw is depooled though [10:54:52] Jan 17 10:31:51 cumin1001 sh[2126400]: Status code: expected 200, got 500. [10:54:54] Jan 17 10:31:51 cumin1001 sh[2126400]: Body: expected to contain 'test instance', got '\n !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host mc1048.eqiad.wmnet with OS bullseye [11:07:59] (03CR) 10Ilias Sarantopoulos: [C: 03+1] "LGTM!" 
[docker-images/production-images] - 10https://gerrit.wikimedia.org/r/880499 (https://phabricator.wikimedia.org/T325528) (owner: 10Elukey) [11:08:41] !log upgraded cumin on cumin2002 to 4.2.0-1+deb11u1 [11:08:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:21] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin2002 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [11:13:06] (03PS1) 10Zabe: Start writing to rev_comment_id everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880902 (https://phabricator.wikimedia.org/T299954) [11:13:08] (03PS1) 10Zabe: Stop writing to cul_user and cul_user_text everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880903 (https://phabricator.wikimedia.org/T233004) [11:13:10] (03PS1) 10Zabe: Start reading from cuc_actor everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880904 (https://phabricator.wikimedia.org/T233004) [11:13:12] (03PS1) 10Zabe: Start reading from cuc_comment_id on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880905 (https://phabricator.wikimedia.org/T233004) [11:16:03] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on mc1048.eqiad.wmnet with reason: host reimage [11:17:12] (03CR) 10Filippo Giunchedi: WIP: add rt_flow grokking (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/880500 (https://phabricator.wikimedia.org/T325806) (owner: 10Filippo Giunchedi) [11:18:47] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on mc1048.eqiad.wmnet with reason: host reimage [11:20:56] PROBLEM - ElasticSearch unassigned shard check - 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - eswiki_titlesuggest_1670642616[1](2023-01-14T08:19:31.469Z), shwiki_titlesuggest_1666850148[0](2023-01-14T08:19:31.478Z), enwiki_titlesuggest_1670642032[1](2023-01-14T08:19:31.476Z), cebwiki_titlesuggest_1670640561[0](2023-01-14T08:19:31.473Z) https://wikitech.wikimedia.org/wiki/Search%23Administration [11:22:08] !log jiji@maps2009 imposm-removebackup-import - T314472 [11:22:08] T314472: Re-import full planet data into eqiad and codfw - https://phabricator.wikimedia.org/T314472 [11:28:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [11:32:01] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host mc1048.eqiad.wmnet with OS bullseye [11:33:13] (03PS1) 10Muehlenhoff: Add new edge bastions to ssh-client-config (and add missing drmrs) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/880930 [11:35:54] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "The change is technically correct - remember it won't take effect until we restart pybal, and that's blocked on the resolution of T327001," [puppet] - 10https://gerrit.wikimedia.org/r/880898 (https://phabricator.wikimedia.org/T233196) (owner: 10Hnowlan) [11:37:01] (03CR) 10Volans: [C: 03+1] "LGTM" [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/880930 (owner: 10Muehlenhoff) [11:37:37] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: use new tegola swift container in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/880933 
(https://phabricator.wikimedia.org/T314472) [11:38:15] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add new edge bastions to ssh-client-config (and add missing drmrs) [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/880930 (owner: 10Muehlenhoff) [11:38:44] (03PS3) 10Dreamy Jazz: Write to cul_reason[_plaintext]_id everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879946 (https://phabricator.wikimedia.org/T233004) [11:43:28] (03PS1) 10Effie Mouzeli: tegola-vector-tiles: disable tile pregeneration in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/880934 (https://phabricator.wikimedia.org/T314472) [11:44:09] (03CR) 10Esanders: [C: 03+1] Enable visual enhancements on all talk namespaces [extensions/DiscussionTools] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/879103 (https://phabricator.wikimedia.org/T325417) (owner: 10Bartosz Dziewoński) [11:44:25] (03CR) 10Jgiannelos: [C: 03+1] tegola-vector-tiles: disable tile pregeneration in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/880934 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [11:45:02] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:11] (03CR) 10Jgiannelos: [C: 03+1] tegola-vector-tiles: use new tegola swift container in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/880933 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [11:46:30] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: use new tegola swift container in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/880933 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [11:49:17] (03PS1) 10Gmodena: Add mediawiki-stream-enrichmnet chart. [deployment-charts] - 10https://gerrit.wikimedia.org/r/880938 [11:51:51] (03Merged) 10jenkins-bot: tegola-vector-tiles: use new tegola swift container in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/880933 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [11:52:35] (03CR) 10Effie Mouzeli: [C: 03+2] tegola-vector-tiles: disable tile pregeneration in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/880934 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [11:53:02] (03PS1) 10Btullis: Rename ceph roles and profiles to cloudceph [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) [11:53:09] (03PS1) 10Muehlenhoff: Bump changelog [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/880940 [11:55:52] godog: logstash seems stable now and I need to rerun the train presync for an unrelated issue, is that ok ? 
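
The httpbb failures logged around 10:41-10:54 (from /srv/deployment/httpbb-tests/appserver/test_main.yaml) reduce to two assertions per URL: an expected status code and a substring the body must contain ("Status code: expected 200, got 500", "Body: expected to contain 'test instance'"). The sketch below reproduces just those two assertions against the public URL; the real tool runs them per appserver, which this simplification deliberately omits.

```python
# Minimal sketch of httpbb-style assertions (status code + body substring),
# checked against the public URL only; not the actual httpbb implementation.
import sys
import requests

TESTS = [
    # (url, expected_status, must_contain) -- mirrors the failing test above
    ("https://test.wikidata.org/wiki/Wikidata:Main_Page", 200, "test instance"),
]

def run() -> int:
    failures = 0
    for url, want_status, needle in TESTS:
        resp = requests.get(url, timeout=10)
        if resp.status_code != want_status:
            print(f"{url}\n  Status code: expected {want_status}, got {resp.status_code}")
            failures += 1
        elif needle not in resp.text:
            print(f"{url}\n  Body: expected to contain {needle!r}")
            failures += 1
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(run())
```
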
[11:57:56] (03Merged) 10jenkins-bot: tegola-vector-tiles: disable tile pregeneration in codfw [deployment-charts] - 10https://gerrit.wikimedia.org/r/880934 (https://phabricator.wikimedia.org/T314472) (owner: 10Effie Mouzeli) [12:03:26] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Bump changelog [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/880940 (owner: 10Muehlenhoff) [12:03:35] (03CR) 10Krinkle: [C: 03+1] eventlogging: Remove obsoleted navtiming schemas [puppet] - 10https://gerrit.wikimedia.org/r/726852 (https://phabricator.wikimedia.org/T281103) (owner: 10Phedenskog) [12:03:50] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:12:39] (03PS1) 10Muehlenhoff: Retain old hosts as well [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/880948 [12:17:23] (03CR) 10Muehlenhoff: [C: 03+1] "Looks great, nice work!" [debs/cadvisor] - 10https://gerrit.wikimedia.org/r/880530 (https://phabricator.wikimedia.org/T325557) (owner: 10Ssingh) [12:17:44] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Retain old hosts as well [debs/wmf-sre-laptop] - 10https://gerrit.wikimedia.org/r/880948 (owner: 10Muehlenhoff) [12:22:50] 10SRE, 10Acme-chief, 10Traffic: Ci check for acme-chief changes - https://phabricator.wikimedia.org/T326942 (10LSobanski) [12:31:42] (03PS1) 10Btullis: Duplicate existing secrets for profile::ceph to profile::cloudceph [labs/private] - 10https://gerrit.wikimedia.org/r/880949 (https://phabricator.wikimedia.org/T326945) [12:35:01] !log installing ipython security updates [12:35:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:08] (03PS2) 10Btullis: Duplicate existing secrets for profile::ceph to profile::cloudceph [labs/private] - 10https://gerrit.wikimedia.org/r/880949 (https://phabricator.wikimedia.org/T326945) [12:35:31] !log jelto@cumin1001 START - Cookbook sre.gitlab.upgrade [12:42:42] (03PS2) 10Btullis: Rename ceph profiles to cloudceph [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) [12:44:41] (03CR) 10Krinkle: "Puppet compiler result: https://puppet-compiler.wmflabs.org/output/880561/39149/" [puppet] - 10https://gerrit.wikimedia.org/r/880561 (https://phabricator.wikimedia.org/T253547) (owner: 10Krinkle) [12:45:00] (03CR) 10Btullis: [V: 03+2 C: 03+2] Duplicate existing secrets for profile::ceph to profile::cloudceph [labs/private] - 10https://gerrit.wikimedia.org/r/880949 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [12:45:30] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [12:46:17] jnuche: yeah totally! thanks for checking in [12:46:48] godog: thx! 
[12:47:30] (03PS3) 10Btullis: Rename ceph profiles to cloudceph [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) [12:48:09] for sure [12:48:29] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39152/console" [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [12:48:40] RECOVERY - Host elastic2077 is UP: PING OK - Packet loss = 0%, RTA = 33.75 ms [12:48:40] RECOVERY - Host elastic2063 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [12:48:40] RECOVERY - Host ml-cache2002 is UP: PING OK - Packet loss = 0%, RTA = 33.25 ms [12:48:40] RECOVERY - Host elastic2078 is UP: PING OK - Packet loss = 0%, RTA = 34.46 ms [12:48:40] RECOVERY - Host cp2031 is UP: PING OK - Packet loss = 0%, RTA = 33.24 ms [12:48:41] RECOVERY - Host lvs2008 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [12:48:41] RECOVERY - Host mc2043 is UP: PING OK - Packet loss = 0%, RTA = 33.26 ms [12:48:42] RECOVERY - Host elastic2057 is UP: PING OK - Packet loss = 0%, RTA = 33.14 ms [12:48:42] RECOVERY - Host ms-fe2010 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [12:48:43] RECOVERY - Host elastic2041 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [12:48:44] RECOVERY - Host ms-be2041 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [12:48:44] RECOVERY - Host cp2032 is UP: PING OK - Packet loss = 0%, RTA = 33.14 ms [12:48:46] Oh hello [12:48:48] RECOVERY - Host elastic2064 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [12:48:56] RECOVERY - Host elastic2042 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [12:48:59] What's a host like you doing in a place like this :p [12:49:08] RECOVERY - Host thanos-fe2002 is UP: PING WARNING - Packet loss = 90%, RTA = 33.15 ms [12:49:36] PROBLEM - Host ldap-corp2001 is DOWN: PING CRITICAL - Packet loss = 100% [12:49:41] PROBLEM - Host db2148 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:49:48] PROBLEM - Host logstash2036 is DOWN: PING CRITICAL - Packet loss = 100% [12:49:48] PROBLEM - Host logstash2024 is DOWN: PING CRITICAL - Packet loss = 100% [12:49:48] PROBLEM - Host aqs2007 is DOWN: PING CRITICAL - Packet loss = 100% [12:49:48] PROBLEM - Host aqs2008 is DOWN: PING CRITICAL - Packet loss = 100% [12:49:51] PROBLEM - Host db2177 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:49:51] PROBLEM - Host conf2004 is DOWN: PING CRITICAL - Packet loss = 100% [12:49:51] PROBLEM - Host db2160 is DOWN: PING CRITICAL - Packet loss = 100% [12:49:53] PROBLEM - Host db2124 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:49:58] PROBLEM - Host poolcounter2004 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:00] PROBLEM - Host db2159 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:50:01] PROBLEM - Host db2107 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:50:02] PROBLEM - Host db2108 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:50:04] PROBLEM - Host mc-gp2002 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:05] PROBLEM - Host pc2012 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:50:06] PROBLEM - Host kafka-main2002 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:09] around [12:50:10] PROBLEM - Host mc2044 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:14] PROBLEM - Host ganeti2021 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:19] PROBLEM - Host db2096 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:50:20] PROBLEM - Host db2123 #page is DOWN: PING 
CRITICAL - Packet loss = 100% [12:50:20] PROBLEM - Host parse2008 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:26] PROBLEM - Host aqs2006 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:26] PROBLEM - Host aqs2005 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:26] PROBLEM - Host cp2034 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:27] PROBLEM - Host db2111 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:50:31] same [12:50:33] PROBLEM - Host db2137 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:50:35] PROBLEM - Host db2163 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:50:35] PROBLEM - Host db2164 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:50:37] PROBLEM - Host es2029 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:50:37] PROBLEM - Host es2025 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:50:38] PROBLEM - Host elastic2080 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:38] PROBLEM - Host urldownloader2002 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:38] PROBLEM - Host mw2270 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:38] PROBLEM - Host mw2261 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:38] PROBLEM - Host mw2262 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:38] PROBLEM - Host mw2334 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:43] PROBLEM - Host db2147 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:50:43] PROBLEM - Host parse2007 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:44] PROBLEM - Host db2178 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:50:46] PROBLEM - Host db2134 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:47] PROBLEM - Host db2161 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:50:54] PROBLEM - Host irc2001 is DOWN: PING CRITICAL - Packet loss = 100% [12:50:59] me too [12:51:02] PROBLEM - Host backup2005 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:06] PROBLEM - Host logstash2025 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:06] PROBLEM - Host maps2006 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:06] PROBLEM - Host ms-be2067 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:15] PROBLEM - Host db2109 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:51:18] PROBLEM - Host thumbor2003 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:18] PROBLEM - Host thumbor2004 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:18] PROBLEM - Host wcqs2001 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:22] PROBLEM - Host wdqs2007 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:22] PROBLEM - Host wdqs2005 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:29] PROBLEM - Host db2162 #page is DOWN: PING CRITICAL - Packet loss = 100% [12:51:29] PROBLEM - Host backup2008 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:29] PROBLEM - Host mw2328 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:29] PROBLEM - Host restbase2014 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:29] PROBLEM - Host mw2263 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:30] PROBLEM - Host graphite2004 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:30] PROBLEM - Host mw2323 is DOWN: PING CRITICAL - Packet loss = 100% [12:51:42] Ok cool thanks libera [12:51:59] codfw apps are depooled, right? 
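The "codfw apps are depooled, right?" question above is answerable from conftool state. A hedged sketch using the same object/selector syntax as the conftool SAL entries later in this incident (the dnsdisc name is one that appears further down; exact confctl flags may differ from what responders actually ran):

  # read the current pooled state of one discovery service in codfw
  confctl --object-type discovery select 'dnsdisc=restbase-async,name=codfw' get
  # the corresponding depool, matching the logged 'set/pooled=false' actions below
  confctl --object-type discovery select 'dnsdisc=restbase-async,name=codfw' set/pooled=false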
[12:52:03] (ProbeDown) firing: (5) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:52:45] PROBLEM - Host cp2031 is DOWN: PING CRITICAL - Packet loss = 100% [12:52:45] PROBLEM - Host elastic2041 is DOWN: PING CRITICAL - Packet loss = 100% [12:52:46] PROBLEM - Host ms-fe2010 is DOWN: PING CRITICAL - Packet loss = 100% [12:52:46] (JobUnavailable) firing: (4) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:52:48] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 129, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [12:52:56] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [12:53:04] I better push off that scap presync deploy... [12:53:12] PROBLEM - Host thanos-fe2002 is DOWN: PING CRITICAL - Packet loss = 100% [12:53:12] yes please [12:53:21] NEL increased, probably some caches and frontends caught [12:53:24] PROBLEM - BFD status on cr2-codfw is CRITICAL: CRIT: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:53:29] PROBLEM - VRRP status on cr1-codfw is CRITICAL: VRRP CRITICAL - 5 inconsistent interfaces, 0 misconfigured interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [12:53:33] (virtual-chassis crash) firing: Alert for device asw-b-codfw.mgmt.codfw.wmnet - virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [12:53:36] PROBLEM - restbase endpoints health on restbase2015 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:53:36] PROBLEM - BFD status on cr1-codfw is CRITICAL: CRIT: Down: 3 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [12:53:38] PROBLEM - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:53:40] PROBLEM - haproxy failover on dbproxy2003 is CRITICAL: CRITICAL check_failover servers up 0 down 3: https://wikitech.wikimedia.org/wiki/HAProxy [12:53:42] PROBLEM - restbase endpoints health on restbase2022 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:53:43] XioNoX you around ? 
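The alerts above (virtual-chassis crash on asw-b-codfw, BFD sessions down and VRRP inconsistencies on cr1/cr2-codfw, router interfaces down) point at the row B switch stack rather than at individual services. A sketch of read-only checks that would confirm that picture, using standard Junos show commands over ssh; the device addresses are the ones given in the alert text, and this is illustrative rather than a record of what was actually run:

  # from a host with access to the network devices
  ssh asw-b-codfw.mgmt.codfw.wmnet 'show virtual-chassis status'   # which stack members are still present
  ssh asw-b-codfw.mgmt.codfw.wmnet 'show chassis alarms'
  ssh 208.80.153.193 'show bfd session'                            # cr2-codfw: the sessions the BFD check reports down
  ssh 208.80.153.193 'show interfaces terse | match down'          # behind the "interfaces up: 129, down: 2" alert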
[12:53:46] PROBLEM - restbase endpoints health on restbase2017 is CRITICAL: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch url http://10.192.48.119:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.192.48.119:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:53:48] PROBLEM - restbase endpoints health on restbase2023 is CRITICAL: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:53:48] PROBLEM - restbase endpoints health on restbase2025 is CRITICAL: /en.wikipedia.org/v1/page/mobile-sections/{title} (Get mobile-sections for a test page on enwiki) timed out before a response was received: /en.wikipedia.org/v1/page/talk/{title} (Get structured talk page for enwiki Salt article) timed out before a response was received: /en.wikipedia.org/v1/media/math/check/{type} (Mathoid - check test formula) is CRITICAL: Could not fetch [12:53:48] ://10.192.32.71:7231/en.wikipedia.org/v1/media/math/check/tex: Timeout on connection while downloading http://10.192.32.71:7231/en.wikipedia.org/v1/media/math/check/tex https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:53:50] PROBLEM - MariaDB Replica IO: s2 on db2125 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2107.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2107.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:53:54] PROBLEM - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is CRITICAL: /v1/page/{language}/{title} (Fetch enwiki protected page) timed out before a response was received https://wikitech.wikimedia.org/wiki/CX [12:53:56] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: WARNING:urllib3.connectionpool:Retrying (Retry(total=0, connect=None, read=None, redirect=0, status=None)) after connection broken by ConnectTimeoutError(urllib3.connection.VerifiedHTTPSConnection object at 0x7fb4559fcd68, Connection to restbase.svc.codfw.wmnet timed out. 
(connect timeout=15)): /en.wikipedia.org/v1/?spec https://wikitech.wikimedia.org/wiki/RESTBase [12:53:56] PROBLEM - MariaDB Replica IO: s7 on db2095 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2159.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2159.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:53:58] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2096.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:54:10] PROBLEM - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is CRITICAL: /{domain}/v1/feed/onthisday/{type}/{month}/{day} (retrieve selected events on January 15) timed out before a response was received: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most read articles for Januar [12:54:10] 6) timed out before a response was received: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received: /{domain}/v1/page/news (get In the News content) timed out before a response was received https://wikitech.wikimedia.org/wiki/Wikifeeds [12:54:10] PROBLEM - restbase endpoints health on restbase2018 is CRITICAL: /en.wikipedia.org/v1/page/media-list/{title} (Get media-list from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:54:13] claime: see _security [12:54:16] ack [12:54:17] (AppserversUnreachable) firing: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [12:54:20] PROBLEM - haproxy failover on dbproxy2001 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [12:54:23] (AppserversUnreachable) firing: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [12:54:24] PROBLEM - MariaDB Replica IO: s2 on db2097 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2107.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2107.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:54:24] PROBLEM - MariaDB Replica IO: s2 on db2104 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2107.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 
message: Cant connect to MySQL server on db2107.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:54:28] (AppserversUnreachable) firing: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [12:54:30] PROBLEM - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [12:54:35] (03PS4) 10Btullis: Rename ceph profiles to cloudceph [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) [12:54:52] PROBLEM - MariaDB Replica IO: s2 on db2138 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2107.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2107.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:54:58] PROBLEM - MariaDB Replica IO: s2 on db2170 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2107.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2107.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:55:00] PROBLEM - MariaDB Replica IO: s2 on db2175 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2107.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on db2107.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:55:02] PROBLEM - MariaDB Replica IO: s2 on db2126 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2107.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2107.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:55:08] PROBLEM - MariaDB Replica IO: es4 on es2020 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2021.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on es2021.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:55:12] PROBLEM - MariaDB Replica IO: x1 on db2115 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2096.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:55:12] PROBLEM - MariaDB Replica IO: x1 on db2131 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 
message: Cant connect to MySQL server on db2096.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:55:14] RECOVERY - restbase endpoints health on restbase2015 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:55:16] RECOVERY - restbase endpoints health on restbase2022 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:55:18] PROBLEM - MariaDB Replica IO: es4 on es2022 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2021.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es2021.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [12:55:22] RECOVERY - restbase endpoints health on restbase2017 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:55:22] RECOVERY - restbase endpoints health on restbase2023 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:55:22] RECOVERY - restbase endpoints health on restbase2025 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:55:26] (KubernetesCalicoDown) firing: kubernetes2010.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2010.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:55:28] RECOVERY - Cxserver LVS codfw on cxserver.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/CX [12:55:31] (KubernetesCalicoDown) firing: ml-serve2006.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2006.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:55:34] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:55:37] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39153/console" [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [12:55:46] PROBLEM - Maps edge codfw on upload-lb.codfw.wikimedia.org is CRITICAL: /osm-intl/info.json (tile service info for osm-intl) timed out before a response was received: /private-info/info.json (private tile service info for osm-intl) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff@2x.png (scaled pushpin marker with an icon) timed out before a response was received: /v4/marker/pin-m-fuel+ffffff.png (Untitled test) time [12:55:46] fore a response was received: /v4/marker/pin-m+ffffff.png (Untitled test) timed out before a response was received: /v4/marker/pin-m+ffffff@2x.png (Untitled test) timed out before a response was received: /_info (Untitled test) timed out before a response was received https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:55:46] RECOVERY - restbase endpoints health on restbase2018 is OK: All 
endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [12:55:46] RECOVERY - wikifeeds codfw on wikifeeds.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Wikifeeds [12:55:55] (LogstashNoLogsIndexed) firing: Logstash logs are not being indexed by Elasticsearch - https://wikitech.wikimedia.org/wiki/Logstash#No_logs_indexed - https://grafana.wikimedia.org/d/000000561/logstash?var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashNoLogsIndexed [12:55:57] (03PS1) 10Jcrespo: dns: Depool all of codfw [dns] - 10https://gerrit.wikimedia.org/r/880952 [12:56:03] (KubernetesCalicoDown) firing: kubestage2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-staging&var-instance=kubestage2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:56:08] (KubernetesCalicoDown) firing: ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlstaging&var-instance=ml-staging-ctrl2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [12:56:18] (ProbeDown) firing: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:56:32] (Emergency syslog message) firing: Alert for device asw-b-codfw.mgmt.codfw.wmnet - Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [12:56:47] (03CR) 10Ayounsi: [C: 03+1] dns: Depool all of codfw [dns] - 10https://gerrit.wikimedia.org/r/880952 (owner: 10Jcrespo) [12:56:49] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [12:56:58] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [12:56:58] (KubernetesAPILatency) firing: (4) High Kubernetes API latency (LIST destinationrules) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [12:57:00] (03CR) 10Jcrespo: [C: 03+2] dns: Depool all of codfw [dns] - 10https://gerrit.wikimedia.org/r/880952 (owner: 10Jcrespo) [12:57:03] (ProbeDown) firing: (9) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [12:57:04] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-logging2001 is CRITICAL: 76 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2001 [12:57:05] RECOVERY - Maps edge codfw on upload-lb.codfw.wikimedia.org is OK: 
All endpoints are healthy https://wikitech.wikimedia.org/wiki/Maps/RunBook [12:57:07] (03CR) 10Jelto: [C: 03+1] dns: Depool all of codfw [dns] - 10https://gerrit.wikimedia.org/r/880952 (owner: 10Jcrespo) [12:57:46] (JobUnavailable) firing: (17) Reduced availability for job benthos in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [12:58:42] RECOVERY - haproxy failover on dbproxy2004 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [12:58:52] PROBLEM - Check systemd state on mwmaint2002 is CRITICAL: CRITICAL - degraded: The following units failed: fetch_dbconfig.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:59:16] PROBLEM - carbon-frontend-relay metric drops on graphite1005 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=21 https://grafana.wikimedia.org/d/000000337/graphite-codfw?orgId=1&viewPanel=21 [13:00:12] (KubernetesCalicoDown) firing: (4) kubernetes2006.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:00:16] (KubernetesCalicoDown) firing: (2) ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:00:30] RECOVERY - Check systemd state on mwmaint2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:00:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:00:52] RECOVERY - carbon-frontend-relay metric drops on graphite1005 is OK: OK: Less than 80.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=21 https://grafana.wikimedia.org/d/000000337/graphite-codfw?orgId=1&viewPanel=21 [13:01:18] (ProbeDown) resolved: Service text-https:443 has failed probes (http_text-https_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text-https:443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:01:36] !log oblivian@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=restbase-async,name=.* [13:01:46] !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=restbase-async,name=codfw [13:01:49] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [13:01:55] (LogstashKafkaConsumerLag) firing: 
Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [13:02:03] (ProbeDown) firing: (14) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:02:18] (ProbeDown) firing: (4) Service text-https:443 has failed probes (http_text-https_ip4) #page - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:03:36] PROBLEM - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [13:04:35] (KafkaUnderReplicatedPartitions) firing: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:05:18] PROBLEM - MariaDB Replica Lag: s2 on db2138 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 978.63 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:05:24] PROBLEM - MariaDB Replica Lag: s2 on db2104 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 986.07 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:05:26] PROBLEM - MariaDB Replica Lag: s2 on db2175 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 988.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:05:28] PROBLEM - MariaDB Replica Lag: s7 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 989.74 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:05:30] PROBLEM - MariaDB Replica Lag: s2 on db2126 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 991.44 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:05:46] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2004 is CRITICAL: 68 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2004 [13:05:46] PROBLEM - MariaDB Replica Lag: s2 on db2125 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1007.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:05:48] PROBLEM - MariaDB Replica Lag: s2 on db2170 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1009.90 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:05:52] PROBLEM - MariaDB Replica Lag: s2 on db2095 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1013.61 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:05:54] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2005 is CRITICAL: 65 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration 
https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2005 [13:05:54] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2001 is CRITICAL: 302 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2001 [13:05:56] PROBLEM - MariaDB Replica Lag: es4 on es2020 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1017.68 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:05:58] PROBLEM - MariaDB Replica Lag: es4 on es2022 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1019.57 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:06:00] PROBLEM - MariaDB Replica Lag: s2 on db2097 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1021.53 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:06:14] PROBLEM - Kafka Broker Under Replicated Partitions on kafka-main2003 is CRITICAL: 361 ge 10 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2003 [13:06:16] PROBLEM - MariaDB Replica Lag: x1 on db2115 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1038.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:06:24] PROBLEM - MariaDB Replica Lag: x1 on db2131 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1045.76 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:06:32] (Emergency syslog message) resolved: Device asw-b-codfw.mgmt.codfw.wmnet recovered from Emergency syslog message - https://alerts.wikimedia.org/?q=alertname%3DEmergency+syslog+message [13:06:42] PROBLEM - MariaDB Replica Lag: x1 on db2101 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1062.65 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:07:03] (ProbeDown) firing: (16) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:07:18] (ProbeDown) firing: (7) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:07:43] (03PS1) 10Muehlenhoff: Failover irc.w.o to irc1001 [dns] - 10https://gerrit.wikimedia.org/r/880954 [13:07:49] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [13:08:48] !log jelto@cumin1001 END (FAIL) - Cookbook sre.gitlab.upgrade (exit_code=99) [13:08:58] PROBLEM - carbon-frontend-relay metric drops on graphite1005 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [100.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=21 
https://grafana.wikimedia.org/d/000000337/graphite-codfw?orgId=1&viewPanel=21 [13:09:17] (03CR) 10Muehlenhoff: [C: 03+2] Failover irc.w.o to irc1001 [dns] - 10https://gerrit.wikimedia.org/r/880954 (owner: 10Muehlenhoff) [13:10:04] RECOVERY - haproxy failover on dbproxy2004 is OK: OK check_failover servers up 1 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:10:49] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:12:12] RECOVERY - carbon-frontend-relay metric drops on graphite1005 is OK: OK: Less than 80.00% above the threshold [25.0] https://wikitech.wikimedia.org/wiki/Graphite%23Operations_troubleshooting https://grafana.wikimedia.org/d/000000020/graphite-eqiad?orgId=1&viewPanel=21 https://grafana.wikimedia.org/d/000000337/graphite-codfw?orgId=1&viewPanel=21 [13:12:42] PROBLEM - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is CRITICAL: 104 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:12:44] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 127, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [13:13:33] (virtual-chassis crash) resolved: Device asw-b-codfw.mgmt.codfw.wmnet recovered from virtual-chassis crash - https://alerts.wikimedia.org/?q=alertname%3Dvirtual-chassis+crash [13:13:34] !log oblivian@cumin1001 START - Cookbook sre.discovery.service-route check citoid: maintenance [13:13:34] !log oblivian@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) check citoid: maintenance [13:14:01] !log oblivian@cumin1001 START - Cookbook sre.discovery.service-route depool mobileapps in codfw: maintenance [13:14:18] RECOVERY - MediaWiki exceptions and fatals per minute for parsoid on alert1001 is OK: (C)100 gt (W)50 gt 30 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:14:52] PROBLEM - haproxy failover on dbproxy2004 is CRITICAL: CRITICAL check_failover servers up 1 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [13:14:58] PROBLEM - configured eth on lvs2009 is CRITICAL: ens3f0np0 reporting no carrier. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [13:15:36] !log mvernon@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=swift,name=codfw [13:16:22] (03PS1) 10BBlack: Depool all services in codfw (dnsdisc) [dns] - 10https://gerrit.wikimedia.org/r/880956 [13:16:24] RECOVERY - VRRP status on cr1-codfw is OK: VRRP OK - 0 misconfigured interfaces, 0 inconsistent interfaces https://wikitech.wikimedia.org/wiki/Network_monitoring%23VRRP_status [13:16:25] (FNMNotReported) firing: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [13:16:26] RECOVERY - Host mw2311 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [13:16:26] RECOVERY - Host parse2009 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [13:16:26] RECOVERY - Host cp2034 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [13:16:26] RECOVERY - Host conf2004 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [13:16:26] RECOVERY - Host mw2262 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [13:16:27] RECOVERY - Host mw2322 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [13:16:27] RECOVERY - Host ores2004 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [13:16:28] RECOVERY - Host restbase2019 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [13:16:28] RECOVERY - Host mw2260 is UP: PING OK - Packet loss = 0%, RTA = 33.13 ms [13:16:29] RECOVERY - Host mw2334 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [13:16:29] RECOVERY - Host es2029 #page is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [13:16:30] RECOVERY - Host mw2259 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [13:16:30] RECOVERY - Host parse2006 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [13:16:31] RECOVERY - Host mc2046 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [13:16:31] (CirrusSearchJVMGCYoungPoolInsufficient) firing: Elasticsearch instance elastic2078-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [13:16:31] RECOVERY - Host ores2003 is UP: PING OK - Packet loss = 0%, RTA = 33.27 ms [13:16:46] (ThanosSidecarBucketOperationsFailed) firing: (6) Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarBucketOperationsFailed [13:16:58] (KubernetesAPILatency) firing: (5) High Kubernetes API latency (LIST destinationrules) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:17:03] (ProbeDown) firing: (15) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:17:55] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-main2003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=main-codfw&var-kafka_broker=kafka-main2003 [13:17:59] RECOVERY - MariaDB Replica Lag: x1 on db2115 is OK: OK slave_sql_lag Replication lag: 0.00 
seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:18:05] RECOVERY - MariaDB Replica IO: s2 on db2138 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:18:05] RECOVERY - MariaDB Replica Lag: x1 on db2131 is OK: OK slave_sql_lag Replication lag: 0.43 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:18:05] RECOVERY - Host kubernetes2020 is UP: PING OK - Packet loss = 0%, RTA = 33.29 ms [13:18:09] RECOVERY - MariaDB Replica IO: s2 on db2170 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:18:11] RECOVERY - MariaDB Replica IO: s2 on db2175 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:18:15] RECOVERY - MariaDB Replica IO: s2 on db2126 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:18:23] RECOVERY - MariaDB Replica Lag: x1 on db2101 is OK: OK slave_sql_lag Replication lag: 0.12 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:18:23] RECOVERY - MariaDB Replica IO: x1 on db2115 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:18:23] RECOVERY - MariaDB Replica IO: x1 on db2131 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:18:27] RECOVERY - MariaDB Replica IO: es4 on es2022 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:18:35] RECOVERY - haproxy failover on dbproxy2003 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:19:05] !log oblivian@cumin1001 END (PASS) - Cookbook sre.discovery.service-route (exit_code=0) depool mobileapps in codfw: maintenance [13:19:33] PROBLEM - Etcd cluster health on ml-staging-etcd2002 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [13:19:35] PROBLEM - haproxy failover on dbproxy2002 is CRITICAL: CRITICAL check_failover servers up 2 down 1: https://wikitech.wikimedia.org/wiki/HAProxy [13:19:47] PROBLEM - cassandra-a CQL 10.192.16.183:9042 on aqs2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:19:55] PROBLEM - cassandra-a CQL 10.192.16.153:9042 on restbase2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:20:17] PROBLEM - Check systemd state on registry2004 is CRITICAL: CRITICAL - degraded: The following units failed: build-homepage.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:20:31] PROBLEM - aqs endpoints health on aqs2006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected [13:20:31] 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:20:33] RECOVERY - 
MariaDB Replica Lag: s7 on db2095 is OK: OK slave_sql_lag Replication lag: 0.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:20:35] PROBLEM - cassandra-a CQL 10.192.16.82:9042 on restbase2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:20:37] PROBLEM - Etcd cluster health on conf2004 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [13:20:39] PROBLEM - Etcd cluster health on ml-etcd2001 is CRITICAL: The etcd server is unhealthy https://wikitech.wikimedia.org/wiki/Etcd [13:20:43] PROBLEM - aqs endpoints health on aqs2007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:20:45] PROBLEM - cassandra-a CQL 10.192.16.95:9042 on sessionstore2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:20:47] PROBLEM - cassandra-b CQL 10.192.16.187:9042 on aqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:20:47] PROBLEM - cassandra-c CQL 10.192.16.84:9042 on restbase2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:20:49] PROBLEM - cassandra-c CQL 10.192.16.155:9042 on restbase2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:20:49] PROBLEM - cassandra-a CQL 10.192.16.186:9042 on aqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:20:49] PROBLEM - aqs endpoints health on aqs2008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Test Get aggregate mediarequests returne [13:20:49] expected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:20:51] PROBLEM - cassandra-b CQL 10.192.16.112:9042 on restbase2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:20:55] PROBLEM - Wikidough DoH Check -IPv6- on doh2002 is CRITICAL: connect to address 2620:0:860:2:208:80:153:38 and port 443: No route to host https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [13:20:55] PROBLEM - cassandra-b CQL 10.192.16.99:9042 on restbase2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:20:57] RECOVERY - MariaDB Replica Lag: s2 on db2095 is OK: OK slave_sql_lag Replication lag: 0.01 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:20:59] PROBLEM - cassandra-b CQL 10.192.16.189:9042 on aqs2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:21:03] PROBLEM - Widespread puppet agent failures- no resources reported on alert1001 is CRITICAL: 0.06778 ge 0.01 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:21:05] RECOVERY - MariaDB 
Replica Lag: s2 on db2097 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:21:07] PROBLEM - Check systemd state on prometheus2005 is CRITICAL: CRITICAL - degraded: The following units failed: generate-mysqld-exporter-config.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:21:11] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per fi [13:21:11] sts returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:21:13] PROBLEM - cassandra-a CQL 10.192.16.111:9042 on restbase2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:21:13] PROBLEM - cassandra-b CQL 10.192.16.83:9042 on restbase2013 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:21:19] PROBLEM - cassandra-c CQL 10.192.16.113:9042 on restbase2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:21:19] PROBLEM - cassandra-a CQL 10.192.16.85:9042 on restbase2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:21:21] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [13:21:30] (03PS1) 10Zabe: objectcache: Fix DI for MultiWriteBagOStuff sub caches [core] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880908 (https://phabricator.wikimedia.org/T327158) [13:21:37] PROBLEM - cassandra-b CQL 10.192.16.185:9042 on aqs2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:21:38] (03CR) 10Zabe: [C: 03+2] objectcache: Fix DI for MultiWriteBagOStuff sub caches [core] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880908 (https://phabricator.wikimedia.org/T327158) (owner: 10Zabe) [13:21:39] PROBLEM - Wikidough DoT Check -IPv6- on doh2002 is CRITICAL: connect to address 2620:0:860:2:208:80:153:38 and port 853: No route to host https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [13:21:47] PROBLEM - cassandra-c CQL 10.192.16.87:9042 on restbase2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:21:47] PROBLEM - cassandra-b CQL 10.192.16.86:9042 on restbase2014 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:21:51] PROBLEM - cassandra-a CQL 10.192.16.98:9042 on restbase2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:21:51] PROBLEM - cassandra-c CQL 10.192.16.100:9042 on restbase2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:21:53] PROBLEM - cassandra-b CQL 10.192.16.154:9042 on restbase2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:21:55] PROBLEM - Etcd cluster health on kubetcd2006 is CRITICAL: The etcd server is unhealthy 
https://wikitech.wikimedia.org/wiki/Etcd [13:22:01] PROBLEM - cassandra-a CQL 10.192.16.188:9042 on aqs2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:22:02] zabe: please don't deploy anythingg atm, see _security [13:22:09] PROBLEM - cassandra-a CQL 10.192.16.174:9042 on aqs2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:22:09] PROBLEM - cassandra-b CQL 10.192.16.179:9042 on aqs2005 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [13:22:15] sure, +2'ed to get CI running [13:22:53] RECOVERY - haproxy failover on dbproxy2002 is OK: OK check_failover servers up 2 down 0: https://wikitech.wikimedia.org/wiki/HAProxy [13:23:06] you shouldn't +2 before you're actually ready to deploy, just to ensure the deployment branch doesn't get out of sync with production [13:23:35] ok, will wait [13:24:33] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:26:23] <_joe_> !log depooling all services in codfw [13:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:27:41] RECOVERY - MariaDB Replica Lag: es4 on es2020 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:27:46] !log restarting manually replication on es2020, may require data check afterwards [13:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:21] RECOVERY - MariaDB Replica IO: es4 on es2020 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [13:30:59] RECOVERY - Check systemd state on prometheus2005 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:32:01] RECOVERY - aqs endpoints health on aqs2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:32:13] (ProbeDown) firing: (51) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:32:25] PROBLEM - Host db2147 #page is DOWN: PING CRITICAL - Packet loss = 100% [13:32:54] RECOVERY - Host db2147 #page is UP: PING OK - Packet loss = 0%, RTA = 33.43 ms [13:33:34] PROBLEM - Host db2159 #page is DOWN: PING CRITICAL - Packet loss = 100% [13:33:53] PROBLEM - Host elastic2058 is DOWN: PING CRITICAL - Packet loss = 100% [13:33:56] (CirrusSearchNodeIndexingNotIncreasing) firing: (2) Elasticsearch instance elastic2042-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [13:34:00] RECOVERY - Host db2159 #page is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [13:34:19] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): 
/analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is C [13:34:19] Test Get aggregate mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:34:27] PROBLEM - Host elastic2044 is DOWN: PING CRITICAL - Packet loss = 100% [13:34:50] PROBLEM - Host db2164 #page is DOWN: PING CRITICAL - Packet loss = 100% [13:35:02] !log mvernon@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=thanos-swift,name=codfw [13:35:10] RECOVERY - Host db2164 #page is UP: PING OK - Packet loss = 0%, RTA = 33.26 ms [13:35:33] (JobUnavailable) firing: (142) Reduced availability for job alertmanager in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:35:38] !log mvernon@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=thanos-query,name=codfw [13:36:50] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) firing: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:37:07] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Data for Muhammad Jaziraly - https://phabricator.wikimedia.org/T327172 (10Muhammad_Yasser_Jazirahly_WMDE) [13:37:14] (AppserversUnreachable) resolved: Appserver unavailable for cluster api_appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=api_appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:37:19] (AppserversUnreachable) resolved: Appserver unavailable for cluster jobrunner at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=jobrunner - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:37:24] (AppserversUnreachable) resolved: Appserver unavailable for cluster appserver at codfw - https://wikitech.wikimedia.org/wiki/Application_servers - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1&var-datasource=codfw%20prometheus/ops&var-cluster=appserver - https://alerts.wikimedia.org/?q=alertname%3DAppserversUnreachable [13:37:38] (KafkaUnderReplicatedPartitions) resolved: Under replicated partitions for Kafka cluster main-codfw in codfw - https://wikitech.wikimedia.org/wiki/Kafka/Administration - https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&var-datasource=codfw%20prometheus/ops&var-kafka_cluster=main-codfw - https://alerts.wikimedia.org/?q=alertname%3DKafkaUnderReplicatedPartitions [13:37:40] !log jiji@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=recommendation-api,name=codfw [13:37:41] RECOVERY - Host elastic2058 is UP: PING OK - Packet loss = 0%, RTA = 33.62 ms [13:37:45] PROBLEM - Host elastic2080 is DOWN: PING CRITICAL - Packet loss = 100% [13:38:05] PROBLEM - Host ganeti2031 is DOWN: PING CRITICAL - Packet loss = 100% [13:38:09] (KubernetesCalicoDown) resolved: 
(18) kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:38:14] (KubernetesCalicoDown) resolved: (10) ml-serve-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:38:21] RECOVERY - Host ganeti2031 is UP: PING OK - Packet loss = 0%, RTA = 33.26 ms [13:38:48] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [13:38:53] (LogstashNoLogsIndexed) resolved: Logstash logs are not being indexed by Elasticsearch - https://wikitech.wikimedia.org/wiki/Logstash#No_logs_indexed - https://grafana.wikimedia.org/d/000000561/logstash?var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashNoLogsIndexed [13:39:04] (KubernetesCalicoDown) resolved: (2) kubestage2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:39:08] (KubernetesCalicoDown) firing: (4) ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [13:39:34] (FNMNotReported) resolved: FastNetMon metrics not reported - https://wikitech.wikimedia.org/wiki/Fastnetmon - https://w.wiki/8oU - https://alerts.wikimedia.org/?q=alertname%3DFNMNotReported [13:39:38] PROBLEM - Host db2163 #page is DOWN: PING CRITICAL - Packet loss = 100% [13:39:41] RECOVERY - Host elastic2044 is UP: PING OK - Packet loss = 0%, RTA = 33.28 ms [13:39:45] RECOVERY - Host db2163 #page is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms [13:40:01] (ThanosSidecarBucketOperationsFailed) resolved: (8) Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarBucketOperationsFailed [13:40:15] PROBLEM - Host elastic2079 is DOWN: PING CRITICAL - Packet loss = 100% [13:40:23] (KubernetesAPILatency) firing: (9) High Kubernetes API latency (LIST destinationrules) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [13:40:25] RECOVERY - aqs endpoints health on aqs2008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:40:27] (ThanosSidecarBucketOperationsFailed) firing: (8) Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarBucketOperationsFailed [13:40:31] (ProbeDown) firing: (60) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:41:45] PROBLEM - aqs endpoints health on aqs2006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get 
aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:42:27] RECOVERY - Host elastic2080 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [13:43:13] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [13:43:27] RECOVERY - Host elastic2079 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [13:43:47] RECOVERY - Widespread puppet agent failures- no resources reported on alert1001 is OK: (C)0.01 ge (W)0.006 ge 0.005575 https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/yOxVDGvWk/puppet [13:45:15] PROBLEM - aqs endpoints health on aqs2008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: [13:45:15] t per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:47:23] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [13:50:39] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [13:51:13] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:51:54] (03CR) 10Krinkle: [C: 03+1] objectcache: Fix DI for MultiWriteBagOStuff sub caches [core] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880908 (https://phabricator.wikimedia.org/T327158) (owner: 10Zabe) [13:52:25] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:53:02] PROBLEM - Host db2161 #page is DOWN: PING CRITICAL - Packet loss = 100% [13:53:14] RECOVERY - Host db2161 #page is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [13:54:13] PROBLEM - Host elastic2070 is DOWN: PING CRITICAL - Packet loss = 100% [13:54:31] (ProbeDown) firing: (54) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip4) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:54:35] RECOVERY - aqs endpoints health on aqs2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:55:27] PROBLEM - Host elastic2058 is DOWN: PING CRITICAL - Packet loss = 100% [13:56:01] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:56:49] (ThanosSidecarBucketOperationsFailed) resolved: (2) Thanos Sidecar bucket operations are failing - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/b19644bfbf0ec1e108027cce268d99f7/thanos-sidecar - https://alerts.wikimedia.org/?q=alertname%3DThanosSidecarBucketOperationsFailed [13:57:02] (JobUnavailable) firing: (142) Reduced availability for job alertmanager in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [13:57:19] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [13:57:31] RECOVERY - Host elastic2070 is UP: PING OK - Packet loss = 0%, RTA = 33.35 ms [13:57:55] PROBLEM - Host logstash2025 is DOWN: PING CRITICAL - Packet loss = 100% [13:57:59] RECOVERY - aqs endpoints health on aqs2007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [13:58:09] RECOVERY - Host logstash2025 is UP: PING OK - Packet loss = 0%, RTA = 33.38 ms [13:58:31] (RdfStreamingUpdaterFlinkProcessingLatencyIsHigh) resolved: Processing latency of WDQS_Streaming_Updater in codfw (k8s) is above 10 minutes - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/fdU5Zx-Mk/wdqs-streaming-updater - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkProcessingLatencyIsHigh [13:58:34] (03CR) 10Ottomata: [C: 03+1] "I think this should be fine. IIRC eventgate-analytics-external caches its stream config (removed for these streams in I58ae2db77313b4253c" [puppet] - 10https://gerrit.wikimedia.org/r/726852 (https://phabricator.wikimedia.org/T281103) (owner: 10Phedenskog) [13:58:39] (KubernetesRsyslogDown) firing: (2) rsyslog on kubernetes2010:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [13:58:44] (03CR) 10Ottomata: [C: 03+1] "TY!" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/879926 (https://phabricator.wikimedia.org/T281103) (owner: 10Krinkle) [13:58:51] PROBLEM - Host logstash2027 is DOWN: PING CRITICAL - Packet loss = 100% [13:59:05] RECOVERY - Host elastic2058 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [13:59:11] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [13:59:16] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [13:59:23] RECOVERY - Host logstash2027 is UP: PING OK - Packet loss = 0%, RTA = 33.08 ms [13:59:44] (RdfStreamingUpdaterNotEnoughTaskSlots) firing: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [14:00:04] (KubernetesAPILatency) firing: (76) High Kubernetes API latency (GET clusterinformations) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:00:04] RoanKattouw, Lucas_WMDE, Urbanecm, awight, and TheresNoTime: OwO what's this, a deployment window?? UTC afternoon backport window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230117T1400). nyaa~ [14:00:05] MatmaRex and awight: A patch you scheduled for UTC afternoon backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [14:00:05] Deploy window Mobileapps/RESTBase/Wikifeeds (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230117T1400) [14:00:29] MatmaRex: definitely not happening [14:00:30] PROBLEM - Host db2164 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:00:44] hi. oops :( [14:00:50] RECOVERY - Host db2164 #page is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [14:00:54] There are ongoing issues specially with restbase [14:01:09] help to monitor its health is welcome [14:01:11] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Data for Muhammad Jaziraly - https://phabricator.wikimedia.org/T327172 (10Ottomata) Approved. This should include analytics-privatedata-users, as well as ssh and kerberos access. [14:01:17] (network, not software) [14:01:36] will update topic [14:01:44] PROBLEM - Host db2109 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:01:47] (03CR) 10Zabe: [C: 03+1] "I can deploy this later when the ongoing incident is over (unless someone is faster than me)" [core] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880908 (https://phabricator.wikimedia.org/T327158) (owner: 10Zabe) [14:01:48] none of my changes affect restbase usage. 
but if there's an outage, i'll obviously reschedule [14:01:55] (CirrusSearchNodeIndexingNotIncreasing) resolved: (2) Elasticsearch instance elastic2042-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [14:02:17] RECOVERY - Host db2109 #page is UP: PING OK - Packet loss = 0%, RTA = 33.24 ms [14:02:20] (KubernetesRsyslogDown) firing: (3) rsyslog on kubernetes2009:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:02:39] (03PS1) 10Hashar: gerrit: remove /srv/gerrit/jvmlogs [puppet] - 10https://gerrit.wikimedia.org/r/880963 [14:02:42] awight: fyi: the B&C window is cancelled due to ongoing issues. [14:02:45] PROBLEM - Host elastic2079 is DOWN: PING CRITICAL - Packet loss = 100% [14:02:51] PROBLEM - Check unit status of httpbb_hourly_appserver on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:03:03] (RdfStreamingUpdaterNotEnoughTaskSlots) resolved: The flink session cluster rdf-streaming-updater in codfw (k8s) does not have enough task slots - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterNotEnoughTaskSlots [14:03:08] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:03:16] urbanecm: o/ thanks for the note! That's fine for me anyway, I've canceled my patch due to other issues ;-) [14:03:31] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [14:03:35] (KubernetesAPILatency) firing: (82) High Kubernetes API latency (GET clusterinformations) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:03:35] ok :) [14:03:37] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:03:44] PROBLEM - Host db2107 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:03:45] codfw is currently in a very unhappy state. 
for now wait [14:03:47] sorry for the trouble [14:04:18] RECOVERY - Host db2107 #page is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [14:04:25] PROBLEM - aqs endpoints health on aqs2007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is C [14:04:25] Test Get aggregate mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:04:35] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Data for Muhammad Jaziraly - https://phabricator.wikimedia.org/T327172 (10WMDE-leszek) I approved this request on WMDE's end, thank you [14:04:44] (KubernetesRsyslogDown) resolved: (3) rsyslog on kubernetes2009:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:05:03] PROBLEM - Host elastic2044 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:55] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [14:05:59] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:06:01] RECOVERY - aqs endpoints health on aqs2007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:06:03] (KubernetesAPILatency) firing: (88) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:06:13] RECOVERY - Host elastic2079 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [14:06:45] (JobUnavailable) firing: (7) Reduced availability for job ganeti in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:07:27] (KubernetesRsyslogDown) firing: rsyslog on ml-serve2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve2008 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:07:32] (KubernetesCalicoDown) firing: kubernetes2020.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s&var-instance=kubernetes2020.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:08:10] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [14:08:15] (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in codfw (k8s) is not
running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [14:08:41] PROBLEM - Host elastic2080 is DOWN: PING CRITICAL - Packet loss = 100% [14:09:01] (WdqsStreamingUpdaterFlinkJobNotRunning) firing: WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [14:09:06] (WcqsStreamingUpdaterFlinkJobNotRunning) firing: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [14:09:10] (KubernetesAPILatency) firing: (97) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:09:11] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [14:09:45] RECOVERY - Host elastic2044 is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms [14:10:19] RECOVERY - Check systemd state on registry2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:10:26] PROBLEM - Host db2159 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:10:44] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve2008:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve2008 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:10:48] (KubernetesCalicoDown) firing: (8) kubernetes2006.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:11:00] PROBLEM - Host db2178 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:11:16] (03PS1) 10Jgiannelos: chromium-render: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/880964 [14:11:24] RECOVERY - Host db2178 #page is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [14:11:26] RECOVERY - Host db2159 #page is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [14:11:30] (KubernetesCalicoDown) firing: ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlserve&var-instance=ml-serve2002.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:11:44] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [14:11:48] (KubernetesAPILatency) firing: (89) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:12:05] RECOVERY - Host 
elastic2080 is UP: PING OK - Packet loss = 0%, RTA = 33.52 ms [14:12:19] <_joe_> !log try to restart cassandra-a on aqs2005 [14:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:37] RECOVERY - aqs endpoints health on aqs2008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:12:57] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:13:44] (KubernetesCalicoDown) firing: (2) ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:13:53] (KubernetesAPILatency) firing: (83) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:13:59] (RoutinatorRsyncErrors) firing: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [14:15:46] PROBLEM - Host db2161 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:16:10] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: Elasticsearch instance elastic2078-production-search-psi-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [14:16:14] (KubernetesAPILatency) firing: (85) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:16:26] RECOVERY - Host db2161 #page is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [14:17:35] (KubernetesCalicoDown) firing: (8) kubernetes2006.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:18:41] PROBLEM - aqs endpoints health on aqs2006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per fi [14:18:41] sts returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:19:02] (03CR) 10Jgiannelos: [C: 03+2] chromium-render: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/880964 (owner: 10Jgiannelos) [14:19:04] (KubernetesAPILatency) firing: (78) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:19:28] (JobUnavailable) firing: (8) Reduced availability for job calico-felix in k8s@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:19:45] PROBLEM - Host db2164 #page is DOWN: PING 
CRITICAL - Packet loss = 100% [14:20:20] RECOVERY - Host db2164 #page is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [14:20:50] (KubernetesCalicoDown) firing: (17) kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:21:01] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Te [14:21:01] ggregate page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:22:13] PROBLEM - Host dbproxy2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:22:25] RECOVERY - Host dbproxy2002 is UP: PING OK - Packet loss = 0%, RTA = 33.24 ms [14:22:37] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:22:55] (KubernetesAPILatency) firing: (81) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:23:32] PROBLEM - Host db2123 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:23:35] RECOVERY - aqs endpoints health on aqs2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:23:55] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:24:08] (03Merged) 10jenkins-bot: chromium-render: Bump image version [deployment-charts] - 10https://gerrit.wikimedia.org/r/880964 (owner: 10Jgiannelos) [14:24:12] RECOVERY - Host db2123 #page is UP: PING OK - Packet loss = 0%, RTA = 33.24 ms [14:24:13] PROBLEM - Host elastic2079 is DOWN: PING CRITICAL - Packet loss = 100% [14:24:27] (KubernetesCalicoDown) firing: (18) kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:24:45] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [14:24:58] PROBLEM - Host db2107 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:25:20] RECOVERY - Host db2107 #page is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [14:25:33] (03CR) 10Klausman: [C: 03+1] kserve: upgrade to version 0.9 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/880499 (https://phabricator.wikimedia.org/T325528) (owner: 10Elukey) [14:25:35] (KubernetesAPILatency) firing: (85) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:25:41] PROBLEM - Host elastic2058 is DOWN: PING CRITICAL - Packet loss = 100% [14:26:26] !log jgiannelos@deploy1002 helmfile [staging] START helmfile.d/services/proton: apply [14:27:23] RECOVERY - Host elastic2079 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms [14:27:27] !log jgiannelos@deploy1002 helmfile [staging] DONE 
helmfile.d/services/proton: apply [14:27:41] (KubernetesAPILatency) firing: (84) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:28:27] PROBLEM - aqs endpoints health on aqs2006 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file [14:28:27] s returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:28:39] PROBLEM - aqs endpoints health on aqs2007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:29:07] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL: /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:29:10] (KubernetesCalicoDown) firing: (18) kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:29:37] RECOVERY - Host elastic2058 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [14:29:54] (KubernetesAPILatency) firing: (84) High Kubernetes API latency (POST apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:30:20] PROBLEM - Host db2147 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:30:54] (KubernetesCalicoDown) firing: (18) kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:31:24] RECOVERY - Host db2147 #page is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [14:32:21] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:32:33] (KubernetesAPILatency) firing: (87) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:34:01] (KubernetesCalicoDown) firing: (2) ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:34:19] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [14:34:31] (KubernetesAPILatency) firing: (89) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:34:37] PROBLEM - Host db2159 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:34:38] PROBLEM - Host db2108 #page is DOWN: PING CRITICAL - Packet 
loss = 100% [14:34:52] RECOVERY - Host db2108 #page is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [14:35:14] PROBLEM - Host db2137 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:35:14] PROBLEM - aqs endpoints health on aqs2008 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by p [14:35:14] s returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:35:20] RECOVERY - Host db2159 #page is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [14:35:48] RECOVERY - Host db2137 #page is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [14:36:03] (KubernetesCalicoDown) firing: (17) kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:36:15] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:36:43] RECOVERY - aqs endpoints health on aqs2007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:37:00] PROBLEM - Host es2025 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:37:12] RECOVERY - Host es2025 #page is UP: PING OK - Packet loss = 0%, RTA = 33.74 ms [14:37:12] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per file requests returned the unexpect [14:37:12] s 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Test Get aggregate mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:37:27] (KubernetesAPILatency) firing: (89) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:38:17] RECOVERY - cassandra-a CQL 10.192.16.82:9042 on restbase2013 is OK: TCP OK - 0.033 second response time on 10.192.16.82 port 9042 https://phabricator.wikimedia.org/T93886 [14:38:19] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [14:38:21] RECOVERY - cassandra-c CQL 10.192.16.84:9042 on restbase2013 is OK: TCP OK - 0.033 second response time on 10.192.16.84 port 9042 https://phabricator.wikimedia.org/T93886 [14:38:49] RECOVERY - cassandra-b CQL 10.192.16.83:9042 on restbase2013 is OK: TCP OK - 0.033 second response time on 10.192.16.83 port 9042 https://phabricator.wikimedia.org/T93886 [14:39:10] PROBLEM - Host db2111 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:39:26] PROBLEM - Host pc2012 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:40:12] RECOVERY - Host db2111 #page is UP: PING OK - Packet 
loss = 0%, RTA = 33.49 ms [14:40:13] RECOVERY - Host pc2012 #page is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [14:40:26] PROBLEM - Host db2109 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:40:50] PROBLEM - Host db2164 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:40:51] RECOVERY - Host db2109 #page is UP: PING OK - Packet loss = 0%, RTA = 33.24 ms [14:41:08] PROBLEM - Host db2177 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:41:08] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_ganeti_codfw_sync.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:15] RECOVERY - Host db2164 #page is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms [14:41:39] PROBLEM - aqs endpoints health on aqs2007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is CRITICAL: Te [14:41:39] ggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is C [14:41:39] Test Get aggregate mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:41:50] RECOVERY - Host db2177 #page is UP: PING OK - Packet loss = 0%, RTA = 33.31 ms [14:41:58] (KubernetesAPILatency) firing: (86) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:42:07] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:42:55] (03PS1) 10Slyngshede: C:apereo_cas Fix regex for IDM [puppet] - 10https://gerrit.wikimedia.org/r/880968 [14:43:01] (03CR) 10Atieno: [C: 03+1] WIP: Update Thumbor repository according to the latest changes [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/876229 (https://phabricator.wikimedia.org/T325811) (owner: 10Vlad.shapik) [14:45:35] PROBLEM - Host ml-serve2002 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:49] PROBLEM - Host elastic2070 is DOWN: PING CRITICAL - Packet loss = 100% [14:45:59] RECOVERY - Host ml-serve2002 is UP: PING OK - Packet loss = 0%, RTA = 33.41 ms [14:46:05] (KubernetesCalicoDown) firing: (2) ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:46:23] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7609 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:46:57] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/aggregate/{project}/{access}/{agent}/{granularity}/{start}/{end} (Get aggregate page views) is 
CRITICAL: Test Get aggregate page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/per-file/{referer}/{agent}/{file_path}/{granularity}/{start}/{end} (Get per file requests) is CRITICAL: Test Get per fi [14:46:57] sts returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Test Get aggregate mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:46:58] (KubernetesAPILatency) firing: (85) High Kubernetes API latency (LIST bgpconfigurations) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:47:03] PROBLEM - Host elastic2079 is DOWN: PING CRITICAL - Packet loss = 100% [14:48:23] PROBLEM - Host elastic2058 is DOWN: PING CRITICAL - Packet loss = 100% [14:48:39] PROBLEM - Host db2160 is DOWN: PING CRITICAL - Packet loss = 100% [14:48:49] PROBLEM - Host elastic2044 is DOWN: PING CRITICAL - Packet loss = 100% [14:48:58] (KubernetesCalicoDown) firing: (16) kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:49:18] (03CR) 10Hashar: "I have manually deleted the log files which were still in `/srv/gerrit/jvmlogs`." [puppet] - 10https://gerrit.wikimedia.org/r/880963 (owner: 10Hashar) [14:49:21] RECOVERY - Host db2160 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [14:49:40] PROBLEM - Host db2162 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:49:45] RECOVERY - Host elastic2070 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [14:50:18] RECOVERY - Host db2162 #page is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [14:50:41] RECOVERY - Host elastic2079 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [14:51:01] RECOVERY - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [14:51:09] PROBLEM - Cassandra instance data free space on restbase1017 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7603 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:51:38] (KubernetesCalicoDown) firing: (2) ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:51:50] PROBLEM - Host db2110 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:51:53] PROBLEM - Cassandra instance data free space on restbase1016 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7371 MB (19% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:52:06] (KubernetesAPILatency) firing: (84) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:52:14] !log disabling Cassandra hinted-handoff for codfw -- T327001 [14:52:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:19] T327001: asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 [14:52:30] RECOVERY - Host db2110 #page is UP: PING OK - Packet loss = 0%, RTA = 33.20 
ms [14:53:13] RECOVERY - Host elastic2058 is UP: PING OK - Packet loss = 0%, RTA = 33.24 ms [14:53:15] RECOVERY - Host elastic2044 is UP: PING OK - Packet loss = 0%, RTA = 33.68 ms [14:53:40] PROBLEM - Host db2137 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:53:43] PROBLEM - Cassandra instance data free space on restbase1018 is CRITICAL: DISK CRITICAL - free space: /srv/cassandra/instance-data 7489 MB (20% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [14:53:46] RECOVERY - Host db2137 #page is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [14:53:58] (KubernetesCalicoDown) firing: (15) kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [14:53:59] PROBLEM - Host elastic2080 is DOWN: PING CRITICAL - Packet loss = 100% [14:54:58] (KubernetesRsyslogDown) firing: rsyslog on ml-serve2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve2007 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [14:54:59] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:55:11] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/880968 (owner: 10Slyngshede) [14:55:51] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [14:56:20] PROBLEM - Host es2025 #page is DOWN: PING CRITICAL - Packet loss = 100% [14:56:27] !log truncating hints for Cassandra nodes in codfw row b -- T327001 [14:56:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:32] RECOVERY - Host es2025 #page is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [14:56:58] (KubernetesAPILatency) firing: (83) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [14:58:51] RECOVERY - Host elastic2080 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [14:59:49] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top/{project}/{access}/{year}/{month}/{day} (Get top page views) is CRITICAL: Test Get top page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected [14:59:49] 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [14:59:58] (KubernetesRsyslogDown) resolved: rsyslog on ml-serve2007:9105 is missing kubernetes logs - https://wikitech.wikimedia.org/wiki/Kubernetes/Logging#Common_issues - https://grafana.wikimedia.org/d/OagQjQmnk?var-server=ml-serve2007 - https://alerts.wikimedia.org/?q=alertname%3DKubernetesRsyslogDown [15:00:51] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [15:01:58] (KubernetesAPILatency) firing: (81) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:02:33] PROBLEM - Host ml-serve2006 is DOWN: PING CRITICAL - Packet loss = 100% [15:02:44] 
(03PS3) 10Elukey: kserve: upgrade to upstream version 0.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/880897 (https://phabricator.wikimedia.org/T325528) [15:03:13] PROBLEM - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:03:25] RECOVERY - Host ml-serve2006 is UP: PING OK - Packet loss = 0%, RTA = 33.33 ms [15:03:40] (03CR) 10CI reject: [V: 04-1] kserve: upgrade to upstream version 0.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/880897 (https://phabricator.wikimedia.org/T325528) (owner: 10Elukey) [15:04:03] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [15:05:45] RECOVERY - aqs endpoints health on aqs2007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:06:13] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:06:41] PROBLEM - Host elastic2070 is DOWN: PING CRITICAL - Packet loss = 100% [15:06:50] PROBLEM - Host db2107 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:07:02] RECOVERY - Host db2107 #page is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [15:07:11] (KubernetesAPILatency) firing: (77) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:08:37] PROBLEM - Host elastic2079 is DOWN: PING CRITICAL - Packet loss = 100% [15:10:30] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) firing: (2) Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:10:37] PROBLEM - aqs endpoints health on aqs2007 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/per-article/{project}/{access}/{agent}/{article}/{granularity}/{start}/{end} (Get per article page views) is CRITICAL: Test Get per article page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get to [15:10:37] ies by page views returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:10:55] RECOVERY - Host elastic2070 is UP: PING OK - Packet loss = 0%, RTA = 33.16 ms [15:10:58] (KubernetesCalicoDown) firing: (2) ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:11:03] PROBLEM - aqs endpoints health on aqs2005 is CRITICAL: /analytics.wikimedia.org/v1/pageviews/top-by-country/{project}/{access}/{year}/{month} (Get top countries by page views) is CRITICAL: Test Get top countries by page views returned the unexpected status 500 (expecting: 200): /analytics.wikimedia.org/v1/mediarequests/aggregate/{referer}/{media_type}/{agent}/{granularity}/{start}/{end} (Get aggregate mediarequests) is CRITICAL: Test Get [15:11:03] e mediarequests returned the unexpected status 500 (expecting: 200) https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:11:07] PROBLEM - Host elastic2058 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:09] RECOVERY - Cassandra instance data free space on 
restbase1016 is OK: DISK OK - free space: /srv/cassandra/instance-data 32283 MB (86% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:11:47] PROBLEM - Host db2160 is DOWN: PING CRITICAL - Packet loss = 100% [15:11:58] (KubernetesAPILatency) firing: (80) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:12:01] RECOVERY - Host db2160 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [15:12:01] RECOVERY - Cassandra instance data free space on restbase1017 is OK: DISK OK - free space: /srv/cassandra/instance-data 32083 MB (86% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:12:05] RECOVERY - Host elastic2079 is UP: PING OK - Packet loss = 0%, RTA = 33.11 ms [15:12:24] PROBLEM - Host es2025 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:12:25] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8572 bytes in 6.776 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:12:48] RECOVERY - Host es2025 #page is UP: PING OK - Packet loss = 0%, RTA = 33.95 ms [15:13:01] RECOVERY - Cassandra instance data free space on restbase1018 is OK: DISK OK - free space: /srv/cassandra/instance-data 32675 MB (87% inode=99%): https://wikitech.wikimedia.org/wiki/RESTBase%23instance-data [15:13:13] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49420 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [15:13:58] (KubernetesCalicoDown) firing: (14) kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:15:17] RECOVERY - Host elastic2058 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [15:15:30] PROBLEM - Host db2137 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:15:36] RECOVERY - Host db2137 #page is UP: PING OK - Packet loss = 0%, RTA = 33.26 ms [15:16:37] PROBLEM - Host backup2005 is DOWN: PING CRITICAL - Packet loss = 100% [15:16:43] PROBLEM - Host elastic2080 is DOWN: PING CRITICAL - Packet loss = 100% [15:16:55] RECOVERY - Host backup2005 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [15:16:58] (KubernetesAPILatency) firing: (85) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:17:06] PROBLEM - Host db2161 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:17:07] PROBLEM - Host irc2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:17:15] PROBLEM - Host elastic2044 is DOWN: PING CRITICAL - Packet loss = 100% [15:17:55] RECOVERY - Host irc2001 is UP: PING OK - Packet loss = 0%, RTA = 33.39 ms [15:18:08] RECOVERY - Host db2161 #page is UP: PING OK - Packet loss = 0%, RTA = 33.22 ms [15:20:49] RECOVERY - Host elastic2080 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [15:21:09] RECOVERY - Host elastic2044 is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms [15:21:58] (KubernetesAPILatency) firing: (86) High Kubernetes API latency (POST apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:22:27] RECOVERY - aqs endpoints health on aqs2005 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs 
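For context on the Cassandra steps logged at 14:52 and 14:56 above ("disabling Cassandra hinted-handoff for codfw" and "truncating hints for Cassandra nodes in codfw row b", T327001): these correspond to standard Cassandra nodetool operations. A minimal sketch of such a run from a cumin host is below; the 'A:cassandra-codfw' alias is purely illustrative, and the multi-instance production hosts would go through their per-instance nodetool wrappers rather than plain nodetool.

    # Stop queuing new hints for the unreachable row B peers (illustrative cumin alias)
    sudo cumin 'A:cassandra-codfw' 'nodetool disablehandoff'
    # Drop hints already accumulated so they are not replayed in bulk when the rack returns
    sudo cumin 'A:cassandra-codfw' 'nodetool truncatehints'

Once the rack is healthy again, 'nodetool enablehandoff' re-enables hint storage.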
[15:23:58] (KubernetesCalicoDown) firing: (14) kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:26:23] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [15:26:26] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1 day, 0:00:00 on db1173.eqiad.wmnet with reason: Maintenance [15:26:44] (03PS1) 10Vivian Rook: update haproxy to new paws cluster [puppet] - 10https://gerrit.wikimedia.org/r/880971 (https://phabricator.wikimedia.org/T326554) [15:26:58] (KubernetesAPILatency) firing: (87) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:27:28] PROBLEM - Host db2111 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:27:39] PROBLEM - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [15:28:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [15:28:16] RECOVERY - Host db2111 #page is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [15:28:16] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:21] PROBLEM - Host db2160 is DOWN: PING CRITICAL - Packet loss = 100% [15:28:33] RECOVERY - Host cp2032 is UP: PING WARNING - Packet loss = 77%, RTA = 33.11 ms [15:28:33] RECOVERY - Host mc2042 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [15:28:35] RECOVERY - Host elastic2064 is UP: PING OK - Packet loss = 0%, RTA = 33.12 ms [15:28:35] RECOVERY - Host mc2043 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [15:28:35] RECOVERY - Host elastic2077 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [15:28:35] RECOVERY - Host ms-fe2010 is UP: PING OK - Packet loss = 0%, RTA = 33.25 ms [15:28:35] RECOVERY - Host elastic2057 is UP: PING OK - Packet loss = 0%, RTA = 33.81 ms [15:28:35] RECOVERY - Host kafka-logging2002 is UP: PING OK - Packet loss = 0%, RTA = 33.19 ms [15:28:35] RECOVERY - Host elastic2041 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [15:28:36] RECOVERY - Host ms-be2041 is UP: PING OK - Packet loss = 0%, RTA = 33.37 ms [15:28:36] RECOVERY - Host ms-be2046 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [15:28:37] RECOVERY - Host cp2031 is UP: PING OK - Packet loss = 0%, RTA = 33.20 ms [15:28:37] RECOVERY - Host thanos-fe2002 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [15:28:38] RECOVERY - Host elastic2078 is UP: PING OK - Packet loss = 0%, RTA = 33.18 ms [15:28:38] RECOVERY - Host elastic2063 is UP: PING OK - Packet loss = 0%, RTA = 33.15 ms [15:28:45] rack B2 coming up [15:28:51] RECOVERY - Host elastic2042 is UP: PING OK - Packet loss = 0%, RTA = 33.17 ms [15:28:53] RECOVERY - Host ml-cache2002 is UP: PING OK - Packet loss = 0%, RTA = 33.24 ms [15:29:02] (KubernetesCalicoDown) firing: 
(14) kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:29:09] RECOVERY - Host db2160 is UP: PING WARNING - Packet loss = 33%, RTA = 33.14 ms [15:29:39] (03PS10) 10Vlad.shapik: Update Thumbor repository according to the latest changes [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/876229 (https://phabricator.wikimedia.org/T325811) [15:30:47] RECOVERY - Juniper virtual chassis ports on asw-b-codfw is OK: OK: UP: 28 https://wikitech.wikimedia.org/wiki/Network_monitoring%23VCP_status [15:30:54] (03CR) 10Andrew Bogott: update haproxy to new paws cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/880971 (https://phabricator.wikimedia.org/T326554) (owner: 10Vivian Rook) [15:31:13] RECOVERY - Host lvs2008 is UP: PING OK - Packet loss = 0%, RTA = 33.23 ms [15:31:16] (Wikidata Reliability Metrics - wbeditentity API: executeTiming alert) resolved: Wikidata Reliability Metrics - wbeditentity API: executeTiming alert - https://alerts.wikimedia.org/?q=alertname%3DWikidata+Reliability+Metrics+-+wbeditentity+API%3A+executeTiming+alert [15:31:29] RECOVERY - aqs endpoints health on aqs2006 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:31:35] PROBLEM - Host elastic2079 is DOWN: PING CRITICAL - Packet loss = 100% [15:31:39] (CirrusSearchJVMGCYoungPoolInsufficient) firing: (3) Elasticsearch instance elastic2057-production-search-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:31:46] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [15:31:48] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [15:31:53] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 33.21 ms [15:32:00] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [15:32:02] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [15:32:05] (KubernetesAPILatency) firing: (86) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:32:31] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-logging2001 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2001 [15:32:46] (JobUnavailable) firing: (8) Reduced availability for job calico-felix in k8s@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:33:02] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 8:00:00 on db1173.eqiad.wmnet with reason: Maintenance [15:33:04] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 
8:00:00 on db1173.eqiad.wmnet with reason: Maintenance [15:33:09] RECOVERY - Kafka Broker Under Replicated Partitions on kafka-logging2003 is OK: (C)10 ge (W)1 ge 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/d/000000027/kafka?orgId=1&viewPanel=29&var-datasource=codfw+prometheus/ops&var-kafka_cluster=logging-codfw&var-kafka_broker=kafka-logging2003 [15:33:15] RECOVERY - aqs endpoints health on aqs2007 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:34:33] RECOVERY - Etcd cluster health on kubetcd2006 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [15:34:34] !log ladsgroup@cumin1001 START - Cookbook sre.hosts.downtime for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [15:34:36] !log ladsgroup@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 6:00:00 on db1173.eqiad.wmnet with reason: Maintenance [15:34:41] RECOVERY - BFD status on cr1-codfw is OK: OK: UP: 22 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:34:55] RECOVERY - BGP status on cr1-codfw is OK: BGP OK - up: 102, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:34:57] RECOVERY - aqs endpoints health on aqs2008 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/aqs [15:35:16] PROBLEM - Host db2147 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:35:29] PROBLEM - Host gitlab-runner2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:35:35] PROBLEM - Host rdb2008 is DOWN: PING CRITICAL - Packet loss = 100% [15:35:42] <_joe_> uh new things down it seems [15:35:45] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 10%: Maint over', diff saved to https://phabricator.wikimedia.org/P43175 and previous config saved to /var/cache/conftool/dbconfig/20230117-153545-ladsgroup.json [15:36:04] PROBLEM - Host db2096 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:36:27] PROBLEM - Host irc2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:36:51] (KubernetesCalicoDown) firing: (2) ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:36:57] (CirrusSearchJVMGCYoungPoolInsufficient) resolved: (3) Elasticsearch instance elastic2057-production-search-codfw is showing memory pressure in the young pool - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchJVMGCYoungPoolInsufficient [15:37:02] (CirrusSearchNodeIndexingNotIncreasing) firing: (4) Elasticsearch instance elastic2041-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:37:14] (RdfStreamingUpdaterFlinkJobUnstable) firing: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:37:19] (WdqsStreamingUpdaterFlinkJobNotRunning) resolved: 
WDQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWdqsStreamingUpdaterFlinkJobNotRunning [15:37:24] (WcqsStreamingUpdaterFlinkJobNotRunning) resolved: WCQS_Streaming_Updater in codfw (k8s) is not running - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DWcqsStreamingUpdaterFlinkJobNotRunning [15:37:25] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [15:37:29] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:37:29] PROBLEM - Host logstash2024 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:33] (KubernetesAPILatency) firing: (85) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:37:34] PROBLEM - Host db2108 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:37:36] (ProbeDown) firing: (11) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:37:48] PROBLEM - Host db2162 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:37:52] (ProbeDown) firing: (8) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip6) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:37:53] PROBLEM - Host logstash2034 is DOWN: PING CRITICAL - Packet loss = 100% [15:37:55] PROBLEM - Host ganeti2031 is DOWN: PING CRITICAL - Packet loss = 100% [15:38:08] PROBLEM - Host db2161 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:38:08] (JobUnavailable) firing: (8) Reduced availability for job calico-felix in k8s@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:38:13] PROBLEM - Host serpens is DOWN: PING CRITICAL - Packet loss = 100% [15:38:15] RECOVERY - Host elastic2079 is UP: PING OK - Packet loss = 0%, RTA = 30.11 ms [15:38:16] RECOVERY - Host db2162 #page is UP: PING OK - Packet loss = 0%, RTA = 30.15 ms [15:38:18] RECOVERY - Host db2147 #page is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms [15:38:18] RECOVERY - Host gitlab-runner2002 is UP: PING OK - Packet loss = 0%, RTA = 30.19 ms [15:38:18] RECOVERY - Host rdb2008 is UP: PING OK - Packet loss = 0%, RTA = 30.23 ms [15:38:19] RECOVERY - Host db2161 #page is UP: PING OK - Packet loss = 0%, RTA = 30.21 ms [15:38:22] RECOVERY - Host db2096 #page is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms [15:38:23] RECOVERY - Host db2108 #page is UP: PING OK - Packet loss = 0%, RTA = 30.14 ms [15:38:23] RECOVERY - cassandra-a CQL 10.192.16.95:9042 on sessionstore2001 is OK: TCP OK - 0.030 second response time on 10.192.16.95 port 9042 https://phabricator.wikimedia.org/T93886 [15:38:23] RECOVERY - cassandra-c CQL 10.192.16.155:9042 on restbase2021 is OK: TCP OK - 0.030 second response time on 10.192.16.155 port 9042 
https://phabricator.wikimedia.org/T93886 [15:38:23] RECOVERY - cassandra-b CQL 10.192.16.187:9042 on aqs2007 is OK: TCP OK - 0.030 second response time on 10.192.16.187 port 9042 https://phabricator.wikimedia.org/T93886 [15:38:23] RECOVERY - cassandra-a CQL 10.192.16.186:9042 on aqs2007 is OK: TCP OK - 0.030 second response time on 10.192.16.186 port 9042 https://phabricator.wikimedia.org/T93886 [15:38:25] RECOVERY - Host logstash2024 is UP: PING OK - Packet loss = 0%, RTA = 30.41 ms [15:38:27] RECOVERY - Host logstash2034 is UP: PING OK - Packet loss = 0%, RTA = 30.15 ms [15:38:33] RECOVERY - Host serpens is UP: PING OK - Packet loss = 0%, RTA = 30.40 ms [15:38:35] RECOVERY - cassandra-b CQL 10.192.16.112:9042 on restbase2024 is OK: TCP OK - 0.030 second response time on 10.192.16.112 port 9042 https://phabricator.wikimedia.org/T93886 [15:38:35] RECOVERY - cassandra-b CQL 10.192.16.99:9042 on restbase2019 is OK: TCP OK - 0.030 second response time on 10.192.16.99 port 9042 https://phabricator.wikimedia.org/T93886 [15:38:39] RECOVERY - Wikidough DoT Check -IPv6- on doh2002 is OK: TCP OK - 0.065 second response time on 2620:0:860:2:208:80:153:38 port 853 https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [15:38:43] RECOVERY - cassandra-b CQL 10.192.16.189:9042 on aqs2008 is OK: TCP OK - 0.030 second response time on 10.192.16.189 port 9042 https://phabricator.wikimedia.org/T93886 [15:38:55] RECOVERY - cassandra-a CQL 10.192.16.111:9042 on restbase2024 is OK: TCP OK - 0.030 second response time on 10.192.16.111 port 9042 https://phabricator.wikimedia.org/T93886 [15:38:58] yeah, life is better with the faulty link disabled, but re-enabling it now [15:39:01] RECOVERY - Etcd cluster health on ml-staging-etcd2002 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [15:39:07] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:39:51] RECOVERY - Etcd cluster health on conf2004 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [15:39:51] RECOVERY - Etcd cluster health on ml-etcd2001 is OK: The etcd server is healthy https://wikitech.wikimedia.org/wiki/Etcd [15:39:57] (KubernetesCalicoDown) resolved: (14) kubernetes2005.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:40:01] PROBLEM - Host ml-staging-ctrl2001 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:07] PROBLEM - Host elastic2044 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:36] PROBLEM - Host es2021 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:40:39] PROBLEM - Host backup2005 is DOWN: PING CRITICAL - Packet loss = 100% [15:40:58] PROBLEM - Host db2108 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:41:06] PROBLEM - Host db2147 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:41:15] PROBLEM - Host ganeti2020 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:15] PROBLEM - Host logstash2036 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:20] PROBLEM - Host db2096 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:41:21] PROBLEM - Host gitlab-runner2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:23] PROBLEM - Host elastic2058 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:25] PROBLEM - Host logstash2024 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:26] (03Abandoned) 10Vivian Rook: update haproxy to new paws 
cluster [puppet] - 10https://gerrit.wikimedia.org/r/880971 (https://phabricator.wikimedia.org/T326554) (owner: 10Vivian Rook) [15:41:26] PROBLEM - Host pc2012 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:41:32] PROBLEM - Host db2161 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:41:33] PROBLEM - Host doh2002 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:37] PROBLEM - Host logstash2034 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:37] (03Abandoned) 10Vivian Rook: aptrepo: add thirdparty/kubeadm-k8s-1-2[34] [puppet] - 10https://gerrit.wikimedia.org/r/862994 (owner: 10Vivian Rook) [15:41:57] (KubernetesCalicoDown) resolved: (2) ml-serve2002.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:42:02] (KubernetesCalicoDown) resolved: ml-staging-ctrl2001.codfw.wmnet:9091 is not running calico-node Pod - https://wikitech.wikimedia.org/wiki/Calico#Operations - https://grafana.wikimedia.org/d/G8zPL7-Wz/?var-dc=codfw%20prometheus%2Fk8s-mlstaging&var-instance=ml-staging-ctrl2001.codfw.wmnet - https://alerts.wikimedia.org/?q=alertname%3DKubernetesCalicoDown [15:42:04] PROBLEM - Host db2162 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:42:05] PROBLEM - Host serpens is DOWN: PING CRITICAL - Packet loss = 100% [15:42:08] (CirrusSearchNodeIndexingNotIncreasing) firing: (4) Elasticsearch instance elastic2041-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [15:42:11] PROBLEM - Host rdb2008 is DOWN: PING CRITICAL - Packet loss = 100% [15:42:15] PROBLEM - Host elastic2079 is DOWN: PING CRITICAL - Packet loss = 100% [15:42:23] PROBLEM - Host ganeti2021 is DOWN: PING CRITICAL - Packet loss = 100% [15:42:27] (RdfStreamingUpdaterFlinkJobUnstable) resolved: (2) WCQS_Streaming_Updater in codfw (k8s) is unstable - https://wikitech.wikimedia.org/wiki/Wikidata_Query_Service/Streaming_Updater - https://grafana.wikimedia.org/d/gCFgfpG7k/flink-session-cluster - https://alerts.wikimedia.org/?q=alertname%3DRdfStreamingUpdaterFlinkJobUnstable [15:42:28] PROBLEM - Host db2177 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:42:30] PROBLEM - Host db2124 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:42:31] (KubernetesAPILatency) firing: (77) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:42:35] (ProbeDown) firing: (13) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:42:50] PROBLEM - Host db2178 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:42:52] PROBLEM - Host db2123 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:43:01] (03CR) 10Ottomata: Update analytics_text conf compatibility with airflow2.3.4 connect postgresql (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/878128 (https://phabricator.wikimedia.org/T315580) (owner: 10Stevemunene) [15:43:29] PROBLEM - restbase endpoints health on restbase2024 is CRITICAL: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received 
https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:43:33] PROBLEM - Host logstash2027 is DOWN: PING CRITICAL - Packet loss = 100% [15:43:39] PROBLEM - cassandra-a CQL 10.192.16.95:9042 on sessionstore2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [15:43:39] PROBLEM - cassandra-a CQL 10.192.16.190:9042 on ml-cache2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [15:43:39] PROBLEM - cassandra-c CQL 10.192.16.155:9042 on restbase2021 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [15:43:39] PROBLEM - cassandra-b CQL 10.192.16.187:9042 on aqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [15:43:39] PROBLEM - cassandra-a CQL 10.192.16.186:9042 on aqs2007 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [15:43:49] PROBLEM - restbase endpoints health on restbase2014 is CRITICAL: /en.wikipedia.org/v1/page/title/{title} (Get rev by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/html/{title} (Get html by title from storage) timed out before a response was received: /en.wikipedia.org/v1/page/summary/{title} (Get summary from storage) timed out before a response was received https://wikitech.wikimedia.org/wiki/Ser [15:43:49] nitoring/restbase [15:43:51] PROBLEM - cassandra-b CQL 10.192.16.112:9042 on restbase2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [15:43:51] PROBLEM - cassandra-b CQL 10.192.16.99:9042 on restbase2019 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [15:43:57] PROBLEM - cassandra-b CQL 10.192.16.189:9042 on aqs2008 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [15:44:02] PROBLEM - Host es2029 #page is DOWN: PING CRITICAL - Packet loss = 100% [15:44:05] PROBLEM - Host backup2008 is DOWN: PING CRITICAL - Packet loss = 100% [15:44:09] PROBLEM - cassandra-a CQL 10.192.16.111:9042 on restbase2024 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T93886 [15:44:11] (ProbeDown) firing: (8) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip6) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:44:11] PROBLEM - MariaDB Replica IO: x1 on db2115 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2096.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:44:11] PROBLEM - MariaDB Replica IO: x1 on db2131 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2096.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:44:15] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (bad URL) timed out before a response was received: /api (Zotero and citoid alive) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [15:44:17] PROBLEM 
- MariaDB Replica IO: es4 on es2022 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@es2021.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 100000 message: Cant connect to server on es2021.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:44:39] PROBLEM - MariaDB Replica IO: x1 on db2101 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db2096.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db2096.codfw.wmnet (110 Connection timed out) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:44:44] (JobUnavailable) firing: (9) Reduced availability for job calico-felix in k8s@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:44:47] RECOVERY - cassandra-a CQL 10.192.16.98:9042 on restbase2019 is OK: TCP OK - 7.168 second response time on 10.192.16.98 port 9042 https://phabricator.wikimedia.org/T93886 [15:44:47] RECOVERY - cassandra-c CQL 10.192.16.87:9042 on restbase2014 is OK: TCP OK - 7.172 second response time on 10.192.16.87 port 9042 https://phabricator.wikimedia.org/T93886 [15:44:47] RECOVERY - cassandra-b CQL 10.192.16.154:9042 on restbase2021 is OK: TCP OK - 3.053 second response time on 10.192.16.154 port 9042 https://phabricator.wikimedia.org/T93886 [15:44:48] RECOVERY - Host db2178 #page is UP: PING WARNING - Packet loss = 50%, RTA = 30.17 ms [15:44:49] RECOVERY - cassandra-b CQL 10.192.16.86:9042 on restbase2014 is OK: TCP OK - 7.184 second response time on 10.192.16.86 port 9042 https://phabricator.wikimedia.org/T93886 [15:44:49] RECOVERY - cassandra-c CQL 10.192.16.100:9042 on restbase2019 is OK: TCP OK - 7.178 second response time on 10.192.16.100 port 9042 https://phabricator.wikimedia.org/T93886 [15:44:49] RECOVERY - Host irc2001 is UP: PING WARNING - Packet loss = 33%, RTA = 30.74 ms [15:44:49] RECOVERY - Wikidough DoH Check -IPv6- on doh2002 is OK: HTTP OK: HTTP/1.1 200 OK - 550 bytes in 7.274 second response time https://wikitech.wikimedia.org/wiki/Wikidough/Monitoring%23Wikidough_Basic_Check [15:44:49] RECOVERY - Host backup2008 is UP: PING OK - Packet loss = 0%, RTA = 30.26 ms [15:44:49] RECOVERY - Host ganeti2031 is UP: PING OK - Packet loss = 0%, RTA = 30.21 ms [15:44:50] RECOVERY - Host db2123 #page is UP: PING OK - Packet loss = 0%, RTA = 30.17 ms [15:44:51] RECOVERY - Host es2021 #page is UP: PING OK - Packet loss = 0%, RTA = 30.23 ms [15:44:52] RECOVERY - Host db2177 #page is UP: PING OK - Packet loss = 0%, RTA = 30.15 ms [15:44:52] RECOVERY - Host db2147 #page is UP: PING OK - Packet loss = 0%, RTA = 30.21 ms [15:44:53] (LogstashNoLogsIndexed) firing: Logstash logs are not being indexed by Elasticsearch - https://wikitech.wikimedia.org/wiki/Logstash#No_logs_indexed - https://grafana.wikimedia.org/d/000000561/logstash?var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashNoLogsIndexed [15:44:53] RECOVERY - Host db2108 #page is UP: PING OK - Packet loss = 0%, RTA = 30.19 ms [15:44:53] RECOVERY - Host logstash2027 is UP: PING OK - Packet loss = 0%, RTA = 30.23 ms [15:44:53] RECOVERY - Host doh2002 is UP: PING OK - Packet loss = 0%, RTA = 30.48 ms [15:44:54] RECOVERY - Host elastic2058 is UP: 
PING OK - Packet loss = 0%, RTA = 30.14 ms [15:44:54] RECOVERY - Host es2029 #page is UP: PING OK - Packet loss = 0%, RTA = 30.20 ms [15:44:54] RECOVERY - Host backup2005 is UP: PING OK - Packet loss = 0%, RTA = 30.18 ms [15:44:55] RECOVERY - Host ml-staging-ctrl2001 is UP: PING OK - Packet loss = 0%, RTA = 30.37 ms [15:44:55] RECOVERY - Host pc2012 #page is UP: PING OK - Packet loss = 0%, RTA = 30.19 ms [15:44:56] RECOVERY - Host db2096 #page is UP: PING OK - Packet loss = 0%, RTA = 30.17 ms [15:44:56] RECOVERY - Host logstash2036 is UP: PING OK - Packet loss = 0%, RTA = 30.15 ms [15:44:57] RECOVERY - Host elastic2044 is UP: PING OK - Packet loss = 0%, RTA = 30.19 ms [15:44:59] RECOVERY - Host elastic2079 is UP: PING OK - Packet loss = 0%, RTA = 30.14 ms [15:45:00] RECOVERY - Host db2161 #page is UP: PING OK - Packet loss = 0%, RTA = 30.23 ms [15:45:00] RECOVERY - Host db2124 #page is UP: PING OK - Packet loss = 0%, RTA = 30.24 ms [15:45:01] RECOVERY - cassandra-a CQL 10.192.16.188:9042 on aqs2008 is OK: TCP OK - 0.030 second response time on 10.192.16.188 port 9042 https://phabricator.wikimedia.org/T93886 [15:45:03] RECOVERY - Host logstash2034 is UP: PING OK - Packet loss = 0%, RTA = 30.11 ms [15:45:05] RECOVERY - cassandra-b CQL 10.192.16.179:9042 on aqs2005 is OK: TCP OK - 0.030 second response time on 10.192.16.179 port 9042 https://phabricator.wikimedia.org/T93886 [15:45:05] RECOVERY - Host rdb2008 is UP: PING OK - Packet loss = 0%, RTA = 30.16 ms [15:45:07] RECOVERY - Host ganeti2020 is UP: PING OK - Packet loss = 0%, RTA = 30.22 ms [15:45:09] RECOVERY - Host ganeti2021 is UP: PING OK - Packet loss = 0%, RTA = 30.89 ms [15:45:10] <_joe_> XioNoX: did you just kill b7? [15:45:12] RECOVERY - Host db2162 #page is UP: PING OK - Packet loss = 0%, RTA = 30.15 ms [15:45:12] RECOVERY - restbase endpoints health on restbase2024 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:45:15] RECOVERY - Host gitlab-runner2002 is UP: PING OK - Packet loss = 0%, RTA = 30.49 ms [15:45:19] RECOVERY - Host serpens is UP: PING OK - Packet loss = 0%, RTA = 30.44 ms [15:45:21] RECOVERY - cassandra-a CQL 10.192.16.95:9042 on sessionstore2001 is OK: TCP OK - 0.030 second response time on 10.192.16.95 port 9042 https://phabricator.wikimedia.org/T93886 [15:45:21] RECOVERY - cassandra-a CQL 10.192.16.190:9042 on ml-cache2002 is OK: TCP OK - 0.030 second response time on 10.192.16.190 port 9042 https://phabricator.wikimedia.org/T93886 [15:45:21] RECOVERY - cassandra-c CQL 10.192.16.155:9042 on restbase2021 is OK: TCP OK - 0.030 second response time on 10.192.16.155 port 9042 https://phabricator.wikimedia.org/T93886 [15:45:21] RECOVERY - cassandra-b CQL 10.192.16.187:9042 on aqs2007 is OK: TCP OK - 0.030 second response time on 10.192.16.187 port 9042 https://phabricator.wikimedia.org/T93886 [15:45:21] RECOVERY - cassandra-a CQL 10.192.16.186:9042 on aqs2007 is OK: TCP OK - 0.030 second response time on 10.192.16.186 port 9042 https://phabricator.wikimedia.org/T93886 [15:45:21] RECOVERY - Host logstash2024 is UP: PING OK - Packet loss = 0%, RTA = 30.31 ms [15:45:31] RECOVERY - cassandra-b CQL 10.192.16.112:9042 on restbase2024 is OK: TCP OK - 0.030 second response time on 10.192.16.112 port 9042 https://phabricator.wikimedia.org/T93886 [15:45:31] RECOVERY - cassandra-b CQL 10.192.16.99:9042 on restbase2019 is OK: TCP OK - 0.030 second response time on 10.192.16.99 port 9042 https://phabricator.wikimedia.org/T93886 [15:45:31] RECOVERY - restbase endpoints health on 
restbase2014 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [15:45:39] RECOVERY - cassandra-b CQL 10.192.16.189:9042 on aqs2008 is OK: TCP OK - 0.030 second response time on 10.192.16.189 port 9042 https://phabricator.wikimedia.org/T93886 [15:45:49] RECOVERY - Check unit status of netbox_ganeti_codfw_sync on netbox1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [15:45:49] RECOVERY - cassandra-a CQL 10.192.16.111:9042 on restbase2024 is OK: TCP OK - 0.030 second response time on 10.192.16.111 port 9042 https://phabricator.wikimedia.org/T93886 [15:45:57] RECOVERY - cassandra-c CQL 10.192.16.113:9042 on restbase2024 is OK: TCP OK - 0.030 second response time on 10.192.16.113 port 9042 https://phabricator.wikimedia.org/T93886 [15:45:57] RECOVERY - cassandra-a CQL 10.192.16.85:9042 on restbase2014 is OK: TCP OK - 0.030 second response time on 10.192.16.85 port 9042 https://phabricator.wikimedia.org/T93886 [15:46:01] RECOVERY - MariaDB Replica IO: x1 on db2131 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:46:01] RECOVERY - MariaDB Replica IO: x1 on db2115 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:46:01] RECOVERY - cassandra-a CQL 10.192.16.183:9042 on aqs2006 is OK: TCP OK - 0.030 second response time on 10.192.16.183 port 9042 https://phabricator.wikimedia.org/T93886 [15:46:01] RECOVERY - cassandra-a CQL 10.192.16.174:9042 on aqs2005 is OK: TCP OK - 0.030 second response time on 10.192.16.174 port 9042 https://phabricator.wikimedia.org/T93886 [15:46:03] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [15:46:05] RECOVERY - MariaDB Replica IO: es4 on es2022 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:46:07] RECOVERY - cassandra-b CQL 10.192.16.185:9042 on aqs2006 is OK: TCP OK - 0.030 second response time on 10.192.16.185 port 9042 https://phabricator.wikimedia.org/T93886 [15:46:07] RECOVERY - cassandra-a CQL 10.192.16.153:9042 on restbase2021 is OK: TCP OK - 0.030 second response time on 10.192.16.153 port 9042 https://phabricator.wikimedia.org/T93886 [15:46:25] RECOVERY - BGP status on cr2-codfw is OK: BGP OK - up: 169, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [15:46:27] RECOVERY - MariaDB Replica IO: x1 on db2101 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:46:35] RECOVERY - BFD status on cr2-codfw is OK: OK: UP: 20 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status [15:47:57] RECOVERY - configured eth on lvs2009 is OK: OK - interfaces up https://wikitech.wikimedia.org/wiki/Monitoring/check_eth [15:48:30] (03CR) 10Ssingh: [C: 03+2] Release 0.44.0+ds1-1 [debs/cadvisor] - 10https://gerrit.wikimedia.org/r/880530 (https://phabricator.wikimedia.org/T325557) (owner: 10Ssingh) [15:50:51] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 25%: Maint over', diff saved to https://phabricator.wikimedia.org/P43177 and previous config saved to /var/cache/conftool/dbconfig/20230117-155050-ladsgroup.json [15:53:10] 10SRE, 10SRE-OnFire, 
10Release-Engineering-Team, 10serviceops-collab, 10Sustainability: Remove old scap repositories from deploy1002 - https://phabricator.wikimedia.org/T309162 (10LSobanski) p:05Medium→03Low [15:53:21] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:53:25] (KubernetesAPILatency) resolved: (74) High Kubernetes API latency (DELETE apiservices) on k8s@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:53:29] (ProbeDown) resolved: (13) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip6) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:54:31] (ProbeDown) resolved: (8) Service commons.wikimedia.org:443 has failed probes (http_commons_wikimedia_org_ip6) #page - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:54:39] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Data for Ollie_Shotton - https://phabricator.wikimedia.org/T327187 (10Ollie.Shotton_WMDE) [15:54:53] (JobUnavailable) firing: (8) Reduced availability for job calico-felix in k8s@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [15:55:02] (LogstashNoLogsIndexed) resolved: Logstash logs are not being indexed by Elasticsearch - https://wikitech.wikimedia.org/wiki/Logstash#No_logs_indexed - https://grafana.wikimedia.org/d/000000561/logstash?var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashNoLogsIndexed [16:00:43] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [16:01:23] (CirrusSearchNodeIndexingNotIncreasing) resolved: (3) Elasticsearch instance elastic2041-production-search-codfw is not indexing - https://wikitech.wikimedia.org/wiki/Search#Indexing_hung_and_not_making_progress - https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&from=now-3d&to=now&viewPanel=57 - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchNodeIndexingNotIncreasing [16:01:27] (RoutinatorRsyncErrors) resolved: Routinator rsync fetching issue in codfw - https://wikitech.wikimedia.org/wiki/RPKI#RSYNC_status - https://grafana.wikimedia.org/d/UwUa77GZk/rpki - https://alerts.wikimedia.org/?q=alertname%3DRoutinatorRsyncErrors [16:05:56] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 75%: Maint over', diff saved to https://phabricator.wikimedia.org/P43178 and previous config saved to /var/cache/conftool/dbconfig/20230117-160555-ladsgroup.json [16:06:02] (03CR) 10JHathaway: [C: 03+1] "Looks good, just one question" [puppet] - 10https://gerrit.wikimedia.org/r/868703 (https://phabricator.wikimedia.org/T308013) (owner: 10Muehlenhoff) [16:07:11] PROBLEM - Host furud is DOWN: PING CRITICAL - Packet loss = 100% [16:07:11] PROBLEM - Host ganeti2032 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:20] 10SRE, 10Infrastructure-Foundations, 
10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Move WMCS servers to 1 NIC - https://phabricator.wikimedia.org/T319184 (10aborrero) 05In progress→03Stalled a:05aborrero→03None [16:07:33] PROBLEM - Host elastic2044 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:33] PROBLEM - Host ms-be2047 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:39] PROBLEM - Host elastic2079 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:45] PROBLEM - Host elastic2080 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:49] PROBLEM - Host elastic2043 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:49] PROBLEM - Host logstash2036 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:51] PROBLEM - Host thanos-be2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:07:53] PROBLEM - Host mc2046 is DOWN: PING CRITICAL - Packet loss = 100% [16:09:03] (ProbeDown) firing: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:09:09] RECOVERY - Host elastic2079 is UP: PING OK - Packet loss = 0%, RTA = 30.14 ms [16:09:11] RECOVERY - Host thanos-be2002 is UP: PING OK - Packet loss = 0%, RTA = 30.09 ms [16:09:13] RECOVERY - Host elastic2080 is UP: PING OK - Packet loss = 0%, RTA = 30.13 ms [16:09:13] RECOVERY - Host ms-be2047 is UP: PING OK - Packet loss = 0%, RTA = 30.03 ms [16:09:15] RECOVERY - Host elastic2044 is UP: PING OK - Packet loss = 0%, RTA = 30.16 ms [16:09:21] RECOVERY - Host ganeti2032 is UP: PING OK - Packet loss = 0%, RTA = 30.21 ms [16:09:22] RECOVERY - Host furud is UP: PING OK - Packet loss = 0%, RTA = 30.19 ms [16:09:23] RECOVERY - Host elastic2043 is UP: PING OK - Packet loss = 0%, RTA = 30.17 ms [16:09:31] RECOVERY - Host mc2046 is UP: PING OK - Packet loss = 0%, RTA = 30.14 ms [16:09:41] RECOVERY - Host logstash2036 is UP: PING OK - Packet loss = 0%, RTA = 30.13 ms [16:10:35] (03PS5) 10Btullis: Rename ceph profiles to cloudceph [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) [16:10:49] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 503 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [16:11:05] (03CR) 10Hnowlan: [C: 03+2] Update Thumbor repository according to the latest changes [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/876229 (https://phabricator.wikimedia.org/T325811) (owner: 10Vlad.shapik) [16:12:25] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [16:14:03] (ProbeDown) resolved: (2) Service centrallog1001:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1001:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [16:14:56] 10SRE-OnFire, 10Gerrit, 10serviceops-collab, 10Release-Engineering-Team (Holiday Leftovers 🥡), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10LSobanski) a:05LSobanski→03None [16:15:06] 10SRE, 10Traffic, 10Patch-For-Review, 10Upstream: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 (10ssingh) Thanks to 
Faidon's suggestion of building against 0.44.0 and not 0.46.0, we have a working cadvisor 0.44.0 build for bullseye/sid, which has been merged above. [16:15:15] (03CR) 10Btullis: [V: 03+1] "PCC SUCCESS (DIFF 9): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/39155/console" [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) (owner: 10Btullis) [16:15:44] (03Merged) 10jenkins-bot: Update Thumbor repository according to the latest changes [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/876229 (https://phabricator.wikimedia.org/T325811) (owner: 10Vlad.shapik) [16:15:49] 10SRE-OnFire, 10Gerrit, 10serviceops-collab, 10Release-Engineering-Team (Holiday Leftovers 🥡), and 2 others: gerrit1001 running out of space on / - https://phabricator.wikimedia.org/T323262 (10LSobanski) p:05High→03Medium [16:18:10] (03PS6) 10Btullis: Rename ceph profiles to cloudceph [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) [16:19:08] (03PS4) 10Elukey: kserve: upgrade to upstream version 0.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/880897 (https://phabricator.wikimedia.org/T325528) [16:20:43] (03PS7) 10Btullis: Rename ceph profiles to cloudceph [puppet] - 10https://gerrit.wikimedia.org/r/880939 (https://phabricator.wikimedia.org/T326945) [16:20:43] RECOVERY - MegaRAID on an-worker1086 is OK: OK: optimal, 13 logical, 14 physical, WriteBack policy https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:21:01] !log ladsgroup@cumin1001 dbctl commit (dc=all): 'db1173 (re)pooling @ 100%: Maint over', diff saved to https://phabricator.wikimedia.org/P43179 and previous config saved to /var/cache/conftool/dbconfig/20230117-162100-ladsgroup.json [16:30:11] (03Abandoned) 10BBlack: Depool all services in codfw (dnsdisc) [dns] - 10https://gerrit.wikimedia.org/r/880956 (owner: 10BBlack) [16:30:23] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service,swift_dispersion_stats_lowlatency.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:32:31] !log reprepro --ignore=wrongdistribution -C main include bullseye-wikimedia cadvisor_0.44.0+ds1-1_amd64.changes: T325557 [16:32:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:37] T325557: Review cp2041 and cp2042 running bullseye - https://phabricator.wikimedia.org/T325557 [16:40:49] ACKNOWLEDGEMENT - MegaRAID on an-worker1086 is CRITICAL: CRITICAL: 13 LD(s) must have write cache policy WriteBack, currently using: WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough Btullis Another BBU failure - I will add it to: T326127 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [16:42:27] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller batteries for an-worker1080, an-worker1084, an-worker1086 - https://phabricator.wikimedia.org/T326127 (10BTullis) [16:43:05] RECOVERY - Check systemd state on thanos-fe1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:44:11] !log btullis@cumin1001 START - Cookbook sre.hosts.downtime for 30 days, 0:00:00 on an-worker1086.eqiad.wmnet with reason: Shutting down for RAID controller BBU replacement [16:44:25] !log btullis@cumin1001 END (PASS) - 
Cookbook sre.hosts.downtime (exit_code=0) for 30 days, 0:00:00 on an-worker1086.eqiad.wmnet with reason: Shutting down for RAID controller BBU replacement [16:44:30] 10SRE, 10ops-eqiad, 10DC-Ops: hw troubleshooting: Replace RAID controller batteries for an-worker1080, an-worker1084, an-worker1086 - https://phabricator.wikimedia.org/T326127 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=be71f7e8-7930-4f36-95cf-c38a96add158) set by btullis@cumin1001 fo... [16:45:35] 10SRE, 10ops-eqiad, 10Data-Engineering: Check BBU on an-worker1080, an-worker1084, and an-worker1086 - https://phabricator.wikimedia.org/T325984 (10BTullis) [16:45:55] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10LSobanski) a:03Dzahn [16:46:21] 10SRE, 10serviceops-collab, 10Patch-For-Review: rsync server on people2002 - https://phabricator.wikimedia.org/T326888 (10LSobanski) a:03Dzahn [16:54:12] (03PS5) 10Elukey: kserve: upgrade to upstream version 0.9 [deployment-charts] - 10https://gerrit.wikimedia.org/r/880897 (https://phabricator.wikimedia.org/T325528) [16:59:15] i hear that the outage that caused the previous backport window to be cancelled is almost resolved. however, now we have 10 patches to be backported in the next window (after i rescheduled mine): https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230117T2100 [16:59:19] i was wondering whether anyone would be willing to start it early, or to stay longer to finish it? [16:59:47] !log pooling back depooled mw servers in codfw [16:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:00:04] jbond and rzl: gettimeofday() says it's time for Puppet request window. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230117T1700) [17:00:04] No Gerrit patches in the queue for this window AFAICS. 
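The ladsgroup@cumin1001 entries above show db1173 being repooled in stages after maintenance — 10%, 25%, 75%, then 100% — with a dbctl config commit logged at each step. A rough sketch of that staged-repool loop is below; in production this is driven by the dbctl/cookbook tooling on the cumin hosts, and the exact dbctl arguments and the wait interval shown here are assumptions.

```python
# Rough sketch of the staged repool pattern visible in the dbctl entries above
# (db1173 at 10% -> 25% -> 75% -> 100%). The dbctl subcommand syntax in the
# subprocess calls is an assumption; the sleep interval is illustrative only.
import subprocess
import time

def repool_in_stages(instance: str, stages=(10, 25, 75, 100), wait_s: int = 900) -> None:
    """Gradually raise an instance's pooled percentage, committing after each step."""
    for pct in stages:
        # Assumed dbctl invocation; check the real dbctl help for the actual syntax.
        subprocess.run(["dbctl", "instance", instance, "pool", "-p", str(pct)], check=True)
        subprocess.run(
            ["dbctl", "config", "commit",
             "-m", f"{instance} (re)pooling @ {pct}%: Maint over"],
            check=True,
        )
        if pct != stages[-1]:
            time.sleep(wait_s)  # let traffic and replication settle before the next step

# Example (hypothetical): repool_in_stages("db1173")
```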
[17:01:59] !log restarting confd on deploy1002 [17:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:21] RECOVERY - mediawiki-installation DSH group on parse2007 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:04:14] MatmaRex: I'd suspect we can do some of them out-of-window once we can deploy again, which should be soonish [17:04:25] (03CR) 10Andrew Bogott: [C: 03+2] Horizon: put into maintenance mode for Zed upgrade [puppet] - 10https://gerrit.wikimedia.org/r/880564 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [17:04:59] RECOVERY - mediawiki-installation DSH group on mw2324 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:04:59] RECOVERY - mediawiki-installation DSH group on mw2311 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:05:17] RECOVERY - mediawiki-installation DSH group on mw2265 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:05:17] RECOVERY - mediawiki-installation DSH group on mw2266 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:05:17] RECOVERY - mediawiki-installation DSH group on mw2264 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:08:14] (03CR) 10Andrew Bogott: [C: 03+2] Move eqiad1 OpenStack control plane to version 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/880565 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [17:08:18] (ProbeDown) firing: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#labweb-ssl:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:08:42] new page [17:08:47] ACKed [17:09:07] RECOVERY - mediawiki-installation DSH group on mw2268 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:09:08] is someone looking at this? [17:09:18] (ProbeDown) firing: Service labweb-ssl:7443 has failed probes (http_labweb-ssl_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#labweb-ssl:7443 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [17:09:19] rather working [17:10:45] RECOVERY - mediawiki-installation DSH group on mw2260 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:12:48] andrewbogott: ^ [17:13:16] the labweb thing is me upgrading things. [17:13:28] I will ack if someone tells me how :) [17:13:37] andrewbogott: already ACKed but was checking :) [17:14:00] is that an icinga thing or an alertmanager thing? If the latter I can automate the downtime [17:14:33] RECOVERY - mediawiki-installation DSH group on mw2267 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:14:48] andrewbogott: alertmanager [17:15:21] RECOVERY - mediawiki-installation DSH group on mw2269 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:15:55] RECOVERY - mediawiki-installation DSH group on mw2317 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:16:13] sukhe: cook, I'll make a note for next time. 
thx [17:16:29] np, thanks [17:17:17] RECOVERY - mediawiki-installation DSH group on parse2006 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:17:30] !log removing errant 2620:0:860:118: IPs from primary interfaces of hosts in B2 [17:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:18:25] RECOVERY - mediawiki-installation DSH group on mw2321 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:19:30] !log pooling back codfw services [17:19:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:19:33] RECOVERY - mediawiki-installation DSH group on mw2270 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:21:07] RECOVERY - mediawiki-installation DSH group on mw2322 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:24:37] RECOVERY - mediawiki-installation DSH group on mw2312 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:24:46] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Engineering-Planning, 10Shared-Data-Infrastructure: Q1:rack/setup/install druid10[09-11] - https://phabricator.wikimedia.org/T314335 (10Papaul) @BTullis any update on this? [17:25:33] RECOVERY - mediawiki-installation DSH group on mw2318 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:25:33] RECOVERY - mediawiki-installation DSH group on mw2320 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:25:33] RECOVERY - mediawiki-installation DSH group on mw2319 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:23] RECOVERY - mediawiki-installation DSH group on mw2259 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:23] RECOVERY - mediawiki-installation DSH group on mw2262 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:23] RECOVERY - mediawiki-installation DSH group on mw2261 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:23] RECOVERY - mediawiki-installation DSH group on mw2263 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:23] RECOVERY - mediawiki-installation DSH group on mw2310 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:23] RECOVERY - mediawiki-installation DSH group on mw2313 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:23] RECOVERY - mediawiki-installation DSH group on mw2314 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:24] RECOVERY - mediawiki-installation DSH group on mw2315 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:24] RECOVERY - mediawiki-installation DSH group on mw2316 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:25] RECOVERY - mediawiki-installation DSH group on mw2323 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:25] RECOVERY - mediawiki-installation DSH group on mw2325 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:26] RECOVERY - mediawiki-installation DSH group on mw2326 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:26] RECOVERY - mediawiki-installation DSH group on mw2327 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:27] RECOVERY - mediawiki-installation DSH group on mw2328 is OK: OK 
https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:27] RECOVERY - mediawiki-installation DSH group on mw2329 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:28] RECOVERY - mediawiki-installation DSH group on mw2330 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:28] RECOVERY - mediawiki-installation DSH group on mw2331 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:29] !log jiji@cumin1001 conftool action : set/pooled=true; selector: dnsdisc=aqs,name=codfw [17:27:29] RECOVERY - mediawiki-installation DSH group on mw2332 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:29] RECOVERY - mediawiki-installation DSH group on mw2333 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:27:30] RECOVERY - mediawiki-installation DSH group on mw2334 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:31:05] (ConfdResourceFailed) firing: (4) confd resource _var_lib_gdnsd_discovery-k8s-ingress-wikikube-rw.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:31:09] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Data for Ollie_Shotton - https://phabricator.wikimedia.org/T327187 (10Ollie.Shotton_WMDE) [17:32:03] ^ MatmaRex [17:33:54] thanks [17:36:05] (ConfdResourceFailed) firing: (6) confd resource _var_lib_gdnsd_discovery-k8s-ingress-wikikube-rw.state.toml has errors - https://wikitech.wikimedia.org/wiki/Confd#Monitoring - https://grafana.wikimedia.org/d/OUJF1VI4k/confd - https://alerts.wikimedia.org/?q=alertname%3DConfdResourceFailed [17:36:21] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Data for Ollie_Shotton - https://phabricator.wikimedia.org/T327187 (10Ottomata) Approved. This will need analytics-privatedata-users group, ssh and kerberos access. [17:38:05] RECOVERY - mediawiki-installation DSH group on parse2009 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:38:12] so, if anyone is available to do an unscheduled backport deployment (of the patches that were in the cancelled window), i am also around and would be very thankful [17:42:57] RECOVERY - mediawiki-installation DSH group on parse2010 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:46:05] 10SRE, 10Traffic-Icebox, 10SecTeam-Processed: Consider removing X-Wikimedia-Security-Audit VCL support - https://phabricator.wikimedia.org/T229320 (10BCornwall) 05Open→03Resolved a:03BCornwall This seems to have been resolved. `git grep -i X-Wikimedia-Security-Audit` in the puppet repo returns nothing... 
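Earlier, when the labweb-ssl:7443 ProbeDown paged during the Horizon upgrade, the ack went through Alertmanager rather than Icinga ("If the latter I can automate the downtime"). One way such a downtime could be automated is by posting a silence to the Alertmanager v2 API, as in this hedged sketch; the Alertmanager URL is a placeholder, and in practice the silence would normally be created through alerts.wikimedia.org or existing tooling.

```python
# Minimal sketch of pre-silencing an alert like the labweb-ssl ProbeDown page above
# via the Alertmanager v2 API. The Alertmanager URL below is a placeholder.
from datetime import datetime, timedelta, timezone
import requests

ALERTMANAGER = "http://alertmanager.example.org:9093"  # placeholder URL

def silence(alertname: str, instance: str, hours: float, author: str, comment: str) -> str:
    """Create a silence matching alertname+instance and return its ID."""
    now = datetime.now(timezone.utc)
    payload = {
        "matchers": [
            {"name": "alertname", "value": alertname, "isRegex": False},
            {"name": "instance", "value": instance, "isRegex": False},
        ],
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(hours=hours)).isoformat(),
        "createdBy": author,
        "comment": comment,
    }
    resp = requests.post(f"{ALERTMANAGER}/api/v2/silences", json=payload, timeout=10)
    resp.raise_for_status()
    return resp.json()["silenceID"]

# Example (hypothetical values):
# silence("ProbeDown", "labweb-ssl:7443", 2, "andrewbogott", "Horizon Zed upgrade")
```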
[17:46:10] (03PS1) 10Jdrewniak: Table of contents Collapse/Expand not working [skins/Vector] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/880913 (https://phabricator.wikimedia.org/T327064) [17:46:14] 10SRE, 10Traffic-Icebox, 10SecTeam-Processed: Consider removing X-Wikimedia-Security-Audit VCL support - https://phabricator.wikimedia.org/T229320 (10BCornwall) a:05BCornwall→03None [17:52:03] 10SRE, 10Traffic-Icebox: HTTPS/Browser Recommendations page on Wikitech is outdated - https://phabricator.wikimedia.org/T240813 (10BCornwall) a:03BCornwall [17:54:24] !log authdns1001: restart confd [17:54:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:17] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 112 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [17:55:50] (03CR) 10Dzahn: vrts: add vrts2001 hieradata and database port (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/880488 (https://phabricator.wikimedia.org/T323515) (owner: 10AOkoth) [17:56:57] (03CR) 10Dzahn: [C: 03+2] "thanks! confirmed the directory is empty on both servers" [puppet] - 10https://gerrit.wikimedia.org/r/880963 (owner: 10Hashar) [17:57:17] RECOVERY - mediawiki-installation DSH group on parse2008 is OK: OK https://wikitech.wikimedia.org/wiki/Monitoring/check_dsh_groups [17:58:45] !log restarted es5 codfw backup [17:58:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:05] Deploy window MediaWiki infrastucture (UTC late) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230117T1800) [18:00:22] 10SRE, 10ops-codfw, 10DC-Ops: Decommission mc20[19-27] and mc20[29-37] - https://phabricator.wikimedia.org/T313733 (10Jhancock.wm) [18:01:37] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: (C)100 gt (W)50 gt 4 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:01:44] !log cgoubert@cumin1001 conftool action : set/pooled=false; selector: dnsdisc=k8s-ingress-wikikube-rw,name=codfw [18:02:44] 10SRE, 10ops-codfw, 10DC-Ops: Decommission mc20[19-27] and mc20[29-37] - https://phabricator.wikimedia.org/T313733 (10Jhancock.wm) @Papaul I've finished the onsite items. SSDs have been removed, servers have been unracked. Servers have been moved to the storage cage and will work on removing side rails and s... [18:05:13] MatmaRex, [18:05:31] PROBLEM - Check systemd state on thanos-fe1001 is CRITICAL: CRITICAL - degraded: The following units failed: swift_dispersion_stats.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:05:46] !log jgiannelos@deploy1002 helmfile [codfw] START helmfile.d/services/proton: apply [18:06:05] (03PS2) 10Zabe: objectcache: Fix DI for MultiWriteBagOStuff sub caches [core] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880908 (https://phabricator.wikimedia.org/T327158) [18:06:24] (03CR) 10Zabe: [C: 03+2] objectcache: Fix DI for MultiWriteBagOStuff sub caches [core] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880908 (https://phabricator.wikimedia.org/T327158) (owner: 10Zabe) [18:07:15] MatmaRex, is it okay to get the 3 config patches out together? 
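On the question of taking the three config patches out together: scap backport accepts several change numbers in one invocation, so they get merged and synced in a single pass. A rough sketch, with placeholder change numbers since the actual patches are not identified above:

    # Sketch only: batch multiple mediawiki-config backports in one scap run.
    # 111111/222222/333333 are placeholders, not the real change numbers.
    # Run on the deployment host (deploy1002).
    scap backport 111111 222222 333333
    # scap pauses once the changes are synced to the mwdebug hosts so they can
    # be verified there before the fleet-wide sync continues.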
[18:07:26] !log jgiannelos@deploy1002 helmfile [codfw] DONE helmfile.d/services/proton: apply [18:07:42] !log jgiannelos@deploy1002 helmfile [eqiad] START helmfile.d/services/proton: apply [18:10:12] zabe: yes. one sec [18:10:24] !log gerrit1002/gerrit2002: sudo rmdir /srv/gerrit/jvmlogs [18:10:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:32] all config changes look good [18:29:38] !log otto@deploy1002 Started deploy [airflow-dags/analytics@8d0e919]: Regular analytics weekly train @8d0e919] [18:29:49] the wmf.18 backport also seems good, although i think i found some unrelated problem [18:29:54] !log otto@deploy1002 Finished deploy [airflow-dags/analytics@8d0e919]: Regular analytics weekly train @8d0e919] (duration: 00m 15s) [18:29:57] !log otto@deploy1002 Finished deploy [analytics/refinery@55f90ac]: Regular analytics weekly train [analytics/refinery@55f90ac] (duration: 04m 28s) [18:30:02] i get an exception on https://test.wikipedia.org/wiki/Wikipedia_talk:Twinkle when NOT on mwdebug servers [18:30:25] oh wait that's not .18 [18:30:37] that's https://phabricator.wikimedia.org/T327158 [18:31:01] syncing then [18:31:41] wmf.18 backport also looks good. i was testing in the wrong place. everything looks good :) [18:31:44] (03PS1) 10Sbailey: Enable Linter write namespace tag and template using core config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880989 (https://phabricator.wikimedia.org/T299612) [18:31:46] thanks [18:32:29] (03CR) 10Ottomata: flink-kubernetes-operator - allow flink-app pods to talk to k8s API (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/879618 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [18:32:32] (03CR) 10Ottomata: [C: 03+2] flink-kubernetes-operator - allow flink-app pods to talk to k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/879618 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [18:33:58] (03PS2) 10Sbailey: Enable Linter write namespace tag and template using core config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880989 (https://phabricator.wikimedia.org/T299612) [18:34:52] Assuming y'all already know but https://www.irccloud.com/pastebin/o19pFBYs/ [18:34:55] PROBLEM - MediaWiki exceptions and fatals per minute for appserver on alert1001 is CRITICAL: 2.651e+04 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:35:14] (03CR) 10Sbailey: "The wmf-config/InitializeSettings.php file to be used for the backport window deploy." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880989 (https://phabricator.wikimedia.org/T299612) (owner: 10Sbailey) [18:35:37] we had a spike of PHP Notice: Undefined index: parse [18:35:42] !log zabe@deploy1002 backport aborted: (duration: 19m 41s) [18:36:24] but seems to have been temporary [18:36:31] RECOVERY - MediaWiki exceptions and fatals per minute for appserver on alert1001 is OK: (C)100 gt (W)50 gt 7 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [18:36:50] enwiki is down for me. 
See Tamzin's error [18:36:50] mediawiki.org is still down right now [18:37:19] ConfigException: GlobalVarConfig::get: undefined option: 'DiscussionTools_visualenhancements_namespaces' [18:37:26] reverting [18:37:43] (FrontendUnavailable) firing: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [18:37:44] (FrontendUnavailable) firing: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [18:37:52] (03Merged) 10jenkins-bot: flink-kubernetes-operator - allow flink-app pods to talk to k8s API [deployment-charts] - 10https://gerrit.wikimedia.org/r/879618 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [18:37:57] zabe: need help? [18:38:01] hm [18:38:14] that'd be from the wmf.18 backport? [18:38:16] ACKed the page [18:38:26] (03PS1) 10Zabe: Revert "Enable visual enhancements on all talk namespaces" [extensions/DiscussionTools] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/880914 [18:38:31] (03CR) 10Zabe: [V: 03+2 C: 03+2] Revert "Enable visual enhancements on all talk namespaces" [extensions/DiscussionTools] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/880914 (owner: 10Zabe) [18:38:42] uhh [18:38:46] zabe: wait a sec [18:38:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [18:38:59] isn't it that [18:39:05] zabe: so i think that was caused by sync order between the two files [18:39:09] !log zabe@deploy1002 backport aborted: (duration: 00m 26s) [18:39:13] if extension.json was synced first, the PHP code would fail [18:39:25] zabe: please just sync the revert [18:39:27] but, if you sync the revert now [18:39:31] we will get the same spike again [18:39:32] MatmaRex: synces are atomical now [18:39:38] caused by the same sync order issues? [18:39:49] !log zabe@deploy1002 Started scap: Backport for [[gerrit:880914|Revert "Enable visual enhancements on all talk namespaces"]] [18:39:59] (KubernetesAPILatency) firing: High Kubernetes API latency (GET replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:40:18] taavi: well, it doesn't look like they are… [18:40:34] anyway, let's revert and see [18:41:10] semi-atomic. "It depends" [18:41:12] I'm here [18:41:19] we can think about the cause after the wikis are back up [18:41:20] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/admin 'apply'. [18:41:29] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/admin 'apply'. 
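One quick way to test the sync-order theory above is to check whether the config key from the ConfigException is still present in the extension.json actually deployed on a regular appserver versus an mwdebug host. A sketch, assuming the usual /srv/mediawiki/php-<branch>/ checkout layout on the targets:

    # Sketch only: is the key gone on some hosts but not others?
    KEY='DiscussionTools_visualenhancements_namespaces'
    for host in mw1414.eqiad.wmnet mwdebug1001.eqiad.wmnet; do
      echo "== $host =="
      ssh "$host" "grep -n '$KEY' /srv/mediawiki/php-1.40.0-wmf.18/extensions/DiscussionTools/extension.json" \
        || echo 'key not present on this host'
    done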
[18:41:32] half in a meeting [18:41:38] !log zabe@deploy1002 zabe and zabe: Backport for [[gerrit:880914|Revert "Enable visual enhancements on all talk namespaces"]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [18:41:39] already syncing [18:41:58] sadly we can't speed this up [18:42:27] sorry about this, i didn't think about this. i really hate the sync order thing [18:43:36] i am getting the "Fatal exception of type "ConfigException"" on wikipedia right now, fwiw [18:44:00] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/services/flink-app-example: apply [18:44:02] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/flink-app-example: apply [18:44:24] Someone else I'm on VC with says they're able to connect now. I'm still getting an error. [18:44:40] Tamzin: -tech, please and thank you [18:44:59] (KubernetesAPILatency) resolved: High Kubernetes API latency (GET replicasets) on k8s@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [18:45:00] Working again for me, had been getting the error for the past few mins [18:45:09] but yes, the revert is getting deployed atm [18:45:21] https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&from=1673980221875&to=1673981112597 [18:45:25] so it's really up to a random chance of which appserver you get [18:45:34] it's subsiding but not fully gone yet [18:45:48] we were down for a full ten minutes [18:46:17] better graph https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red?orgId=1&from=now-1h&to=now&viewPanel=63 [18:47:34] ok, so what exactly happened here? looks like the updated code got applied before the extension.json changes were? [18:47:43] (FrontendUnavailable) resolved: varnish-text has reduced HTTP availability #page - https://wikitech.wikimedia.org/wiki/Varnish#Diagnosing_Varnish_alerts - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=3 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [18:47:44] (FrontendUnavailable) resolved: HAProxy (cache_text) has reduced HTTP availability #page - TODO - https://grafana.wikimedia.org/d/000000479/frontend-traffic?viewPanel=13 - https://alerts.wikimedia.org/?q=alertname%3DFrontendUnavailable [18:47:48] (how) do we cache extension.json? [18:47:50] hmm [18:48:25] do we know for sure that the deploys are atomic? this looked exactly like an issue that would occur if you synced extension.json without syncing PHP code [18:48:40] what is the faulty patch? Does anyone want to write IR? 
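While the revert syncs, the "random chance of which appserver you get" effect can be observed from outside by comparing a normal request with one pinned to an mwdebug backend. A sketch, assuming the X-Wikimedia-Debug header format documented on Wikitech:

    # Sketch only: status code from a random production appserver vs. an
    # mwdebug host that already has the revert synced.
    URL='https://en.wikipedia.org/wiki/Special:BlankPage'
    curl -s -o /dev/null -w 'prod:    %{http_code}\n' "$URL"
    curl -s -o /dev/null -w 'mwdebug: %{http_code}\n' \
      -H 'X-Wikimedia-Debug: backend=mwdebug1001.eqiad.wmnet' "$URL"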
[18:48:54] the faulty patch would be https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/880914 [18:48:55] the faulty patch was https://gerrit.wikimedia.org/r/c/mediawiki/extensions/DiscussionTools/+/879103/ [18:48:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [18:48:57] https://wikitech.wikimedia.org/wiki/Incident_status [18:48:58] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:880914|Revert "Enable visual enhancements on all talk namespaces"]] (duration: 09m 08s) [18:49:08] I can write something once we figure out what exactly happened here [18:49:16] MatmaxRex: Changes to non-PHP files will be visible as soon as they land on the target server. [18:49:41] dancy: this seems like the exact opposite.. which smells like caching here [18:49:50] the extension.json values are cached in APCu [18:50:04] don't know for how long [18:50:13] Amir1: ok, that makes sense.. (how) is that cache cleared? [18:50:23] the problem was that the config was removed from extension.json, so as soon as the files reached the hosts they treid to look up the config which failed [18:50:24] the issue did not occur when i was testing on mwdebug. it only occurred during the normal deployment [18:50:24] mtime of the extension.json file [18:50:25] OK makes sense. Depends on the meaning of "seen" I suppose. [18:50:30] because we don't every request to load tons of .json files making syscals [18:51:03] dancy: does the canary check phase involve restarting php-fpm? [18:51:11] unless something has changed, we do check the mtime of every extension.json + skin.json on every request [18:51:53] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:52:04] taavi: It should but I'll double check in the code. [18:52:40] zabe: do you have the logs (including timestamp) for the initial scap sync leading to the outage? [18:52:45] so is it the sync order? [18:53:07] looks like sync order issue to me (maybe with some extra steps) [18:53:11] the php code should only get applied when php-fpm is restarted, which happens well after the updated extension.json is synced [18:54:07] not sure, I don't think php-fpm restart is needed. Sometimes is, but not all the time [18:54:10] taavi https://phabricator.wikimedia.org/P43180 [18:54:33] thank you [18:58:18] (ProbeDown) firing: Service text:80 has failed probes (http_text_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:58:36] Amir1: can you use your sre superpowers to grep through mw1414 auth.log and figure out what time did scap restart php-fpm there? 
it should result in a logged sudo call [18:58:47] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin2002 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:58:51] sure [18:59:15] scap stuff can be viewed in logstash too: https://logstash.wikimedia.org/app/dashboards#/view/f7e31de0-9f0d-11eb-863c-3588009e4dd9 [18:59:19] PROBLEM - Check systemd state on cp2031 is CRITICAL: CRITICAL - degraded: The following units failed: cadvisor.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [18:59:21] RECOVERY - Check unit status of httpbb_hourly_appserver on cumin1001 is OK: OK: Status of the systemd unit httpbb_hourly_appserver https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:59:35] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [18:59:44] zabe: the other changes are live, right? or not? (i just want to update the tasks) [18:59:58] yes [19:00:04] jnuche and jeena: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for MediaWiki train - Utc-0+Utc-7 Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230117T1900). [19:00:08] thanks [19:00:14] let me know if i should do anything else to help [19:00:49] https://phabricator.wikimedia.org/P43181 taavi ? [19:01:00] Amir1: perfect, thank you [19:01:17] taavi: Yes, canaries should be performing a self php-fpm restart. I need to improve the logging though. [19:01:35] dancy: thanks, and indeed those logs confirm it's happening [19:02:02] the first errors on that box are logged at 18:31:11, which is a second earlier than the php-fpm restart [19:02:40] so unless the auth.log timestamps are off (if it logs the exit time or something), that suggests that php did indeed pick up the code changes before the restart [19:03:05] we have a culprit then 🔫 [19:03:18] (ProbeDown) resolved: Service text:80 has failed probes (http_text_ip4) #page - https://wikitech.wikimedia.org/wiki/Runbook#text:80 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [19:03:20] 10SRE, 10Traffic-Icebox: HTTPS/Browser Recommendations page on Wikitech is outdated - https://phabricator.wikimedia.org/T240813 (10BCornwall) 05Open→03Resolved I've updated the page a little further to reflect the Windows 8/8.1 EOL (just a few days ago!) and made some of the wording more vague so it can in... [19:03:24] 10SRE, 10Traffic: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (10BCornwall) [19:03:26] 10SRE, 10Traffic-Icebox, 10Documentation: Update TLS/HTTP documentation on wikitech - https://phabricator.wikimedia.org/T96844 (10BCornwall) [19:03:53] PROBLEM - traffic_server backend process restarted on cp2031 is CRITICAL: 2 ge 2 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2031&var-layer=backend [19:04:08] AFAIK, sync leads to the change of behavior without php-fpm restart. 
Sometimes it didn't and that led to outages and then we decided to restart php-fpm unconditionally [19:04:55] I might be misremembering stuff though [19:05:07] that sounds similar to what I remember [19:05:12] opcache corruption was a big motivation for the always php-fpm restart [19:05:29] next question: why did the canary checks not prevent the deployment from going forward? [19:07:04] 10SRE, 10Traffic: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (10BCornwall) Looks like this can be closed, right @Vgutierrez? [19:07:40] taavi: it would have worked after the restart, just not until both ext.json and php files synced [19:07:41] RECOVERY - Check systemd state on cumin2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:08:23] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin2002 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [19:09:09] <_joe_> taavi: some hosts don't use php-fpm restarts [19:09:14] <_joe_> specifically, the jobrunners [19:09:40] (03PS1) 10Ebernhardson: Resolve deprecations and type changes in elastica 7.3.0 [extensions/CirrusSearch] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880915 [19:09:44] 10SRE, 10Traffic-Icebox, 10Documentation: Update TLS/HTTP documentation on wikitech - https://phabricator.wikimedia.org/T96844 (10BCornwall) [19:09:46] 10SRE, 10Traffic: Start warning and deprecation process for all legacy TLS - https://phabricator.wikimedia.org/T238038 (10BCornwall) [19:09:53] 10SRE, 10Traffic-Icebox, 10Documentation: Update TLS/HTTP documentation on wikitech - https://phabricator.wikimedia.org/T96844 (10BCornwall) Even though this is the older ticket, marking as a dupe of T240813 since that had more relevant information. [19:11:10] (03PS1) 10Ottomata: flink-app - set KUBERNETES_SERVICE_{HOST,PORT} in flink-main-container [deployment-charts] - 10https://gerrit.wikimedia.org/r/880991 (https://phabricator.wikimedia.org/T324576) [19:11:19] RECOVERY - Check systemd state on mwmaint1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:11:56] (03CR) 10CI reject: [V: 04-1] flink-app - set KUBERNETES_SERVICE_{HOST,PORT} in flink-main-container [deployment-charts] - 10https://gerrit.wikimedia.org/r/880991 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [19:12:06] going forward: MatmaRex, zabe, how about trying to deploy the patch again but with the extension.json part split in a separate patch set that's getting synced first? [19:12:11] (03CR) 10Ottomata: "not sure if this will work, but I'll try this before giving up." [deployment-charts] - 10https://gerrit.wikimedia.org/r/880991 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [19:12:37] RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.codfw.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration [19:13:32] taavi: i am up for that, but i didn't want to suggest it because there's other stuff scheduled on the calendar [19:13:38] we could probably just do that backport without the extension.json change at all, it doesn't hurt to have that, now useless, config var in there [19:13:47] i can prepare patches if you want to deploy them though [19:13:58] the backport wasn't *that* important :) [19:14:07] zabe: true [19:14:57] zabe: wdym? 
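The timeline question above (did the code change take effect before scap's php-fpm restart?) can be checked per host: the restart shows up as a sudo entry in auth.log, which can then be compared with the first ConfigException entries for that host in Logstash. A sketch, using mw1414 as in the discussion; the exact restart command name is not assumed, only that it mentions fpm:

    # Sketch only: when did scap trigger the php-fpm restart on this appserver?
    # Reading auth.log generally needs root.
    ssh mw1414.eqiad.wmnet 'sudo grep sudo /var/log/auth.log | grep -i fpm | tail -n 5'
    # Compare those timestamps against the first ConfigException entries for the
    # same host in Logstash (as was done above with P43180/P43181).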
[19:15:35] MatmaRex: I'd maybe prefer getting it deployed today, just so everyone (or at least I) will have a good feeling about it afterwards [19:15:58] (03PS1) 10Zabe: Revert "Revert "Enable visual enhancements on all talk namespaces"" [extensions/DiscussionTools] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/880916 [19:16:09] (03PS2) 10Zabe: Revert "Revert "Enable visual enhancements on all talk namespaces"" [extensions/DiscussionTools] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/880916 [19:16:28] (03PS2) 10Ottomata: flink-app - set KUBERNETES_SERVICE_{HOST,PORT} in flink-main-container [deployment-charts] - 10https://gerrit.wikimedia.org/r/880991 (https://phabricator.wikimedia.org/T324576) [19:16:45] taavi, the backport has the same effect without the extension.json change, like ^^, so we can just leave it away and we'll avoid issues [19:17:11] (03CR) 10CI reject: [V: 04-1] flink-app - set KUBERNETES_SERVICE_{HOST,PORT} in flink-main-container [deployment-charts] - 10https://gerrit.wikimedia.org/r/880991 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [19:17:14] (03PS3) 10Bartosz Dziewoński: Revert "Revert "Enable visual enhancements on all talk namespaces"" [extensions/DiscussionTools] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/880916 (owner: 10Zabe) [19:17:18] (03CR) 10Bartosz Dziewoński: [C: 03+1] Revert "Revert "Enable visual enhancements on all talk namespaces"" [extensions/DiscussionTools] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/880916 (owner: 10Zabe) [19:17:29] i added details to commit message [19:18:03] oh, it was removing the config var and not adding it? somehow I missed that, and it explains.. a lot [19:18:17] yea, that sounds like a good plan [19:19:42] 10SRE: wikimediastatus.net help popups are unreadable - https://phabricator.wikimedia.org/T327201 (10Tgr) [19:20:20] 10SRE: wikimediastatus.net help popups are unreadable - https://phabricator.wikimedia.org/T327201 (10Tgr) (BTW it would be nice to add a `#wikimediastatus` alias to the appropriate project, it's not clear currently where such issues should be filed.) [19:22:47] i'm around, feel free to ping me whenever you want to deploy [19:23:02] 10SRE: wikimediastatus.net should have link anchors - https://phabricator.wikimedia.org/T327203 (10Tgr) [19:23:05] RECOVERY - Check systemd state on cp2031 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [19:23:16] (03PS3) 10Ottomata: flink-app - set KUBERNETES_SERVICE_{HOST,PORT} in flink-main-container [deployment-charts] - 10https://gerrit.wikimedia.org/r/880991 (https://phabricator.wikimedia.org/T324576) [19:24:19] zabe: ^ do you still want to deploy or should someone else do it? 
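For the record, the safer ordering discussed above when a config variable is being removed is two separate deploys, so the code stops reading the variable before its extension.json entry disappears. A sketch with placeholder change numbers:

    # Sketch only: two-step removal of an extension.json config entry.
    # Step 1: PHP-only change that stops reading the variable; the (now unused)
    #         extension.json entry stays in place.
    scap backport 111111   # placeholder
    # Step 2: after step 1 is live everywhere and php-fpm has been restarted,
    #         delete the unused entry from extension.json in a follow-up change.
    scap backport 222222   # placeholder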
[19:24:43] I can do it [19:24:52] awesome [19:25:00] (03CR) 10Zabe: [C: 03+2] Revert "Revert "Enable visual enhancements on all talk namespaces"" [extensions/DiscussionTools] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/880916 (owner: 10Zabe) [19:25:30] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [extensions/DiscussionTools] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/880916 (owner: 10Zabe) [19:25:57] MatmaRex, let's do this ^ [19:26:21] ok [19:28:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [19:30:32] (03Merged) 10jenkins-bot: Revert "Revert "Enable visual enhancements on all talk namespaces"" [extensions/DiscussionTools] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/880916 (owner: 10Zabe) [19:30:56] !log zabe@deploy1002 Started scap: Backport for [[gerrit:880916|Revert "Revert "Enable visual enhancements on all talk namespaces""]] [19:32:48] !log zabe@deploy1002 zabe and zabe: Backport for [[gerrit:880916|Revert "Revert "Enable visual enhancements on all talk namespaces""]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [19:33:01] MatmaRex, ^ [19:33:36] looks good on mwdebug [19:33:50] 10SRE, 10Parsoid, 10Scap, 10serviceops: scap groups on bastions still needed? - https://phabricator.wikimedia.org/T327066 (10Dzahn) +1, I also think those are not relevant anymore. If something like this is needed it should be done from deployment servers or maybe mwmaint but not bastions. I would say rem... [19:34:03] then let's try this [19:34:32] (03CR) 10Ottomata: [C: 03+2] flink-app - set KUBERNETES_SERVICE_{HOST,PORT} in flink-main-container [deployment-charts] - 10https://gerrit.wikimedia.org/r/880991 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [19:40:37] (03Merged) 10jenkins-bot: flink-app - set KUBERNETES_SERVICE_{HOST,PORT} in flink-main-container [deployment-charts] - 10https://gerrit.wikimedia.org/r/880991 (https://phabricator.wikimedia.org/T324576) (owner: 10Ottomata) [19:41:21] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:880916|Revert "Revert "Enable visual enhancements on all talk namespaces""]] (duration: 10m 25s) [19:41:30] 10SRE, 10SRE-Access-Requests: Requesting access to Analytics Data for Muhammad Jaziraly - https://phabricator.wikimedia.org/T327172 (10Eevans) [19:41:45] looks like everything went well this time? nice! [19:42:01] yep :) [19:42:20] cool, I'm finishing up the IR [19:42:44] nice, thanks! [19:42:58] do you want to post an update on the task, or should i write one? [19:43:10] MatmaRex: I'll do that in a bit [19:43:28] !log otto@deploy1002 helmfile [dse-k8s-eqiad] START helmfile.d/services/flink-app-example: apply [19:43:30] ok. thanks everyone [19:43:33] !log otto@deploy1002 helmfile [dse-k8s-eqiad] DONE helmfile.d/services/flink-app-example: apply [19:43:36] thanks all [19:43:51] are backports all done now? [19:44:06] yep [19:44:32] ok thanks! I'm going to deploy to group0 shortly [19:45:01] did the train blocker fix get deployed already? 
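A quick way to answer "did the train blocker fix get deployed already?" from the deployment host is to look for the fix in the wmf.19 checkout and see which wikis are on that branch. A sketch, assuming the usual /srv/mediawiki-staging layout on deploy1002:

    # Sketch only, run on the deployment host.
    # Is the MultiWriteBagOStuff fix in the wmf.19 branch checkout?
    cd /srv/mediawiki-staging/php-1.40.0-wmf.19
    git log --oneline -n 20 | grep -i 'MultiWriteBagOStuff' || echo 'not found in recent commits'
    # How many wikis are already on wmf.19 (group0 only at this point)?
    grep -c '1.40.0-wmf.19' /srv/mediawiki-staging/wikiversions.json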
[19:46:56] taavi: yeah, zabe was backporting it in the last group i think (if you mean https://phabricator.wikimedia.org/T327158) [19:47:20] oh thanks for the reminder [19:47:20] yeah, it was part of the faulty deploy (but not the outage causing patch) [19:47:47] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:47:59] I only reverted the faulty patch, the other ones just went out [19:48:11] jeena, the fix is deployed, you are safe to go ahead [19:48:21] 👍 [19:50:15] !log T327175 Reprocessing last several hours of updates (`2023-01-17T12:00:00Z` -> `2023-01-17T17:30:00Z`) on codfw elasticsearch, running on `ryankemper@mwmaint2002` tmux session `reindex` [19:50:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:19] T327175: Survey and correct issues caused by CODFW switch failure - https://phabricator.wikimedia.org/T327175 [19:50:22] Amir1: zabe: MatmaRex: draft IR for the MW incident, https://wikitech.wikimedia.org/wiki/Incidents/2023-01-17_MediaWiki [19:52:07] thanks. reading :) [19:56:12] Thanks taavi [19:57:46] (03PS1) 10TrainBranchBot: group0 wikis to 1.40.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880999 (https://phabricator.wikimedia.org/T325582) [19:57:48] (03CR) 10TrainBranchBot: [C: 03+2] group0 wikis to 1.40.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880999 (https://phabricator.wikimedia.org/T325582) (owner: 10TrainBranchBot) [19:58:27] (03Merged) 10jenkins-bot: group0 wikis to 1.40.0-wmf.19 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880999 (https://phabricator.wikimedia.org/T325582) (owner: 10TrainBranchBot) [19:58:41] (03PS1) 10Ebernhardson: UpdateSuggesterIndex: Properly cleanup bad indices [extensions/CirrusSearch] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880917 [19:59:42] taavi: i added a point and tweaked into a bit [19:59:45] (03PS1) 10Ryan Kemper: wdqs: disable notifs on not-yet-in-service hosts [puppet] - 10https://gerrit.wikimedia.org/r/881000 [19:59:51] intro* [20:00:05] (03PS2) 10Ryan Kemper: wdqs: disable notifs on not-yet-in-service hosts [puppet] - 10https://gerrit.wikimedia.org/r/881000 (https://phabricator.wikimedia.org/T301167) [20:00:17] (03CR) 10Gehel: [C: 03+1] wdqs: disable notifs on not-yet-in-service hosts [puppet] - 10https://gerrit.wikimedia.org/r/881000 (https://phabricator.wikimedia.org/T301167) (owner: 10Ryan Kemper) [20:01:24] (03CR) 10Ryan Kemper: [C: 03+2] wdqs: disable notifs on not-yet-in-service hosts [puppet] - 10https://gerrit.wikimedia.org/r/881000 (https://phabricator.wikimedia.org/T301167) (owner: 10Ryan Kemper) [20:06:54] !log jhuneidi@deploy1002 rebuilt and synchronized wikiversions files: group0 wikis to 1.40.0-wmf.19 refs T325582 [20:06:59] T325582: 1.40.0-wmf.19 deployment blockers - https://phabricator.wikimedia.org/T325582 [20:08:21] If i understand correctly, cancelling the backport left the cluster in an inconsistent state (Where files were synced, but php opcache was never cleared). 
That seems like something to potentially call out in the IR as unexpected [20:12:25] I'm also not sure if "The issue was detected early" is exactly something that "went well", since if the incident was detected a bit later, the incident arguably wouldn't have happened [20:13:06] (03CR) 10CI reject: [V: 04-1] UpdateSuggesterIndex: Properly cleanup bad indices [extensions/CirrusSearch] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880917 (owner: 10Ebernhardson) [20:14:57] taavi: i wrote a human-readable summary at https://phabricator.wikimedia.org/T327196#8532474 . let me know if i got something wrong [20:18:45] RECOVERY - WDQS SPARQL on wdqs1016 is OK: HTTP OK: HTTP/1.1 200 OK - 691 bytes in 1.151 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:18:49] !log [WDQS] Restart blazegraph on `wdqs1016` to clear alert: `ryankemper@wdqs1016:~$ sudo systemctl restart wdqs-blazegraph` [20:18:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:21] RECOVERY - Query Service HTTP Port on wdqs1016 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.238 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [20:21:41] bawolff: good point on the inconsistant state, I'll update it accordingly [20:22:51] and would have happened if scap wasn't cancelled, although the duration of the outage would indeed have been shorter. so in general I think it's a good thing if we're able to detect and react to bad changes quickly [20:23:40] yes true. It definitely does seem surprising though that cancelling the scap actually made the problem worse though [20:25:25] definitely counter-intuitive [20:25:33] !log ran preferred-replica-election on kafka-logging codfw to clear replica imbalance [20:25:34] you have a good point that in this case it maybe wasn't something that "went well", but I definitely would not consider it something that "went poorly" [20:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:46] the joys of weird edge cases [20:26:11] maybe I'll add a note just in case [20:28:26] I suppose if scap worked like - deploy one server, restart php, then go to next server instead of change files everywhere then restart everywhere - that would have been better for this case, although presumably it might not be as good for other cases [20:29:21] I guess doing a deployment would take like an hour then [20:29:41] indeed, or if it would group the syncs and php-fpm restarts together instead of first doing the file syncs, then deploying to kubernetes and only after that's complete finishing with the php-fpm restarts [20:29:59] * bawolff so glad he's not a deployer anymore. I always found it so stresful [20:33:50] 10SRE-OnFire, 10Wikidata: Very high maxlag on Wikidata - https://phabricator.wikimedia.org/T327210 (10RPI2026F1) [20:34:04] 10SRE-OnFire, 10Wikidata: Very high maxlag on Wikidata - https://phabricator.wikimedia.org/T327210 (10RPI2026F1) p:05Triage→03Unbreak! 
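For the maxlag task just filed: the lag Wikidata currently reports to clients can be read straight from the API by sending a request that always trips the maxlag check. A sketch (jq is only for readability and is assumed to be installed):

    # Sketch only: maxlag=-1 is always exceeded, so instead of query results the
    # API answers with a 'maxlag' error carrying the current lag in seconds and
    # the lagging host.
    curl -s 'https://www.wikidata.org/w/api.php?action=query&meta=siteinfo&maxlag=-1&format=json' | jq .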
[20:42:41] (03PS2) 10Andrew Bogott: Move cloud-vps client manifests to OpenStack verison 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/880567 (https://phabricator.wikimedia.org/T323086) [20:42:42] (03PS1) 10Andrew Bogott: neutron policy.yaml: remove a redundant policy rule [puppet] - 10https://gerrit.wikimedia.org/r/881004 (https://phabricator.wikimedia.org/T323086) [20:43:24] (03CR) 10Andrew Bogott: [C: 03+2] Move cloud-vps client manifests to OpenStack verison 'zed' [puppet] - 10https://gerrit.wikimedia.org/r/880567 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [20:44:29] (03CR) 10Andrew Bogott: [C: 03+2] neutron policy.yaml: remove a redundant policy rule [puppet] - 10https://gerrit.wikimedia.org/r/881004 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [20:47:15] !log [WDQS] Depooled `wdqs1016` [20:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:57] 10SRE, 10Platform Engineering, 10cloud-services-team (Kanban): Get platform engineering team green light for Cloud NAT to wikis change - https://phabricator.wikimedia.org/T273738 (10BCornwall) [20:52:51] (03PS1) 10Bartosz Dziewoński: Revert "Gallery: Improve initial state and fix thumbnail sizes" [core] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880919 (https://phabricator.wikimedia.org/T326270) [20:53:16] (03PS1) 10Bartosz Dziewoński: Revert gallery changes in 1.40.0-wmf.18 [core] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/880920 (https://phabricator.wikimedia.org/T326990) [20:53:44] 10SRE, 10API Platform, 10Commons, 10MediaWiki-File-management, and 6 others: Define an official thumb API - https://phabricator.wikimedia.org/T66214 (10BCornwall) [20:53:50] (03PS1) 10Bartosz Dziewoński: Revert gallery changes in 1.40.0-wmf.18 [core] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880921 (https://phabricator.wikimedia.org/T326990) [20:59:44] (03PS1) 10Andrew Bogott: neutron policy.yaml: remove more redundant policy rules [puppet] - 10https://gerrit.wikimedia.org/r/881006 (https://phabricator.wikimedia.org/T323086) [21:00:05] RoanKattouw, Urbanecm, cjming, TheresNoTime, and kindrobot: My dear minions, it's time we take the moon! Just kidding. Time for UTC late backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230117T2100). [21:00:05] hmonroy, Dreamy_Jazz, jan_drewniak, and ebernhardson: A patch you scheduled for UTC late backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [21:00:16] o/ [21:00:21] Hello [21:00:25] 10SRE, 10SRE Observability: wikimediastatus.net help popups are unreadable - https://phabricator.wikimedia.org/T327201 (10Eevans) [21:00:26] o/ [21:00:27] \o [21:00:41] !log [WDQS] `ryankemper@wdqs1005:~$ sudo pool` (had been left depooled from previous powercycle) [21:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:43] 10SRE, 10SRE Observability: wikimediastatus.net should have link anchors - https://phabricator.wikimedia.org/T327203 (10Eevans) [21:01:50] (03CR) 10Andrew Bogott: [C: 03+2] neutron policy.yaml: remove more redundant policy rules [puppet] - 10https://gerrit.wikimedia.org/r/881006 (https://phabricator.wikimedia.org/T323086) (owner: 10Andrew Bogott) [21:02:39] o/ is anyone deploying already? 
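The wdqs entries above (wdqs-blazegraph restart, pool/depool) follow the usual per-host sequence; a sketch of the full drain/restart/repool cycle, built from the commands that appear in the log:

    # Sketch only, run on the WDQS host itself (e.g. wdqs1016.eqiad.wmnet).
    sudo depool                              # take the host out of LVS rotation
    sudo systemctl restart wdqs-blazegraph   # restart the Blazegraph service
    # wait for the SPARQL / Query Service HTTP checks to recover, then:
    sudo pool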
[21:03:22] PROBLEM - IPv6 ping to esams on ripe-atlas-esams IPv6 is CRITICAL: CRITICAL - failed 260 probes of 709 (alerts on 90) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:03:26] PROBLEM - IPv4 ping to esams on ripe-atlas-esams is CRITICAL: CRITICAL - failed 189 probes of 798 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:04:39] (03PS1) 10Andrew Bogott: Revert "neutron policy.yaml: remove more redundant policy rules" [puppet] - 10https://gerrit.wikimedia.org/r/880922 [21:05:05] Doesn't look like it [21:05:36] and now I'm thinking if I should get worried about the esams atlas alerts above [21:05:49] (03CR) 10Andrew Bogott: [C: 03+2] Revert "neutron policy.yaml: remove more redundant policy rules" [puppet] - 10https://gerrit.wikimedia.org/r/880922 (owner: 10Andrew Bogott) [21:06:59] so no deploys? :( [21:07:31] no, I think I'm too tired to touch production at this point. you would need to find someone else for it. sorry. [21:08:14] i can deploy i suppose [21:08:21] yay! [21:08:26] RECOVERY - IPv6 ping to esams on ripe-atlas-esams IPv6 is OK: OK - failed 37 probes of 709 (alerts on 90) - https://atlas.ripe.net/measurements/23449938/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:08:26] :D [21:08:27] thank you! [21:08:30] RECOVERY - IPv4 ping to esams on ripe-atlas-esams is OK: OK - failed 3 probes of 798 (alerts on 35) - https://atlas.ripe.net/measurements/23449935/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts https://grafana.wikimedia.org/d/K1qm1j-Wz/ripe-atlas [21:09:24] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ebernhardson@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880568 (https://phabricator.wikimedia.org/T324561) (owner: 10HMonroy) [21:13:04] taavi: is there anything special about pre-merging a patch do skin/extension and then `scap backport ...` it? [21:13:15] just to get jenkins moving [21:15:17] ebernhardson: these days `scap backport` will do everything on its own including the merge, although to save time you still can manually +2 it in advance after the precious patch was merged [21:15:17] (03PS3) 10Ebernhardson: Enable Phonos on afwiktionary and arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880568 (https://phabricator.wikimedia.org/T324561) (owner: 10HMonroy) [21:15:44] taavi: alright thanks, that's what i was wondering. 
seemed like it should work but i don't think i've done that yet :) [21:15:57] (03CR) 10TrainBranchBot: "Approved by ebernhardson@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880568 (https://phabricator.wikimedia.org/T324561) (owner: 10HMonroy) [21:16:20] yeah, it's a really nice tool, much better than the old manual workflow [21:16:21] (03CR) 10Ebernhardson: [C: 03+2] "pre-merging for UTC late backport window" [skins/Vector] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/880913 (https://phabricator.wikimedia.org/T327064) (owner: 10Jdrewniak) [21:16:42] (03Merged) 10jenkins-bot: Enable Phonos on afwiktionary and arwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880568 (https://phabricator.wikimedia.org/T324561) (owner: 10HMonroy) [21:17:10] !log ebernhardson@deploy1002 Started scap: Backport for [[gerrit:880568|Enable Phonos on afwiktionary and arwiki (T324561)]] [21:17:13] T324561: Roll out IPA Audio Renderer support to pilot wikis - https://phabricator.wikimedia.org/T324561 [21:17:44] (03PS6) 10Dreamy Jazz: Start writing to cul_reason[_plaintext]_id on group0 and group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879653 (https://phabricator.wikimedia.org/T233004) [21:18:51] !log ebernhardson@deploy1002 ebernhardson and hmonroy: Backport for [[gerrit:880568|Enable Phonos on afwiktionary and arwiki (T324561)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [21:18:51] 10SRE, 10Incident Tooling: wikimediastatus.net help popups are unreadable - https://phabricator.wikimedia.org/T327201 (10lmata) p:05Triage→03Medium [21:18:57] hmonroy: your patch is synced to mwdebug, please test [21:19:06] k, thanks [21:19:49] 10SRE, 10Incident Tooling: wikimediastatus.net should have link anchors - https://phabricator.wikimedia.org/T327203 (10lmata) p:05Triage→03Medium [21:22:11] ebernhardson: looks good. Thank you! [21:22:17] alright, shipping [21:25:15] (If my change is got to, I have no way to test it as I do not have access to Special:CheckUser on any group0 or group1 wiki - If zabe is not around to help like with this morning UTC, we could skip to writing everywhere and I would be able to test on a group2 wiki). [21:25:27] i can help with that Dreamy_Jazz [21:25:33] Thanks :) [21:25:36] just ping me when needed with instructions :) [21:25:55] thanks urbanecm. 
That patch will be up next, i support it will be a few minutes [21:26:18] ack [21:26:19] (03PS2) 10Ebernhardson: UpdateSuggesterIndex: Properly cleanup bad indices [extensions/CirrusSearch] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880917 [21:28:09] (03Abandoned) 10Andrew Bogott: sre-sandbox: remove automatic VM purge logic [puppet] - 10https://gerrit.wikimedia.org/r/829231 (https://phabricator.wikimedia.org/T247517) (owner: 10Andrew Bogott) [21:29:31] !log ebernhardson@deploy1002 Finished scap: Backport for [[gerrit:880568|Enable Phonos on afwiktionary and arwiki (T324561)]] (duration: 12m 21s) [21:29:36] T324561: Roll out IPA Audio Renderer support to pilot wikis - https://phabricator.wikimedia.org/T324561 [21:30:44] hmonroy: ok, yours is all shipped [21:31:10] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ebernhardson@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879653 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [21:31:18] ebernhardson: thanks again :) [21:31:20] Dreamy_Jazz: started shipping yours [21:31:23] hmonroy: np [21:31:24] Thanks [21:31:47] (03Merged) 10jenkins-bot: Table of contents Collapse/Expand not working [skins/Vector] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/880913 (https://phabricator.wikimedia.org/T327064) (owner: 10Jdrewniak) [21:31:58] (03Merged) 10jenkins-bot: Start writing to cul_reason[_plaintext]_id on group0 and group1 wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879653 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [21:33:25] hmm, i guess it's going to ship both together since they merged so quickly. probably ok [21:33:32] jan_drewniak: heads up yours is going to ship to mwdeubg as well [21:33:44] indeed ebernhardson. only one will be logged in the message though [21:33:59] !log ebernhardson@deploy1002 Started scap: Backport for [[gerrit:879653|Start writing to cul_reason[_plaintext]_id on group0 and group1 wikis (T233004)]] [21:34:03] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [21:34:09] (if you want to avoid that, you can ctrl+c instead of confirming sync, and re-start it with both changes as params) [21:34:26] ebernhardson: sounds good [21:34:50] !log scap also backporting [[gerrit:880913|Table of contents Collapse/Expand not working (T327064)]] [21:34:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:34:54] T327064: Table of contents Collapse/Expand not working - https://phabricator.wikimedia.org/T327064 [21:34:56] urbanecm: The test instructions are to run any check using Special:CheckUser (IP or user), make sure that the check reason is specified and includes wikilink, and then check that the row in the database wrote comment IDs to the cul_reason_id and cul_reason_plaintext_id. Then make sure that the comment table rows for these IDs are the reason you specified. cul_reason_plaintext_id should reference a comment table row with the [21:34:56] wikilink removed. [21:35:32] By wikilink removed I mean the "[[" and "]]" removed, leaving the page name. [21:35:43] !log ebernhardson@deploy1002 ebernhardson and dreamyjazz: Backport for [[gerrit:879653|Start writing to cul_reason[_plaintext]_id on group0 and group1 wikis (T233004)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [21:35:53] Dreamy_Jazz: so, I do a self-check with a reason like `testing [[:phab:T233004]]`? 
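The database check described above can be scripted roughly as follows from a maintenance host. The table and column names come from the instructions themselves; that the sql wrapper accepts a query on stdin is an assumption (paste it interactively if not):

    # Sketch only: look at the newest cu_log row and the comment rows it points to.
    echo "SELECT cul_id, cul_reason_id, cul_reason_plaintext_id,
                 c1.comment_text AS reason,
                 c2.comment_text AS reason_plaintext
          FROM cu_log
          LEFT JOIN comment c1 ON c1.comment_id = cul_reason_id
          LEFT JOIN comment c2 ON c2.comment_id = cul_reason_plaintext_id
          ORDER BY cul_id DESC
          LIMIT 1;" | sql testwiki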
[21:36:01] That would be fine. [21:36:09] jan_drewniak: your Vector patch is up on mwdebug [21:36:17] looks it's on mwdebug now -- so i'll do that now [21:36:36] And the comment table should store the comments "testing [[:phab:T233004]]" and "testing :phab:T233004" respectively. [21:37:06] Dreamy_Jazz: i opened https://test.wikipedia.org/wiki/Special:CheckUserLog and...it is (nearly) empty [21:37:15] it literally only has one entry [21:37:23] DB has bunch more, so it's at least not dataloss [21:37:24] ebernhardson: that's mwdebug1001? [21:37:34] jan_drewniak: all the mwdebug instances, including 1001 [21:37:45] mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [21:37:52] Dreamy_Jazz: it also happens with or without mwdebug, so it's not related to your patch either. [21:37:56] (03PS2) 10Bartosz Dziewoński: Revert gallery changes in 1.40.0-wmf.18 & .19 [core] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880921 (https://phabricator.wikimedia.org/T326990) [21:37:58] but...any reason why that happens? [21:38:03] No idea. [21:38:07] huh [21:38:18] that sounds like a wmf.19 blocker [21:38:20] That wiki was done this morning I think? [21:38:35] i.e. starting to write to the comment ID columns [21:38:50] (03Abandoned) 10Bartosz Dziewoński: Revert "Gallery: Improve initial state and fix thumbnail sizes" [core] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880919 (https://phabricator.wikimedia.org/T326270) (owner: 10Bartosz Dziewoński) [21:38:56] On my local testing wiki I have more than one entry showing [21:39:11] ebernhardson: mine is fine to sync [21:39:22] confirmed, the only showing entry is the one with cul_reason_id filled [21:39:36] That is very odd... [21:40:31] was there a deployment just now? links in dropdowns in old vector have suddenly stopped working for me [21:41:20] okay, I'm now sure that this is caused by something in wmf.19. [21:41:30] Jhs: there wa a deploy of enabling phonos on afwiktionary and arwiki, there is a patch in the pipeline for vector but it's only deployed to mwdebug* right now [21:41:52] I've tried changing lots of different things [21:41:53] Dreamy_Jazz: I don't really feel comfortable with modifying the CU log when it is currently broken. I suggest rescheduling after this issue is fixed. [21:41:59] Sure. I can do that. [21:42:00] (03PS1) 10Ahmon Dancy: Gitlab runners: Use gckeepstorage buildkitd setting to manage storage [puppet] - 10https://gerrit.wikimedia.org/r/881007 (https://phabricator.wikimedia.org/T327060) [21:42:08] sounds reasonable, will back this change out [21:42:13] thanks [21:42:14] (the cu parts) [21:42:14] But I can't reproduce this on my local testing wiki [21:42:16] !log ebernhardson@deploy1002 Sync cancelled. [21:42:30] I'll fill a task about the CU log issue now [21:42:34] (03PS2) 10Ahmon Dancy: Gitlab runners: Use gckeepstorage buildkitd setting to manage storage [puppet] - 10https://gerrit.wikimedia.org/r/881007 (https://phabricator.wikimedia.org/T327060) [21:42:37] I wonder if this is to do with a change to core? [21:42:43] possible [21:42:43] Let me pull core origin master [21:42:44] ebernhardson, oh, now it suddenly started working again. 
false alarm, ignore me :) [21:43:03] created https://phabricator.wikimedia.org/T327219 [21:43:25] (03PS1) 10TrainBranchBot: Revert "Start writing to cul_reason[_plaintext]_id on group0 and group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881008 [21:43:27] (03CR) 10TrainBranchBot: "ebernhardson@deploy1002 created a revert of this change as Iff7cef26f42ab563ec0e59c298baae4ece9200c8" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879653 (https://phabricator.wikimedia.org/T233004) (owner: 10Dreamy Jazz) [21:43:53] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ebernhardson@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881008 (owner: 10TrainBranchBot) [21:44:10] Jhs: no worries :) [21:44:10] I still can't reproduce this locally, even after pulling master. [21:44:26] zabe: you beated me :) [21:44:35] (03Merged) 10jenkins-bot: Revert "Start writing to cul_reason[_plaintext]_id on group0 and group1 wikis" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881008 (owner: 10TrainBranchBot) [21:44:59] !log ebernhardson@deploy1002 Started scap: Backport for [[gerrit:881008|Revert "Start writing to cul_reason[_plaintext]_id on group0 and group1 wikis"]] [21:45:35] jan_drewniak: yours should be heading out to the cluster at same time as the revert to the checkuser patch, few minutes [21:46:42] !log ebernhardson@deploy1002 ebernhardson and trainbranchbot: Backport for [[gerrit:881008|Revert "Start writing to cul_reason[_plaintext]_id on group0 and group1 wikis"]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet [21:47:20] continuing with sync [21:49:24] (03CR) 10Ahmon Dancy: "Pcc results: https://puppet-compiler.wmflabs.org/output/881007/39157/" [puppet] - 10https://gerrit.wikimedia.org/r/881007 (https://phabricator.wikimedia.org/T327060) (owner: 10Ahmon Dancy) [21:49:50] i also think this CU log issue warrants train rollback (it affects mediawiki.org, which is a content wiki, albeit with no local CUs). [21:50:26] ...or since zabe likely just figured the cause, let's revert/fix https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/879686 instead? [21:51:16] Oh. [21:51:18] I see. [21:51:58] Hang on though [21:52:03] !log zabe@mwmaint1002:~$ mwscript extensions/CheckUser/maintenance/populateCulComment.php --wiki testwiki [21:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:13] ^this actually fixes testwiki [21:52:14] sure. not doing anything now :) [21:52:26] but not in the intended way [21:52:51] zabe: please don't fix the other group0 wikis yet -- I think it'd be great to keep a repro case for now [21:53:09] (to ensure revert/fix of the patch does fix the issue) [21:53:10] I'm happy to write a fix that could be backported into wmf.19 [21:53:19] no worries, we should still revert since the patch basically goes to read_new by itself (which it shouldn't) [21:53:45] Yeah. I had forgot that we needed to use the checkuser version of the comment store [21:53:56] PROBLEM - Check systemd state on cumin1001 is CRITICAL: CRITICAL - degraded: The following units failed: httpbb_kubernetes_hourly.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [21:54:08] Apologies [21:54:20] !log ebernhardson@deploy1002 Finished scap: Backport for [[gerrit:881008|Revert "Start writing to cul_reason[_plaintext]_id on group0 and group1 wikis"]] (duration: 09m 20s) [21:54:25] no worries Dreamy_Jazz, it happens. 
at least we caught it before train shipped :) [21:54:27] tbh, even that might not solve the issue since that is following the cu_changes comment migration var and not your one [21:54:43] !log Finished scap: Backport for [[gerrit:880913|Table of contents Collapse/Expand not working (T327064)]] [21:54:44] Yeah. Would need a new class [21:54:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:47] T327064: Table of contents Collapse/Expand not working - https://phabricator.wikimedia.org/T327064 [21:54:53] jan_drewniak: your patch should be live everywhere now, going on to the config patch [21:55:21] (03PS2) 10Ebernhardson: Show edit button in sticky header for desktop-improvement wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880533 (https://phabricator.wikimedia.org/T324799) (owner: 10Jdrewniak) [21:55:29] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ebernhardson@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880533 (https://phabricator.wikimedia.org/T324799) (owner: 10Jdrewniak) [21:55:33] (completly unrelated: it seems like the securepoll voters list vor ucoc election is broken) [21:55:43] s/vor/for [21:56:02] zabe: can you define broken? :) [21:56:03] Revert created at https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/880924 [21:56:53] it says for a lot of folks (including myself) that they are not elligeble [21:57:02] * eligible [21:57:11] Dreamy_Jazz: i confirm that revert "fixes" it, +2'ed. [21:57:20] (fwiw reverts are self-mergable) [21:57:39] (03PS2) 10Jdlrobson: Revert gallery changes in 1.40.0-wmf.18 [core] (wmf/1.40.0-wmf.18) - 10https://gerrit.wikimedia.org/r/880920 (https://phabricator.wikimedia.org/T326990) (owner: 10Bartosz Dziewoński) [21:57:51] Would I now cherry-pick it to wmf.19 branch? [21:58:18] (03PS1) 10Dreamy Jazz: Revert "Add read new support for cu_log comment ID columns" [extensions/CheckUser] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880925 (https://phabricator.wikimedia.org/T327219) [21:58:21] feel free to [21:58:28] PROBLEM - Check unit status of httpbb_kubernetes_hourly on cumin1001 is CRITICAL: CRITICAL: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [21:58:30] zabe: I saw a message saying to email ucocproject@wikimedia.org to be added to the voter roll if you are eligible but unable to vote [21:58:44] I guess the cherry-picked change would need to be backported somehow? [21:58:46] zabe: interesting. it disallows me to vote now. but it worked; i voted yesterday. [21:58:54] Dreamy_Jazz: it now just needs a deployer [21:59:10] zabe: I'll raise that up with T&S. [21:59:58] I can't seem to +2 that wmf.19 branch patch, so https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/880925 is ready for a deployer. [22:00:14] I can deploy that once ebernhardson is done [22:00:19] (03CR) 10Dreamy Jazz: [C: 03+1] "Needs backported to wmf.19 to unblock the train." 
[extensions/CheckUser] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880925 (https://phabricator.wikimedia.org/T327219) (owner: 10Dreamy Jazz) [22:00:25] Thanks [22:00:40] I'll work on making that patch work this time :) [22:00:56] (03CR) 10Ebernhardson: [C: 03+2] "pre-merging for backport window" [extensions/CirrusSearch] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880915 (owner: 10Ebernhardson) [22:01:06] (03CR) 10Ebernhardson: [C: 03+2] "pre-merging for backport window" [extensions/CirrusSearch] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880917 (owner: 10Ebernhardson) [22:01:17] zabe: will probably overrun the window a little [22:01:24] since it's already :01 :) [22:01:27] no worries :) [22:02:09] not clear why the current patch isn't merging, it has the +2 gate-and-submit [22:02:38] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/880533/ is the one currently being backported [22:03:23] no idea why, but since gate-and-submit passed, I'd say it's fine to just hit submit [22:03:35] i was wondering same :) sounds good [22:04:04] !log ebernhardson@deploy1002 Started scap: Backport for [[gerrit:880533|Show edit button in sticky header for desktop-improvement wikis (T324799)]] [22:04:07] T324799: Set default for edit button in sticky header across wikis - https://phabricator.wikimedia.org/T324799 [22:04:14] jan_drewniak: you'll be up on mwdebug hosts in a couple minutes [22:05:47] !log ebernhardson@deploy1002 ebernhardson and jdrewniak: Backport for [[gerrit:880533|Show edit button in sticky header for desktop-improvement wikis (T324799)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet [22:06:19] jan_drewniak: it's up now, please test [22:07:08] ebernhardson: yeah, ok I think it works :P good to deploy [22:07:33] excellent, continuing [22:11:34] (03PS1) 10Ottomata: flink 1.16.0-wmf3 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/881011 (https://phabricator.wikimedia.org/T316519) [22:14:47] !log ebernhardson@deploy1002 Finished scap: Backport for [[gerrit:880533|Show edit button in sticky header for desktop-improvement wikis (T324799)]] (duration: 10m 43s) [22:14:51] T324799: Set default for edit button in sticky header across wikis - https://phabricator.wikimedia.org/T324799 [22:15:02] jan_drewniak: your config patch is live now [22:15:24] ebernhardson: thanks! [22:15:44] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ebernhardson@deploy1002 using scap backport" [extensions/CirrusSearch] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880915 (owner: 10Ebernhardson) [22:15:46] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ebernhardson@deploy1002 using scap backport" [extensions/CirrusSearch] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880917 (owner: 10Ebernhardson) [22:15:50] (03CR) 10Jeena Huneidi: "This change is ready for review." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/879908 (https://phabricator.wikimedia.org/T290260) (owner: 10Jeena Huneidi) [22:16:17] (03Merged) 10jenkins-bot: Resolve deprecations and type changes in elastica 7.3.0 [extensions/CirrusSearch] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880915 (owner: 10Ebernhardson) [22:17:40] (03Merged) 10jenkins-bot: UpdateSuggesterIndex: Properly cleanup bad indices [extensions/CirrusSearch] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880917 (owner: 10Ebernhardson) [22:18:09] !log ebernhardson@deploy1002 Started scap: Backport for [[gerrit:880915|Resolve deprecations and type changes in elastica 7.3.0]], [[gerrit:880917|UpdateSuggesterIndex: Properly cleanup bad indices]] [22:19:02] 10SRE-tools, 10Infrastructure-Foundations: 500 generated by Netbox while running the decom cookbook - https://phabricator.wikimedia.org/T268605 (10Volans) 05Open→03Resolved a:03Volans I forgot to resolve this. No re-occurrence happened. [22:20:00] !log ebernhardson@deploy1002 ebernhardson and ebernhardson: Backport for [[gerrit:880915|Resolve deprecations and type changes in elastica 7.3.0]], [[gerrit:880917|UpdateSuggesterIndex: Properly cleanup bad indices]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet [22:20:38] 10SRE-tools, 10DC-Ops, 10Infrastructure-Foundations, 10Patch-For-Review: Implement early-racking automation (was: Productionize system to automatically deploy some BIOS Settings) - https://phabricator.wikimedia.org/T271583 (10Volans) 05Open→03Resolved a:03Volans The BIOS automated configuration is in... [22:20:57] (03CR) 10Jdlrobson: Show edit button in sticky header for desktop-improvement wikis (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880533 (https://phabricator.wikimedia.org/T324799) (owner: 10Jdrewniak) [22:21:15] looks reasonable, continuing deploy [22:21:44] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: Puppet compiler: order resources for easy comparison between hosts - https://phabricator.wikimedia.org/T154776 (10Volans) 05Open→03Resolved a:03Volans Old task, superseded by more recent improvements to the puppet compiler. Closing as not valid anymore... [22:22:20] 10SRE, 10Cloud-VPS (Project-requests), 10Patch-For-Review, 10cloud-services-team (Kanban): Request creation of 'sre-sandbox' VPS project - https://phabricator.wikimedia.org/T247517 (10herron) >>! In T247517#8211187, @jbond wrote: > * did the emails informing @herron that the machine was due to be delete... [22:23:01] 10SRE, 10SRE-OnFire, 10ops-codfw, 10Sustainability (Incident Followup): asw-b2-codfw down - https://phabricator.wikimedia.org/T327001 (10bking) [22:23:40] 10SRE-tools, 10Infrastructure-Foundations: Cumin: batch_sleep is waited after last execution in some cases - https://phabricator.wikimedia.org/T213296 (10Volans) p:05Triage→03Medium [22:24:58] PROBLEM - BGP status on cr2-eqord is CRITICAL: BGP CRITICAL - No response from remote host 208.80.154.198 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [22:25:48] !log cp2031: restart ats-be [22:25:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:26:05] ebernhardson: still deploying? any chance we could add 1 more config patch to the deploy window? [22:27:16] jan_drewniak: the last patch is shipping now, i suppose it can't hurt to push a last config patch. I suppose this is re: jon's comments on the config ? 
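For readers following the sticky-header change under review here: the edit button in Vector 2022's sticky header is toggled per wiki from mediawiki-config, and the review comments are about whether it should apply to logged-out users and whether a plain 'default' entry should replace the desktop-improvements wiki list (see the follow-up patch below). A minimal sketch of what such a stanza can look like in wmf-config/InitialiseSettings.php; the setting name and exact value shape are assumptions for illustration, not copied from gerrit change 880533 or 881016:

    // Hypothetical per-wiki feature flag in wmf-config/InitialiseSettings.php;
    // the real patches may use a different setting name or value shape.
    'wgVectorStickyHeaderEdit' => [
        'default' => [
            'logged_in' => true,
            // Matches the review point below that the button should be logged-in only.
            'logged_out' => false,
        ],
        // Once 'default' covers every wiki, a separate desktop-improvements
        // override is redundant and can be dropped, as discussed below.
    ],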
[22:27:52] !log ebernhardson@deploy1002 Finished scap: Backport for [[gerrit:880915|Resolve deprecations and type changes in elastica 7.3.0]], [[gerrit:880917|UpdateSuggesterIndex: Properly cleanup bad indices]] (duration: 09m 42s) [22:28:00] RECOVERY - traffic_server backend process restarted on cp2031 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=codfw+prometheus/ops&var-instance=cp2031&var-layer=backend [22:28:23] jan_drewniak: ready now for it [22:29:48] (03PS1) 10Jdrewniak: Make sticky header edit button default for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881016 (https://phabricator.wikimedia.org/T324799) [22:30:20] ebernhardson: thanks! this is the better config [22:30:20] https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/881016 [22:30:55] !log volans@cumin1001 conftool action : set/pooled=inactive; selector: name=non-existent1001 [22:31:11] jan_drewniak: my reading of Jdlrobson's comment is that he expected logged_out => false? [22:31:17] (NELHigh) firing: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [22:31:22] just verifying you have what you expect [22:32:02] (03PS2) 10Jdrewniak: Make sticky header edit button default for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881016 (https://phabricator.wikimedia.org/T324799) [22:32:35] 10SRE, 10SRE-tools, 10Infrastructure-Foundations: confctl: log to SAL even if the selection doesn't match any host - https://phabricator.wikimedia.org/T155705 (10Volans) For future reference, this is still happening, hence keeping the task open. [22:33:17] ebernhardson: yeah, I edited the patch. It shouldn't really make a difference but technically yes, it should be logged-in only. [22:33:41] jan_drewniak: should the desktop-improvements section also be removed, so that default is the only config and it's the same everywhere? [22:34:57] (03PS3) 10Jdrewniak: Make sticky header edit button default for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881016 (https://phabricator.wikimedia.org/T324799) [22:35:33] jan_drewniak: thanks! looks to line up with what was asked for in the ticket and comments on the previous patch.
backporting [22:35:34] ebernhardson: this is what happens when you write patches with kids running around :P [22:35:39] :) [22:35:49] mine is thankfully not home yet...but will be in ~20min [22:36:17] (NELHigh) resolved: Elevated Network Error Logging events (tcp.timed_out) #page - https://wikitech.wikimedia.org/wiki/Network_monitoring#NEL_alerts - https://logstash.wikimedia.org/goto/5c8f4ca1413eda33128e5c5a35da7e28 - https://alerts.wikimedia.org/?q=alertname%3DNELHigh [22:36:56] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by ebernhardson@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881016 (https://phabricator.wikimedia.org/T324799) (owner: 10Jdrewniak) [22:37:41] (03Merged) 10jenkins-bot: Make sticky header edit button default for all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881016 (https://phabricator.wikimedia.org/T324799) (owner: 10Jdrewniak) [22:38:04] !log ebernhardson@deploy1002 Started scap: Backport for [[gerrit:881016|Make sticky header edit button default for all wikis (T324799)]] [22:38:08] T324799: Set default for edit button in sticky header across all wikis - https://phabricator.wikimedia.org/T324799 [22:39:46] !log ebernhardson@deploy1002 ebernhardson and jdrewniak: Backport for [[gerrit:881016|Make sticky header edit button default for all wikis (T324799)]] synced to the testservers: mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug2002.codfw.wmnet [22:40:13] jan_drewniak: it's live on mwdebug hosts, please test [22:41:42] ebernhardson: ok now I see it on all wikis (using the Vector 2022 skin) as expected. good to sync [22:41:52] kk, thanks [22:48:39] !log ebernhardson@deploy1002 Finished scap: Backport for [[gerrit:881016|Make sticky header edit button default for all wikis (T324799)]] (duration: 10m 34s) [22:48:43] T324799: Set default for edit button in sticky header across all wikis - https://phabricator.wikimedia.org/T324799 [22:49:09] zabe: all done with the backport window (a bit belated) [22:49:33] ebernhardson: thank you again for helping out! [22:49:37] thanks :) [22:49:45] jan_drewniak: np!
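The CheckUser and comment/actor work that resumes below, like the cul_reason revert earlier in the evening, follows MediaWiki's usual staged schema migration: write the new columns alongside the old ones, backfill existing rows with a maintenance script (populateCulComment.php above), switch reads to the new columns, and only then stop writing the old ones. In wmf-config such stages are typically expressed with core's SCHEMA_COMPAT_* flags; a sketch with a made-up setting name, not the actual variable used for CheckUser:

    // Hypothetical migration-stage setting; the SCHEMA_COMPAT_* constants are
    // real ones from MediaWiki core's Defines.php, but this setting name is not.
    'wmgCheckUserCommentMigrationStage' => [
        // Write both old and new columns, keep reading the old ones.
        'default' => SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_OLD,
        // After the populate script has backfilled a wiki, flip its reads.
        'testwiki' => SCHEMA_COMPAT_WRITE_BOTH | SCHEMA_COMPAT_READ_NEW,
        // Final state once every wiki reads the new columns: SCHEMA_COMPAT_NEW.
    ],

The revert deployed above (T327219) was needed because the offending patch effectively went to read-new on its own instead of following the configured stage, which is the failure mode this staging is meant to prevent.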
[22:49:52] (03CR) 10Zabe: [C: 03+2] Revert "Add read new support for cu_log comment ID columns" [extensions/CheckUser] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880925 (https://phabricator.wikimedia.org/T327219) (owner: 10Dreamy Jazz) [22:50:26] (03PS1) 10BBlack: Revert "dns: Depool all of codfw" [dns] - 10https://gerrit.wikimedia.org/r/881019 [22:51:02] RECOVERY - Check unit status of httpbb_kubernetes_hourly on cumin1001 is OK: OK: Status of the systemd unit httpbb_kubernetes_hourly https://wikitech.wikimedia.org/wiki/Monitoring/systemd_unit_state [22:51:32] RECOVERY - Check systemd state on cumin1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:51:36] (03CR) 10BBlack: [C: 03+2] Revert "dns: Depool all of codfw" [dns] - 10https://gerrit.wikimedia.org/r/881019 (owner: 10BBlack) [22:51:48] !log repooling codfw [22:51:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:53:32] Dreamy_Jazz, if you like we could try https://gerrit.wikimedia.org/r/c/879653/ again since we found out what the issue was [22:54:26] (03PS2) 10Zabe: Start writing to rev_comment_id everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880902 (https://phabricator.wikimedia.org/T299954) [22:54:28] (03CR) 10Zabe: [C: 03+2] Start writing to rev_comment_id everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880902 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [22:54:40] (03PS2) 10Zabe: Stop writing to cul_user and cul_user_text everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880903 (https://phabricator.wikimedia.org/T233004) [22:54:46] (03CR) 10Zabe: [C: 03+2] Stop writing to cul_user and cul_user_text everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880903 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [22:55:11] (03Merged) 10jenkins-bot: Start writing to rev_comment_id everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880902 (https://phabricator.wikimedia.org/T299954) (owner: 10Zabe) [22:55:23] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by zabe@deploy1002 using scap backport" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880903 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [22:55:32] (03Merged) 10jenkins-bot: Stop writing to cul_user and cul_user_text everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880903 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [22:55:54] !log zabe@deploy1002 Started scap: Backport for [[gerrit:880903|Stop writing to cul_user and cul_user_text everywhere (T233004)]], [[gerrit:880902|Start writing to rev_comment_id everywhere (T299954)]] [22:55:59] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [22:55:59] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [22:56:31] (03CR) 10Jdlrobson: [C: 04-1] English Wikipedia uses Vector 2022 skin (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879659 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [22:56:36] Sure with retrying that [22:57:40] !log zabe@deploy1002 zabe and zabe: Backport for [[gerrit:880903|Stop writing to cul_user and cul_user_text everywhere (T233004)]], [[gerrit:880902|Start writing to rev_comment_id everywhere (T299954)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, 
mwdebug1001.eqiad.wmnet [23:00:33] (03PS1) 10Zabe: Revert "Revert "Start writing to cul_reason[_plaintext]_id on group0 and group1 wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881026 (https://phabricator.wikimedia.org/T233004) [23:00:41] (03PS2) 10Zabe: Revert "Revert "Start writing to cul_reason[_plaintext]_id on group0 and group1 wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881026 (https://phabricator.wikimedia.org/T233004) [23:01:04] (03CR) 10Zabe: [C: 03+2] Revert "Revert "Start writing to cul_reason[_plaintext]_id on group0 and group1 wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881026 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [23:01:48] (03Merged) 10jenkins-bot: Revert "Revert "Start writing to cul_reason[_plaintext]_id on group0 and group1 wikis"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881026 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [23:03:19] (03Merged) 10jenkins-bot: Revert "Add read new support for cu_log comment ID columns" [extensions/CheckUser] (wmf/1.40.0-wmf.19) - 10https://gerrit.wikimedia.org/r/880925 (https://phabricator.wikimedia.org/T327219) (owner: 10Dreamy Jazz) [23:05:17] (03PS4) 10Jdlrobson: [10%] English Wikipedia uses Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879659 (https://phabricator.wikimedia.org/T326892) [23:05:19] (03PS1) 10Jdlrobson: [25%] English Wikipedia uses Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881020 (https://phabricator.wikimedia.org/T326892) [23:05:21] (03PS1) 10Jdlrobson: [50%] English Wikipedia uses Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881021 (https://phabricator.wikimedia.org/T326892) [23:05:22] (03PS1) 10Jdlrobson: [75%] English Wikipedia uses Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881022 (https://phabricator.wikimedia.org/T326892) [23:05:24] (03PS1) 10Jdlrobson: [100%] English Wikipedia uses Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/881023 [23:06:13] (03CR) 10Jdlrobson: [10%] English Wikipedia uses Vector 2022 skin (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879659 (https://phabricator.wikimedia.org/T326892) (owner: 10Jdlrobson) [23:06:24] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:880903|Stop writing to cul_user and cul_user_text everywhere (T233004)]], [[gerrit:880902|Start writing to rev_comment_id everywhere (T299954)]] (duration: 10m 29s) [23:06:29] T299954: Write code for handing write and read of rev_comment_id - https://phabricator.wikimedia.org/T299954 [23:06:30] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [23:06:57] (03PS5) 10Jdlrobson: [10%] English Wikipedia uses Vector 2022 skin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/879659 (https://phabricator.wikimedia.org/T326892) [23:07:31] !log zabe@deploy1002 Started scap: Backport for [[gerrit:881026|Revert "Revert "Start writing to cul_reason[_plaintext]_id on group0 and group1 wikis"" (T233004)]], [[gerrit:880925|Revert "Add read new support for cu_log comment ID columns" (T327219)]] [23:07:36] T327219: Special:CheckUserLog almost empty on testwiki - https://phabricator.wikimedia.org/T327219 [23:09:14] !log zabe@deploy1002 zabe and dreamyjazz and zabe: Backport for [[gerrit:881026|Revert "Revert "Start writing to cul_reason[_plaintext]_id on group0 and group1 wikis"" (T233004)]], [[gerrit:880925|Revert "Add read new support for 
cu_log comment ID columns" (T327219)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet [23:09:38] (03CR) 10Ahmon Dancy: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/879908 (https://phabricator.wikimedia.org/T290260) (owner: 10Jeena Huneidi) [23:12:32] https://phabricator.wikimedia.org/P43182 lgtm [23:12:35] syncing [23:14:54] Thanks [23:19:17] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:881026|Revert "Revert "Start writing to cul_reason[_plaintext]_id on group0 and group1 wikis"" (T233004)]], [[gerrit:880925|Revert "Add read new support for cu_log comment ID columns" (T327219)]] (duration: 11m 46s) [23:19:22] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [23:19:22] T327219: Special:CheckUserLog almost empty on testwiki - https://phabricator.wikimedia.org/T327219 [23:20:45] I've created https://gerrit.wikimedia.org/r/c/mediawiki/extensions/CheckUser/+/881015 to re-add read new support, for you to review at your pleasure (if you want of course :) ). [23:21:40] sure, will take a look later / tomorrow [23:22:13] (03PS2) 10Zabe: Start reading from cuc_actor everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880904 (https://phabricator.wikimedia.org/T233004) [23:22:16] Thanks. I'm making some progress on read new for cu_log_event and cu_private_event, and should have another patch ready for that soon. [23:22:17] (03CR) 10Zabe: [C: 03+2] Start reading from cuc_actor everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880904 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [23:22:24] (03PS2) 10Zabe: Start reading from cuc_comment_id on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880905 (https://phabricator.wikimedia.org/T233004) [23:22:27] (03CR) 10Zabe: [C: 03+2] Start reading from cuc_comment_id on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880905 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [23:23:07] (03Merged) 10jenkins-bot: Start reading from cuc_actor everywhere [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880904 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [23:23:27] (03Merged) 10jenkins-bot: Start reading from cuc_comment_id on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/880905 (https://phabricator.wikimedia.org/T233004) (owner: 10Zabe) [23:24:00] !log zabe@deploy1002 Started scap: Backport for [[gerrit:880905|Start reading from cuc_comment_id on testwiki (T233004)]], [[gerrit:880904|Start reading from cuc_actor everywhere (T233004)]] [23:25:48] !log zabe@deploy1002 zabe and zabe: Backport for [[gerrit:880905|Start reading from cuc_comment_id on testwiki (T233004)]], [[gerrit:880904|Start reading from cuc_actor everywhere (T233004)]] synced to the testservers: mwdebug1001.eqiad.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug2002.codfw.wmnet [23:25:51] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [23:27:23] 10SRE, 10ops-codfw, 10DC-Ops: Decommission mc20[19-27] and mc20[29-37] - https://phabricator.wikimedia.org/T313733 (10Papaul) @Jhancock.wm thank you [23:28:01] (CirrusSearchHighOldGCFrequency) firing: Elasticsearch instance cloudelastic1006-cloudelastic-psi-eqiad is running the old gc excessively - https://wikitech.wikimedia.org/wiki/Search#Stuck_in_old_GC_hell - https://grafana.wikimedia.org/d/000000462/elasticsearch-memory - 
https://alerts.wikimedia.org/?q=alertname%3DCirrusSearchHighOldGCFrequency [23:33:58] !log zabe@deploy1002 Finished scap: Backport for [[gerrit:880905|Start reading from cuc_comment_id on testwiki (T233004)]], [[gerrit:880904|Start reading from cuc_actor everywhere (T233004)]] (duration: 09m 58s) [23:34:02] T233004: Update CheckUser for actor and comment table - https://phabricator.wikimedia.org/T233004 [23:38:53] (03PS3) 10Dzahn: peopleweb: ensure rsync service is stopped on passive host [puppet] - 10https://gerrit.wikimedia.org/r/879878 (https://phabricator.wikimedia.org/T326888) [23:40:08] (03CR) 10Dzahn: peopleweb: ensure rsync service is stopped on passive host (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/879878 (https://phabricator.wikimedia.org/T326888) (owner: 10Dzahn) [23:43:44] PROBLEM - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is CRITICAL: 101 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:45:18] RECOVERY - MediaWiki exceptions and fatals per minute for api_appserver on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:47:46] (JobUnavailable) firing: (3) Reduced availability for job jmx_presto in analytics@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [23:51:32] !log mwscript extensions/Translate/scripts/moveTranslatableBundle.php --wiki metawiki "User:Amire80/frg" "Movement Multilingual Termbase" "Zabe" "per request [[:phab:T327149|T327149]]" # T327149 [23:51:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:51:37] T327149: Move the translatable page "meta:User:Amire80/frg" to "meta:Movement Multilingual Termbase" - https://phabricator.wikimedia.org/T327149
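Both one-off data fixes in this log, populateCulComment.php for CheckUser and moveTranslatableBundle.php for Translate, are ordinary MediaWiki maintenance scripts, run on the maintenance host through mwscript with --wiki selecting the target wiki. A stripped-down sketch of that pattern; the class, option and file names are invented for illustration and do not reflect the real scripts:

    <?php
    // Illustrative maintenance-script skeleton, not the source of
    // populateCulComment.php or moveTranslatableBundle.php.
    $IP = getenv( 'MW_INSTALL_PATH' );
    if ( $IP === false ) {
        $IP = __DIR__ . '/../../..';
    }
    require_once "$IP/maintenance/Maintenance.php";

    class BackfillExampleColumn extends Maintenance {
        public function __construct() {
            parent::__construct();
            $this->addDescription( 'Backfill a new column in batches (example only)' );
            $this->addOption( 'batch-size', 'Rows to process per batch', false, true );
            // mwscript supplies --wiki, which selects the target wiki's config and database.
        }

        public function execute() {
            $batchSize = (int)$this->getOption( 'batch-size', 500 );
            // ... select rows still missing the new value, write them in batches
            // of $batchSize, waiting for replication between batches ...
            $this->output( "Done.\n" );
        }
    }

    $maintClass = BackfillExampleColumn::class;
    require_once RUN_MAINTENANCE_IF_MAIN;

Invocation then mirrors the SAL entries above, e.g. mwscript extensions/SomeExtension/maintenance/backfillExampleColumn.php --wiki testwiki (a hypothetical path).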