[00:01:12] (CR) CI reject: [V: -1] maintain-dbusers: add prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955) (owner: David Caro)
[00:01:14] (PS1) CDanis: add tunnelencabulator [debs/wmf-sre-laptop] - https://gerrit.wikimedia.org/r/902826 (https://phabricator.wikimedia.org/T266784)
[00:02:46] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:04:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PUT certificates) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:09:44] !log removing 2 files for legal compliance
[00:09:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:09:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PUT certificates) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[00:21:48] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:47:35] PROBLEM - Check systemd state on doc1002 is CRITICAL: CRITICAL - degraded: The following units failed: rsync-doc-doc2002.codfw.wmnet.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[00:54:53] SRE, SRE-swift-storage, ops-codfw, DC-Ops: Two failed disks in ms-be2067 - https://phabricator.wikimedia.org/T332983 (Papaul) The same server had 2 disks failed on 03-02 task: https://phabricator.wikimedia.org/T331030 and today we have 2 disk failed again. That is a total of 4 disks on the same s...
[00:56:37] !log doc1002 - manually running rsync to doc2002 - which failed with status 23 when started by timer
[00:56:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:57:26] !log doc1002 - issue is mismatched UIDs again, most likely. doc-uploader is debmonitor on new host
[00:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:58:38] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime for 3 days, 0:00:00 on doc1002.eqiad.wmnet with reason: WIP-known-to-be-debugged-new-host
[00:59:02] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 3 days, 0:00:00 on doc1002.eqiad.wmnet with reason: WIP-known-to-be-debugged-new-host
[01:20:45] (JobUnavailable) firing: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:25:45] (JobUnavailable) resolved: (2) Reduced availability for job atlas_exporter in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[01:38:33] SRE, Wikimedia-Mailing-lists: Turn down summary digest frequency on wikimedia-l - https://phabricator.wikimedia.org/T332927 (Legoktm)
[01:40:27] SRE, Wikimedia-Mailing-lists: Turn down summary digest frequency on wikimedia-l - https://phabricator.wikimedia.org/T332927 (Legoktm) >>! In T332927#8724124, @Novem_Linguae wrote: > @Aklapper . Thanks. Any idea who the administrator of wikimedia-l is? I'm not seeing it in the obvious places (https://list...
[01:44:59] RECOVERY - Check systemd state on doc1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[01:52:34] (PS1) Legoktm: planet: Add Nemo_bis's new blog [puppet] - https://gerrit.wikimedia.org/r/902828
[01:59:53] (PS1) Legoktm: planet: Add Wikimedia category of Jan Ainali's blog [puppet] - https://gerrit.wikimedia.org/r/902829
[02:01:06] SRE, Wikimedia-Mailing-lists: Turn down summary digest frequency on wikimedia-l - https://phabricator.wikimedia.org/T332927 (Novem_Linguae) Thank you both :)
[02:06:45] (JobUnavailable) firing: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:13:05] (CR) Aaron Schulz: Add per-action component-level profiling in statsd using excimer (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) (owner: Aaron Schulz)
[02:26:45] (JobUnavailable) resolved: Reduced availability for job sidekiq in ops@eqiad - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable
[02:52:42] * Krinkle staging on mwdebug1002
[03:00:36] (CR) Krinkle: Add per-action component-level profiling in statsd using excimer (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) (owner: Aaron Schulz)
[03:02:30] legoktm: do you know where in planet I can change (if possible) how it renders the 'date' for each post?
[03:02:38] > 20:43, Friday, 24 2023 March UTC
[03:02:52] * legoktm looks
[03:03:19] This is one of those typical things that stem from some 1970s protocol that can never change but makes no sense to any humans, not even the US data format beats that.
[03:03:48] probably like date('c') or date('r') one of those built-ins.
[03:04:05] https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/d1b4c9503b96193f85c23a202d15c20ae7aed649/modules/planet/templates/html/index.html.tmpl.erb#75
[03:04:12] this looks related, but not sure how to trace it behind that
[03:05:27] I think that's the old file
[03:05:30] pretty sure it's https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/planet/templates/html/rawdog/rd_item.html.tmpl.erb#7
[03:05:34] * legoktm looks in the rawdog source code
[03:06:28] hmm, it's no longer in Debian
[03:09:22] https://sources.debian.org/src/rawdog/2.23-2/rawdoglib/rawdog.py/#L81
[03:09:52] https://sources.debian.org/src/rawdog/2.23-2/rawdoglib/rawdog.py/#L941
[03:10:17] Krinkle: https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/refs/heads/production/modules/planet/templates/global.erb#17 I think?
[03:10:43] Neat!
[03:11:59] looks like https://gerrit.wikimedia.org/r/c/operations/puppet/+/436790 tried, but missed that datetimeformat overrides dayformat/timeformat
[03:18:01] well, it changed both, but in different ways
[03:18:02] not sure why
[03:21:18] (PS9) Aaron Schulz: Add per-action component-level profiling in statsd using excimer [mediawiki-config] - https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968)
[03:30:52] SRE, Wikimedia-Planet, Patch-For-Review: Find a replacement for RSS aggregator for planet.wikimedia.org - https://phabricator.wikimedia.org/T281219 (Legoktm) I very much appreciate and depend upon the planet service so happy to spend some time working on it if it would be useful... >>! In T281219#70...
[03:32:21] (CR) Krinkle: Add per-action component-level profiling in statsd using excimer (2 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) (owner: Aaron Schulz)
[03:47:53] (PS1) Krinkle: planet: Use a more human date format than "Friday, 03 2023 March" [puppet] - https://gerrit.wikimedia.org/r/902832
[03:48:15] * Krinkle done testing on mwdebug1002
[04:11:17] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:16:18] (MediaWikiHighErrorRate) resolved: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:19:18] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:24:18] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[04:37:43] (PS10) Aaron Schulz: Add per-action component-level profiling in statsd using excimer [mediawiki-config] - https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968)
[04:38:32] (CR) Aaron Schulz: Add per-action component-level profiling in statsd using excimer (3 comments) [mediawiki-config] - https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) (owner: Aaron Schulz)
[04:39:58] (CR) Krinkle: [C: +1] Add per-action component-level profiling in statsd using excimer [mediawiki-config] - https://gerrit.wikimedia.org/r/893839 (https://phabricator.wikimedia.org/T225968) (owner: Aaron Schulz)
[04:50:06] (PS6) David Caro: maintain-dbusers: add prometheus metrics [puppet] - https://gerrit.wikimedia.org/r/902819 (https://phabricator.wikimedia.org/T332955)
[05:27:44] SRE, Traffic-Icebox: Offer AuthDNS service over IPv6 - https://phabricator.wikimedia.org/T81605 (dkg) What's the status on this fix? Currently, an IPv6-only client using an IPv6-only DNS resolver will fail to reach wikimedia services. If their DNS resolver is capable of using a NAT64 translator, that mi...
[06:01:57] (PS2) Legoktm: planet: Use a more human date format than "Friday, 03 2023 March" [puppet] - https://gerrit.wikimedia.org/r/902832 (owner: Krinkle)
[06:03:21] (CR) Legoktm: planet: Use a more human date format than "Friday, 03 2023 March" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/902832 (owner: Krinkle)
[06:04:08] (PS3) Krinkle: planet: Use a more human date format than "Friday, 03 2023 March" [puppet] - https://gerrit.wikimedia.org/r/902832
[06:05:06] (CR) Legoktm: [C: +1] planet: Use a more human date format than "Friday, 03 2023 March" [puppet] - https://gerrit.wikimedia.org/r/902832 (owner: Krinkle)
[07:35:11] (CR) Nemo bis: "Oh thank you. Not sure how much I'll post but there will be posts which don't fit the Wikimedia Planet. I've now added a "wiki" tag so a m" [puppet] - https://gerrit.wikimedia.org/r/902828 (owner: Legoktm)
[07:36:49] (PS1) Majavah: dumps: properly absent enterprise timers [puppet] - https://gerrit.wikimedia.org/r/902833
[07:38:20] (CR) Majavah: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40329/console" [puppet] - https://gerrit.wikimedia.org/r/902833 (owner: Majavah)
[07:39:38] (CR) Majavah: [V: +1] "PCC SUCCESS (): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/40330/console" [puppet] - https://gerrit.wikimedia.org/r/902833 (owner: Majavah)
[07:54:27] !log hashar@deploy2002 Started deploy [integration/docroot@ab848e3]: build: Updating eslint-config-wikimedia to 0.24.0
[07:54:35] !log hashar@deploy2002 Finished deploy [integration/docroot@ab848e3]: build: Updating eslint-config-wikimedia to 0.24.0 (duration: 00m 08s)
[07:59:23] SRE, MediaWiki-extensions-OAuth, Performance-Team, Datacenter-Switchover: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (Aschmidt) I've just encountered the said problem. My first attempt to log...
[10:28:58] (KubernetesAPILatency) firing: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:33:58] (KubernetesAPILatency) resolved: High Kubernetes API latency (PATCH nodes) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:40:28] (KubernetesAPILatency) firing: (2) High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[10:45:28] (KubernetesAPILatency) resolved: (2) High Kubernetes API latency (LIST events) on k8s-staging@eqiad - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=eqiad&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency
[13:46:51] SRE, SRE-Unowned, Wikimedia-IRC-RC-Server: Migrate mw_rc_irc servers to Bullseye - https://phabricator.wikimedia.org/T331702 (Peachey88)
[14:06:31] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:06:37] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:08:19] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 49708 bytes in 0.055 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:08:23] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.234 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring
[14:25:29] RECOVERY - Router interfaces on cr3-esams is OK: OK: host 91.198.174.245, interfaces up: 84, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[14:27:03] RECOVERY - Router interfaces on cr2-drmrs is OK: OK: host 185.15.58.129, interfaces up: 62, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[16:43:18] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:48:18] (MediaWikiHighErrorRate) firing: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[16:53:18] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate
[18:59:03] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[19:00:07] SRE, MediaWiki-extensions-OAuth, Performance-Team, Datacenter-Switchover: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (Legoktm)
[19:03:33] SRE, MediaWiki-extensions-OAuth, Performance-Team, Datacenter-Switchover: Frequent OAuth failures on Wikimedia wikis since eqiad was repooled due to db-mainstash replication lag - https://phabricator.wikimedia.org/T332650 (Tgr) The patch did not work: https://logstash.wikimedia.org/goto/c8c557ced...
[19:04:03] (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown
[21:47:07] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[21:50:51] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:02:31] PROBLEM - Check systemd state on an-worker1132 is CRITICAL: CRITICAL - degraded: The following units failed: export_smart_data_dump.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[22:41:09] RECOVERY - Check systemd state on an-worker1132 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state