[00:13:27] PROBLEM - MariaDB Replica IO: s6 on db2141 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2026, Errmsg: error reconnecting to master repl@db2129.codfw.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: SSL connection error00000000:lib(0):func(0):reason(0) https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:15:37] RECOVERY - MariaDB Replica IO: s6 on db2141 is OK: OK slave_io_state Slave_IO_Running: Yes https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[00:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[00:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[01:09:27] PROBLEM - WDQS high update lag on wdqs1013 is CRITICAL: 6.927e+07 ge 4.32e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[01:10:01] RECOVERY - MariaDB Replica Lag: s1 on db2141 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica
[01:28:24] (CR) Juan90264: Enable talk for mobile users on enwiki (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/732705 (https://phabricator.wikimedia.org/T293946) (owner: Juan90264)
[01:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[01:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[02:08:49] PROBLEM - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is CRITICAL: 214 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:10:51] RECOVERY - MediaWiki exceptions and fatals per minute for jobrunner on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=18&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[02:15:11] RECOVERY - WDQS high update lag on wdqs1013 is OK: (C)4.32e+07 ge (W)2.16e+07 ge 2.082e+07 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook%23Update_lag https://grafana.wikimedia.org/dashboard/db/wikidata-query-service?orgId=1&panelId=8&fullscreen
[03:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[03:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[05:16:57] PROBLEM - Maps tiles generation on alert1001 is CRITICAL: CRITICAL: 97.78% of data under the critical threshold [5.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1
[05:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[05:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[06:56:45] (CR) Elukey: [C: +2] Reduce verbosity of the log commit message [cookbooks] - https://gerrit.wikimedia.org/r/737706 (owner: Elukey)
[06:56:49] (PS3) Elukey: Reduce verbosity of the log commit message [cookbooks] - https://gerrit.wikimedia.org/r/737706
[07:10:20] (PS1) Elukey: Import new ROCm version 4.5.1 [puppet] - https://gerrit.wikimedia.org/r/738615 (https://phabricator.wikimedia.org/T295661)
[07:10:26] (PS1) Urbanecm: MenteeOverviewDataUpdater: Use UserOptionsManager::saveOptions [extensions/GrowthExperiments] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/738626 (https://phabricator.wikimedia.org/T295339)
[07:16:36] (PS21) Giuseppe Lavagetto: mediawiki::php: support multiple php version in monitoring too [puppet] - https://gerrit.wikimedia.org/r/736949 (https://phabricator.wikimedia.org/T293450)
[07:16:38] (PS24) Giuseppe Lavagetto: mediawiki: add support for multiple versions in the web configuration [puppet] - https://gerrit.wikimedia.org/r/737330 (https://phabricator.wikimedia.org/T293450)
[07:16:40] (PS16) Giuseppe Lavagetto: mediawiki::php: report prometheus metrics for all php versions [puppet] - https://gerrit.wikimedia.org/r/737929
[07:16:42] (PS9) Giuseppe Lavagetto: deployment-prep: install php 7.4 on a mw appserver [puppet] - https://gerrit.wikimedia.org/r/738194
[07:18:17] (CR) jerkins-bot: [V: -1] mediawiki: add support for multiple versions in the web configuration [puppet] - https://gerrit.wikimedia.org/r/737330 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[07:19:28] <_joe_> wtf?
[07:19:54] <_joe_> 08:18:05 KeyError: key not found: "PARALLEL_PID_FILE"
[07:20:02] <_joe_> not sure I want to look into this
[07:20:16] (CR) Giuseppe Lavagetto: [C: +2] mediawiki::php: support multiple php version in monitoring too [puppet] - https://gerrit.wikimedia.org/r/736949 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[07:29:50] (PS22) Giuseppe Lavagetto: mediawiki::php: support multiple php version in monitoring too [puppet] - https://gerrit.wikimedia.org/r/736949 (https://phabricator.wikimedia.org/T293450)
[07:30:28] (CR) Giuseppe Lavagetto: [C: +2] mediawiki::php: support multiple php version in monitoring too (6 comments) [puppet] - https://gerrit.wikimedia.org/r/736949 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[07:33:52] (CR) Giuseppe Lavagetto: [V: +1 C: +2] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32397/console" [puppet] - https://gerrit.wikimedia.org/r/736949 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[07:40:19] (CR) Giuseppe Lavagetto: [V: +1 C: +2] mediawiki::php: support multiple php version in monitoring too (1 comment) [puppet] - https://gerrit.wikimedia.org/r/736949 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[07:41:20] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6002.drmrs.wmnet with OS buster
[07:41:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:41:30] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6002.drmrs.wmnet with OS buster
[07:44:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[07:46:21] (CR) Elukey: [C: -1] "There is a 4.5.1 release in the apt repositories of AMD, but it doesn't seem released yet. Moving to https://repo.radeon.com/rocm/apt/4.5/" [puppet] - https://gerrit.wikimedia.org/r/738615 (https://phabricator.wikimedia.org/T295661) (owner: Elukey)
[07:49:42] (PS2) Elukey: Import new ROCm version 4.5.1 [puppet] - https://gerrit.wikimedia.org/r/738615 (https://phabricator.wikimedia.org/T295661)
[07:49:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[07:50:10] (PS3) Elukey: Import new ROCm version 4.5 [puppet] - https://gerrit.wikimedia.org/r/738615 (https://phabricator.wikimedia.org/T295661)
[07:52:23] (PS1) Muehlenhoff: Remove access for wkandek [puppet] - https://gerrit.wikimedia.org/r/738834
[08:04:28] (CR) Muehlenhoff: [C: +2] Remove access for wkandek [puppet] - https://gerrit.wikimedia.org/r/738834 (owner: Muehlenhoff)
[08:07:10] !log mmandere@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp6002.drmrs.wmnet with OS buster
[08:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:19] SRE, Traffic, Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6002.drmrs.wmnet with OS buster executed with errors: - cp6002...
[08:07:36] (Abandoned) Muehlenhoff: admin: restricted add Wolfgang as the group approver [puppet] - https://gerrit.wikimedia.org/r/668397 (owner: Jbond)
[08:09:07] (PS1) Muehlenhoff: Update approver for gitlab-roots/vrts-roots [puppet] - https://gerrit.wikimedia.org/r/738836
[08:09:58] (CR) Giuseppe Lavagetto: [C: +1] Remove access for wkandek [puppet] - https://gerrit.wikimedia.org/r/738834 (owner: Muehlenhoff)
[08:10:58] (PS1) Muehlenhoff: Update approver for os-installers [puppet] - https://gerrit.wikimedia.org/r/738837
[08:11:29] (CR) Giuseppe Lavagetto: [C: +1] "recheck" [puppet] - https://gerrit.wikimedia.org/r/737330 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[08:21:15] PROBLEM - Host ms-be1035 is DOWN: PING CRITICAL - Packet loss = 100%
[08:21:51] <_joe_> uhm
[08:22:32] RECOVERY - Host ms-be1035 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms
[08:22:49] (CR) Legoktm: python39: Use shell reimplementation of webservice-runner (6 comments) [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/738503 (https://phabricator.wikimedia.org/T293552) (owner: Legoktm)
[08:22:54] (PS2) Legoktm: python39: Use shell reimplementation of webservice-runner [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/738503 (https://phabricator.wikimedia.org/T293552)
[08:23:33] (CR) Legoktm: "Thanks for the review :) please be as nitpicky as you feel like, if this works out most of it'll be copied into scripts for the other imag" [docker-images/toollabs-images] - https://gerrit.wikimedia.org/r/738503 (https://phabricator.wikimedia.org/T293552) (owner: Legoktm)
[08:24:32] PROBLEM - k8s API server requests latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:24:41] mmmm
[08:25:49] <_joe_> so the host is now ofc actually reachable
[08:26:44] RECOVERY - k8s API server requests latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[08:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[08:46:42] (PS25) Giuseppe Lavagetto: mediawiki: add support for multiple versions in the web configuration [puppet] - https://gerrit.wikimedia.org/r/737330 (https://phabricator.wikimedia.org/T293450)
[08:48:04] (CR) Giuseppe Lavagetto: [C: +1] "🎉 🍷" [puppet] - https://gerrit.wikimedia.org/r/738836 (owner: Muehlenhoff)
[08:48:48] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 2 NOOP 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32398/console" [puppet] - https://gerrit.wikimedia.org/r/737330 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[08:49:39] !log installing glibc bugfix updates from bullseye point release
[08:49:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:50:16] (PS1) Elukey: kserve-inference: add support for revision traffic metrics [deployment-charts] - https://gerrit.wikimedia.org/r/738842 (https://phabricator.wikimedia.org/T289841)
[08:50:49] (CR) jerkins-bot: [V: -1] kserve-inference: add support for revision traffic metrics [deployment-charts] - https://gerrit.wikimedia.org/r/738842 (https://phabricator.wikimedia.org/T289841) (owner: Elukey)
[08:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[08:57:17] I would like to add my name to the channel topic but I'm not a channel operator. Could somebody promote me?
[08:57:27] (PS2) David Caro: controller: consider failure if any host fails [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738420 (https://phabricator.wikimedia.org/T295030)
[09:04:07] (PS2) Elukey: kserve-inference: add support for revision traffic metrics [deployment-charts] - https://gerrit.wikimedia.org/r/738842 (https://phabricator.wikimedia.org/T289841)
[09:04:08] urbanecm: ^
[09:05:03] (PS1) Muehlenhoff: Remove Icinga permissions for wkandek [puppet] - https://gerrit.wikimedia.org/r/738843
[09:07:45] (CR) Muehlenhoff: [C: +2] Remove Icinga permissions for wkandek [puppet] - https://gerrit.wikimedia.org/r/738843 (owner: Muehlenhoff)
[09:14:36] (CR) JMeybohm: Update approver for gitlab-roots/vrts-roots (1 comment) [puppet] - https://gerrit.wikimedia.org/r/738836 (owner: Muehlenhoff)
[09:14:50] (CR) Jelto: "🎊" [puppet] - https://gerrit.wikimedia.org/r/738836 (owner: Muehlenhoff)
[09:17:40] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/738432 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff)
[09:22:29] (CR) Elukey: [C: +2] kserve-inference: add support for revision traffic metrics [deployment-charts] - https://gerrit.wikimedia.org/r/738842 (https://phabricator.wikimedia.org/T289841) (owner: Elukey)
[09:22:48] (PS26) Giuseppe Lavagetto: mediawiki: add support for multiple versions in the web configuration [puppet] - https://gerrit.wikimedia.org/r/737330 (https://phabricator.wikimedia.org/T293450)
[09:25:42] (CR) Giuseppe Lavagetto: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32399/console" [puppet] - https://gerrit.wikimedia.org/r/737330 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[09:27:27] (PS1) Elukey: kserve-inference: add quotation to prometheus annotation [deployment-charts] - https://gerrit.wikimedia.org/r/738852 (https://phabricator.wikimedia.org/T289841)
[09:32:01] (CR) Elukey: [C: +2] kserve-inference: add quotation to prometheus annotation [deployment-charts] - https://gerrit.wikimedia.org/r/738852 (https://phabricator.wikimedia.org/T289841) (owner: Elukey)
[09:34:23] <_joe_> jelto: done
[09:34:53] (PS1) Ladsgroup: media: Port DjVuImage::retrieveMetaData() to use BoxedCommand [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/738636 (https://phabricator.wikimedia.org/T289228)
[09:35:09] (PS1) Ladsgroup: Increase memory limit for DjVu metadata [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/738637 (https://phabricator.wikimedia.org/T275268)
[09:35:22] joe: thanks!
[09:35:23] (CR) Ladsgroup: [C: +2] media: Port DjVuImage::retrieveMetaData() to use BoxedCommand [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/738636 (https://phabricator.wikimedia.org/T289228) (owner: Ladsgroup)
[09:35:27] (CR) Giuseppe Lavagetto: [V: +1 C: +2] mediawiki: add support for multiple versions in the web configuration [puppet] - https://gerrit.wikimedia.org/r/737330 (https://phabricator.wikimedia.org/T293450) (owner: Giuseppe Lavagetto)
[09:35:29] (CR) Ladsgroup: [C: +2] Increase memory limit for DjVu metadata [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/738637 (https://phabricator.wikimedia.org/T275268) (owner: Ladsgroup)
[09:35:39] ops-codfw, ops-ulsfo: Update PDUs name-server config - https://phabricator.wikimedia.org/T295668 (ayounsi) p:Triage→Low
[09:36:40] (Merged) jenkins-bot: kserve-inference: add quotation to prometheus annotation [deployment-charts] - https://gerrit.wikimedia.org/r/738852 (https://phabricator.wikimedia.org/T289841) (owner: Elukey)
[09:37:12] ops-codfw, ops-ulsfo: Update PDUs name-server config - https://phabricator.wikimedia.org/T295668 (ayounsi)
[09:39:01] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-editquality' for release 'main' .
[09:39:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:39:13] SRE, Traffic-Icebox, Patch-For-Review: Decom LVS recdns - https://phabricator.wikimedia.org/T239993 (ayounsi) As there are a lot of PDUs, auditing them manually is quite time-consuming, so running a 24h capture and only updating the ones that show up seems more efficient. I opened T295668 for DCops t...
[09:40:04] jelto: you can also prepare a patchset for the bot to add you onto the channel access list - https://meta.wikimedia.org/wiki/IRC/Bots/ircservserv
[09:41:45] https://github.com/wikimedia/wikimedia-irc-ircservserv-config/blob/master/channels/wikimedia-operations.toml
[09:45:19] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-draftquality' for release 'main' .
[09:45:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:38] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revscoring-articlequality' for release 'main' .
[09:45:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[09:45:55] (CR) David Caro: [C: +1] "LGTM, any/all nits can be ignored." [software/puppet-compiler] - https://gerrit.wikimedia.org/r/738261 (owner: Jbond)
[09:48:12] (CR) Effie Mouzeli: [C: +1] "Approved💃" [puppet] - https://gerrit.wikimedia.org/r/738836 (owner: Muehlenhoff)
[09:49:07] (CR) Jcrespo: [C: +2] P:mediabackup::storage: update mino daemon to use the chained certificate [puppet] - https://gerrit.wikimedia.org/r/738439 (https://phabricator.wikimedia.org/T295594) (owner: Jbond)
[09:49:41] (CR) JMeybohm: "I think the diff is pretty misleading here as it always uses the latest released chart version (from chartmuseum) to create diffs. As that" [deployment-charts] - https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[09:49:45] (PS3) JMeybohm: charts:eventgate bump common_templates and standardize labels [deployment-charts] - https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[09:49:47] (PS1) JMeybohm: Run helmfile commands against the local version of the chart [deployment-charts] - https://gerrit.wikimedia.org/r/738857
[09:50:42] (CR) JMeybohm: "@joe: You mind taking a look?" [deployment-charts] - https://gerrit.wikimedia.org/r/738857 (owner: JMeybohm)
[09:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[09:52:14] (CR) jerkins-bot: [V: -1] charts:eventgate bump common_templates and standardize labels [deployment-charts] - https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: Jelto)
[09:52:18] (CR) Giuseppe Lavagetto: [C: +1] Run helmfile commands against the local version of the chart [deployment-charts] - https://gerrit.wikimedia.org/r/738857 (owner: JMeybohm)
[09:52:26] (CR) jerkins-bot: [V: -1] Run helmfile commands against the local version of the chart [deployment-charts] - https://gerrit.wikimedia.org/r/738857 (owner: JMeybohm)
[09:53:42] SRE-tools, Data-Persistence-Backup, Infrastructure-Foundations, media-backups, and 2 others: minio monitoring broken due to TLS certificate marked as insecure - https://phabricator.wikimedia.org/T295594 (jcrespo)
[09:54:06] SRE-tools, Data-Persistence-Backup, Infrastructure-Foundations, media-backups, and 2 others: minio monitoring broken due to TLS certificate marked as insecure - https://phabricator.wikimedia.org/T295594 (jcrespo)
[09:54:42] (Merged) jenkins-bot: media: Port DjVuImage::retrieveMetaData() to use BoxedCommand [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/738636 (https://phabricator.wikimedia.org/T289228) (owner: Ladsgroup)
[09:54:48] (Merged) jenkins-bot: Increase memory limit for DjVu metadata [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/738637 (https://phabricator.wikimedia.org/T275268) (owner: Ladsgroup)
[09:58:55] (PS1) Ladsgroup: media: Build and use JSON for metadata of djvu instead of XML [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/738638 (https://phabricator.wikimedia.org/T275268)
[09:59:07] (CR) Ladsgroup: [C: +2] media: Build and use JSON for metadata of djvu instead of XML [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/738638 (https://phabricator.wikimedia.org/T275268) (owner: Ladsgroup)
[09:59:14] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.7/includes/media/: Backport: [[gerrit:738636|media: Port DjVuImage::retrieveMetaData() to use BoxedCommand (T289228)]] (duration: 00m 56s)
[09:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:19] T289228: Convert media handling code (PdfHandler, PagedTiffHandler) to use Shellbox - https://phabricator.wikimedia.org/T289228
[10:00:20] !log update Java on Hadoop and Presto nodes
[10:00:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:23] SRE, Infrastructure-Foundations, netops: cr1-eqiad -> Charter/AS7843 connectivity is broken - https://phabricator.wikimedia.org/T295650 (ayounsi) Thanks for taking care of it. Proper fix is most likely T295672.
[10:09:30] jouncebot: nowandnext
[10:09:30] No deployments scheduled for the next 1 hour(s) and 50 minute(s)
[10:09:30] In 1 hour(s) and 50 minute(s): UTC morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211115T1200)
[10:09:46] Amir1: mind if i +2 my own backport? (can wait if need to)
[10:10:09] urbanecm: sure
[10:10:27] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[10:10:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:10:40] (CR) Urbanecm: [C: +2] MenteeOverviewDataUpdater: Use UserOptionsManager::saveOptions [extensions/GrowthExperiments] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/738626 (https://phabricator.wikimedia.org/T295339) (owner: Urbanecm)
[10:10:44] thanks
[10:15:13] SRE, DBA, cloud-services-team (Kanban): db1112 (s3 contribs/rc replica) is down - https://phabricator.wikimedia.org/T294295 (Kormat) >>! In T294295#7500678, @Cmjohnson wrote: > @Marostegui 15 Nov 1000 Local 1500GMT ? Marostegui is out today, but i can handle that, so yep, let's go for it.
[10:15:32] (PS1) Btullis: Configure stat servers to use /srv/spark-tmp as spark.local.dir [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346)
[10:15:40] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[10:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:15:50] PROBLEM - BGP status on cr4-ulsfo is CRITICAL: BGP CRITICAL - AS2914/IPv4: Active - NTT, AS2914/IPv6: Active - NTT https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status
[10:15:55] SRE-tools, Data-Persistence-Backup, Infrastructure-Foundations, media-backups, observability: minio monitoring broken due to TLS certificate marked as insecure - https://phabricator.wikimedia.org/T295594 (jcrespo) Open→Resolved a:jbond The patch + running puppet fixed the issue. I...
[10:18:10] (Merged) jenkins-bot: media: Build and use JSON for metadata of djvu instead of XML [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/738638 (https://phabricator.wikimedia.org/T275268) (owner: Ladsgroup)
[10:19:35] (CR) Ideophagous: "Hello Urbanecm," [mediawiki-config] - https://gerrit.wikimedia.org/r/735713 (owner: Ideophagous)
[10:20:52] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/735713 (owner: Ideophagous)
[10:22:16] (CR) jerkins-bot: [V: -1] Bug:T291737 Squashed two commits into one, previous commit comments follow: Bug:T291737 Change-Id: Ib263a5419c6ace911a597d025b28d6ef13549c10 [mediawiki-config] - https://gerrit.wikimedia.org/r/735713 (owner: Ideophagous)
[10:23:38] (CR) Urbanecm: "the commit message is very hard to read. Can you change it?" [mediawiki-config] - https://gerrit.wikimedia.org/r/735713 (owner: Ideophagous)
[10:23:44] (CR) Urbanecm: [C: -1] Bug:T291737 Squashed two commits into one, previous commit comments follow: Bug:T291737 Change-Id: Ib263a5419c6ace911a597d025b28d6ef13549c10 [mediawiki-config] - https://gerrit.wikimedia.org/r/735713 (owner: Ideophagous)
[10:23:56] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.7/includes/media/: Backport: [[gerrit:738638|media: Build and use JSON for metadata of djvu instead of XML (T275268 T192866)]] (duration: 00m 56s)
[10:24:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:24:02] T275268: Address "image" table capacity problems by storing pdf/djvu text outside file metadata - https://phabricator.wikimedia.org/T275268
[10:24:02] T192866: Some DjVu files have too much metadata to fit in their database column - https://phabricator.wikimedia.org/T192866
[10:24:44] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[10:26:52] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27
[10:27:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[10:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:24] (CR) Muehlenhoff: [C: +2] Add ownership annotations for more Search Platform services [puppet] - https://gerrit.wikimedia.org/r/738432 (https://phabricator.wikimedia.org/T216088) (owner: Muehlenhoff)
[10:30:52] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[10:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:13] (Merged) jenkins-bot: MenteeOverviewDataUpdater: Use UserOptionsManager::saveOptions [extensions/GrowthExperiments] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/738626 (https://phabricator.wikimedia.org/T295339) (owner: Urbanecm)
[10:32:22] Amir1: you done now? :)
[10:32:36] urbanecm: mostly but it'll take some time, go ahead
[10:32:45] okay, deploying...
[10:34:10] !log Rebuilding rpki1001.eqiad.wmnet. with larger disk - going to decom and then re-create via cookbooks.
[10:34:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:23] !log cmooney@cumin1001 START - Cookbook sre.hosts.decommission for hosts rpki1001.eqiad.wmnet
[10:34:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:34:30] syncing in one batch, runs as a background process at mwmaint, no user interaction
[10:35:28] !log urbanecm@deploy1002 Synchronized php-1.38.0-wmf.7/extensions/GrowthExperiments/: 05d6550218f21f89171fcb8c73230e0855cf41a4: MenteeOverviewDataUpdater: Use UserOptionsManager::saveOptions (T295339) (duration: 00m 56s)
[10:35:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:35:31] T295339: GrowthExperiments\Maintenance\UpdateMenteeData: Cannot execute query from GrowthExperiments\Maintenance\UpdateMenteeData while transaction status is ERROR - https://phabricator.wikimedia.org/T295339
[10:37:13] Amir1: done :)
[10:37:17] (CR) Ladsgroup: [C: +2] "This change is ready for review." [core] (wmf/1.38.0-wmf.7) - https://gerrit.wikimedia.org/r/738639 (owner: Ladsgroup)
[10:37:19] (PS1) Ideophagous: Bug:T291737 Squashed two commits into one, previous commit comments follow: Bug:T291737 Change-Id: Ib263a5419c6ace911a597d025b28d6ef13549c10 [mediawiki-config] - https://gerrit.wikimedia.org/r/738870 (https://phabricator.wikimedia.org/T291737)
[10:37:35] \o/
[10:37:48] (PS2) Ideophagous: Bug:T291737 Squashed two commits into one, previous commit comments follow: Bug:T291737 Change-Id: Ib263a5419c6ace911a597d025b28d6ef13549c10 [mediawiki-config] - https://gerrit.wikimedia.org/r/738870 (https://phabricator.wikimedia.org/T291737)
[10:38:32] (CR) Urbanecm: "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/738870 (https://phabricator.wikimedia.org/T291737) (owner: Ideophagous)
[10:38:55] (CR) Ideophagous: "I fixed the tabulation issue with the previous commit 735713." [mediawiki-config] - https://gerrit.wikimedia.org/r/738870 (https://phabricator.wikimedia.org/T291737) (owner: Ideophagous)
[10:39:38] (CR) jerkins-bot: [V: -1] Bug:T291737 Squashed two commits into one, previous commit comments follow: Bug:T291737 Change-Id: Ib263a5419c6ace911a597d025b28d6ef13549c10 [mediawiki-config] - https://gerrit.wikimedia.org/r/738870 (https://phabricator.wikimedia.org/T291737) (owner: Ideophagous)
[10:39:41] (CR) Hnowlan: [V: +1 C: +2] maps: Add cronjob to send tile invalidation events [puppet] - https://gerrit.wikimedia.org/r/737898 (https://phabricator.wikimedia.org/T270175) (owner: Jgiannelos)
[10:40:49] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[10:40:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:40:51] (CR) Urbanecm: [C: -1] "better, but commit message still doesn't follow the guidelines " [mediawiki-config] - https://gerrit.wikimedia.org/r/738870 (https://phabricator.wikimedia.org/T291737) (owner: Ideophagous)
[10:41:12] (CR) Giuseppe Lavagetto: [C: -1] "LGTM overall, I would change the --hooksdir argument to something a bit less hacky. Other than that, consider mine a +1" [debs/wmf-certificates] - https://gerrit.wikimedia.org/r/737884 (owner: JMeybohm)
[10:43:47] (PS2) Btullis: Configure stat servers to use /srv/spark-tmp as spark.local.dir [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346)
[10:43:51] !log cmooney@cumin1001 END (PASS) - Cookbook sre.hosts.decommission (exit_code=0) for hosts rpki1001.eqiad.wmnet
[10:43:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:59] SRE, Infrastructure-Foundations, netops: Rebuild Routinator (rpki) VMs with larger disk - https://phabricator.wikimedia.org/T292503 (ops-monitoring-bot) cookbooks.sre.hosts.decommission executed by cmooney@cumin1001 for hosts: `rpki1001.eqiad.wmnet` - rpki1001.eqiad.wmnet (**PASS**) - Downtimed hos...
[10:44:27] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' .
[10:44:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:45:35] (CR) Btullis: [V: +1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32401/console" [puppet] - https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346) (owner: Btullis)
[10:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org
[10:46:02] (CR) Giuseppe Lavagetto: [C: +1] Install update-ca-certificates hook maintaining wmf-ca-certificates.crt (1 comment) [debs/wmf-certificates] - https://gerrit.wikimedia.org/r/737885 (owner: JMeybohm)
[10:48:49] (PS1) Ideophagous: Bug:T291737 updated arywiki NSs and fixed tabulation issue [mediawiki-config] - https://gerrit.wikimedia.org/r/738872
[10:49:31] (CR) Urbanecm:
"recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738872 (owner: 10Ideophagous) [10:50:24] (03CR) 10jerkins-bot: [V: 04-1] Bug:T291737 updated arywiki NSs and fixed tabulation issue [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738872 (owner: 10Ideophagous) [10:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [10:51:08] (03CR) 10Urbanecm: [C: 04-1] "please move the Bug: to next line (to follow mediawiki.org/wiki/Gerrit/Commit_message_guidelines)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738872 (owner: 10Ideophagous) [10:53:09] !log upgrading python3-wmflib to 1.0.0-1 on all hosts buster+ [10:53:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:53:14] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php: report prometheus metrics for all php versions [puppet] - 10https://gerrit.wikimedia.org/r/737929 (owner: 10Giuseppe Lavagetto) [10:53:21] (03PS17) 10Giuseppe Lavagetto: mediawiki::php: report prometheus metrics for all php versions [puppet] - 10https://gerrit.wikimedia.org/r/737929 [10:53:23] (03CR) 10JMeybohm: Install update-ca-certificates hook maintaining wmf-ca-certificates.crt (031 comment) [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/737885 (owner: 10JMeybohm) [10:54:02] (03CR) 10Kormat: "One comment, the rest looks good." 
[puppet] - 10https://gerrit.wikimedia.org/r/738265 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [10:54:08] !log cmooney@cumin1001 START - Cookbook sre.ganeti.makevm for new host rpki1001.eqiad.wmnet [10:54:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:24] <_joe_> all the unknown checks for php opcache should be ok now [10:54:29] <_joe_> sorry for the inconvenience [10:55:27] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2002.codfw.wmnet [10:55:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:55:40] (03Merged) 10jenkins-bot: media: Make new DjVu metadata handler more defensive [core] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/738639 (owner: 10Ladsgroup) [10:57:12] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.7/includes/media/DjVuHandler.php: Backport: [[gerrit:738639|media: Make new DjVu metadata handler more defensive]] (duration: 00m 54s) [10:57:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:10] (03PS3) 10JMeybohm: Add an update-ca-certificates hook maintaining wmf-ca-certificates.crt [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/737884 [10:58:12] (03PS3) 10JMeybohm: Install update-ca-certificates hook maintaining wmf-ca-certificates.crt [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/737885 [10:58:50] <_joe_> Amir1: if you have further patches, can I ask you to pause for ~ 5 minutes? [10:59:00] _joe_: I'm done [10:59:13] (03CR) 10David Caro: "There's a couple typos/minor errors. Feel free to ignore the nits!" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738437 (owner: 10Jbond) [10:59:15] <_joe_> ack great [10:59:33] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[10:59:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:26] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2002.codfw.wmnet [11:00:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:08] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [11:03:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:48] !log wikiadmin@10.64.0.164(ukwiki)> delete from growthexperiments_mentor_mentee where gemm_mentee_id = 464811 /* Martin Urbanec (WMF) */; [11:04:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:05:36] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:05:50] (03CR) 10Arturo Borrero Gonzalez: "other than the inlined comment, LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/738411 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:06:56] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 68, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:07:06] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [11:07:09] (03CR) 10Arturo Borrero Gonzalez: "other than the inlined comment, LGTM." 
[puppet] - 10https://gerrit.wikimedia.org/r/738412 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:07:28] (03PS2) 10David Caro: openstack: codfw1dev: remove cinder keyring [puppet] - 10https://gerrit.wikimedia.org/r/738411 (https://phabricator.wikimedia.org/T293752) [11:07:42] (03CR) 10David Caro: openstack: codfw1dev: remove cinder keyring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738411 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:07:52] !log cmooney@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) for new host rpki1001.eqiad.wmnet [11:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:59] (03CR) 10Arturo Borrero Gonzalez: "other than the inlined comment, LGTM." [puppet] - 10https://gerrit.wikimedia.org/r/738413 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:08:14] (03PS2) 10David Caro: openstack: codfw1dev: enable cinder key generation [puppet] - 10https://gerrit.wikimedia.org/r/738412 (https://phabricator.wikimedia.org/T293752) [11:08:26] (03CR) 10David Caro: openstack: codfw1dev: enable cinder key generation (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738412 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:08:32] (03PS3) 10David Caro: openstack: codfw1dev: enable cinder key generation [puppet] - 10https://gerrit.wikimedia.org/r/738412 (https://phabricator.wikimedia.org/T293752) [11:08:50] (03CR) 10Ideophagous: Bug:T291737 updated arywiki NSs and fixed tabulation issue (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738872 (owner: 10Ideophagous) [11:08:53] (03PS2) 10David Caro: openstack: eqiad1: Remove cinder key generation from cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/738413 (https://phabricator.wikimedia.org/T293752) [11:09:02] (03CR) 10David Caro: openstack: eqiad1: Remove cinder key generation from cloudcontrols (031 comment) [puppet] - 
10https://gerrit.wikimedia.org/r/738413 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:09:08] (03PS3) 10David Caro: openstack: eqiad1: Remove cinder key generation from cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/738413 (https://phabricator.wikimedia.org/T293752) [11:09:28] (03CR) 10Arturo Borrero Gonzalez: "What would you think about squashing this patch with the previous one?" [puppet] - 10https://gerrit.wikimedia.org/r/738414 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:13:57] !log jmm@cumin2002 START - Cookbook sre.hosts.reboot-single for host ganeti-test2003.codfw.wmnet [11:13:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:15:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] openstack: codfw1dev: remove cinder keyring [puppet] - 10https://gerrit.wikimedia.org/r/738411 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:17:54] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] openstack: codfw1dev: enable cinder key generation [puppet] - 10https://gerrit.wikimedia.org/r/738412 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:18:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] openstack: eqiad1: Remove cinder key generation from cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/738413 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:18:45] (03PS2) 10David Caro: openstack: eqiad1: enable cinder keyring generation on control nodes [puppet] - 10https://gerrit.wikimedia.org/r/738414 (https://phabricator.wikimedia.org/T293752) [11:18:53] (03PS3) 10David Caro: openstack: eqiad1: enable cinder keyring generation on control nodes [puppet] - 10https://gerrit.wikimedia.org/r/738414 (https://phabricator.wikimedia.org/T293752) [11:19:14] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:19:34] (03CR) 10David Caro: openstack: eqiad1: enable cinder keyring generation on control nodes (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/738414 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:19:42] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host ganeti-test2003.codfw.wmnet [11:19:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:53] (03PS1) 10Cathal Mooney: Updating MAC address for install server DHCP config for rpki1001 as it is being rebuilt to provide more disk space and has a new MAC. [puppet] - 10https://gerrit.wikimedia.org/r/738873 (https://phabricator.wikimedia.org/T295650) [11:20:18] (03CR) 10David Caro: [C: 03+2] openstack: codfw1dev: remove cinder keyring [puppet] - 10https://gerrit.wikimedia.org/r/738411 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:20:20] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] openstack: eqiad1: enable cinder keyring generation on control nodes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738414 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:20:49] (03CR) 10jerkins-bot: [V: 04-1] Updating MAC address for install server DHCP config for rpki1001 as it is being rebuilt to provide more disk space and has a new MAC. 
[puppet] - 10https://gerrit.wikimedia.org/r/738873 (https://phabricator.wikimedia.org/T295650) (owner: 10Cathal Mooney) [11:21:36] (03PS1) 10Joal: Import commons mediainfo json dumps to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/738874 [11:23:35] (03CR) 10jerkins-bot: [V: 04-1] Import commons mediainfo json dumps to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/738874 (owner: 10Joal) [11:25:49] (03CR) 10David Caro: [C: 03+2] openstack: codfw1dev: enable cinder key generation [puppet] - 10https://gerrit.wikimedia.org/r/738412 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [11:26:29] (03PS1) 10Ideophagous: arywiki NS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738876 (https://phabricator.wikimedia.org/T291737) [11:26:50] (03PS2) 10Cathal Mooney: Updating MAC address for install server DHCP config for rpki1001 as it is being rebuilt to provide more disk space and has a new MAC. [puppet] - 10https://gerrit.wikimedia.org/r/738873 (https://phabricator.wikimedia.org/T295650) [11:27:44] (03CR) 10Ideophagous: "Hello Urbanecm," [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738876 (https://phabricator.wikimedia.org/T291737) (owner: 10Ideophagous) [11:27:55] (03PS3) 10Cathal Mooney: Updating MAC address for install server DHCP config for rpki1001 as it is being rebuilt to provide more disk space and has a new MAC. [puppet] - 10https://gerrit.wikimedia.org/r/738873 (https://phabricator.wikimedia.org/T295650) [11:29:06] (03PS1) 10Giuseppe Lavagetto: mediawiki::php::monitoring: fix double output [puppet] - 10https://gerrit.wikimedia.org/r/738878 [11:29:14] (03PS4) 10Cathal Mooney: Modifying MAC address for install server DHCP config for rpki1001 which is being rebuilt to provide more disk space. 
[puppet] - 10https://gerrit.wikimedia.org/r/738873 (https://phabricator.wikimedia.org/T295650) [11:30:04] (03CR) 10jerkins-bot: [V: 04-1] Modifying MAC address for install server DHCP config for rpki1001 which is being rebuilt to provide more disk space. [puppet] - 10https://gerrit.wikimedia.org/r/738873 (https://phabricator.wikimedia.org/T295650) (owner: 10Cathal Mooney) [11:32:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] mediawiki::php::monitoring: fix double output [puppet] - 10https://gerrit.wikimedia.org/r/738878 (owner: 10Giuseppe Lavagetto) [11:34:03] (03CR) 10Michael Große: Add missing termbox codes from Wikibase (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734722 (https://phabricator.wikimedia.org/T277836) (owner: 10Mbch331) [11:35:06] (03PS5) 10Cathal Mooney: Change MAC address in DHCP config for rpki1001 [puppet] - 10https://gerrit.wikimedia.org/r/738873 (https://phabricator.wikimedia.org/T295650) [11:36:03] (03CR) 10Jbond: [C: 03+1] Update approver for os-installers [puppet] - 10https://gerrit.wikimedia.org/r/738837 (owner: 10Muehlenhoff) [11:36:10] (03CR) 10Cathal Mooney: [C: 03+2] Change MAC address in DHCP config for rpki1001 [puppet] - 10https://gerrit.wikimedia.org/r/738873 (https://phabricator.wikimedia.org/T295650) (owner: 10Cathal Mooney) [11:37:32] (03PS2) 10Joal: Import commons mediainfo json dumps to HDFS [puppet] - 10https://gerrit.wikimedia.org/r/738874 (https://phabricator.wikimedia.org/T258834) [11:40:24] PROBLEM - OSPF status on cr4-ulsfo is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:40:29] (03CR) 10JMeybohm: Add an update-ca-certificates hook maintaining wmf-ca-certificates.crt (031 comment) [debs/wmf-certificates] - 10https://gerrit.wikimedia.org/r/737884 (owner: 10JMeybohm) [11:40:32] PROBLEM - Router interfaces on cr4-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 69, down: 1, dormant: 0, excluded: 0, unused: 0: 
https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:40:44] (03CR) 10Jbond: [C: 03+1] "lgtm" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738420 (https://phabricator.wikimedia.org/T295030) (owner: 10David Caro) [11:40:59] (03PS1) 10Arturo Borrero Gonzalez: wmcs: network: tests: include docs reference [cookbooks] (wmcs) - 10https://gerrit.wikimedia.org/r/738881 [11:41:26] PROBLEM - OSPF status on cr2-eqdfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [11:46:10] (03CR) 10Jcrespo: "@Moritz Let's remove that line addition and merge the rest, so we don't have to wait on further approval (manuel is out today BTW)?" [puppet] - 10https://gerrit.wikimedia.org/r/738265 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [11:46:33] (03CR) 10Alexandros Kosiaris: [C: 04-1] "It's actually going to be Lukasz, so -1" [puppet] - 10https://gerrit.wikimedia.org/r/738836 (owner: 10Muehlenhoff) [11:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [11:51:10] RECOVERY - Router interfaces on cr4-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 70, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [11:51:34] 10SRE, 10Infrastructure-Foundations, 10netops, 10Patch-For-Review: cr1-eqiad -> Charter/AS7843 connectivity is broken - https://phabricator.wikimedia.org/T295650 (10cmooney) Please ignore the above, unrelated 
CRs. I pasted the wrong task ID when doing the commit. [11:52:04] RECOVERY - OSPF status on cr2-eqdfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:53:12] RECOVERY - OSPF status on cr4-ulsfo is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [11:59:07] (03CR) 10Majavah: python39: Use shell reimplementation of webservice-runner (031 comment) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/738503 (https://phabricator.wikimedia.org/T293552) (owner: 10Legoktm) [11:59:29] !log btullis@cumin1001 START - Cookbook sre.druid.roll-restart-workers for Druid analytics cluster: Roll restart of Druid jvm daemons. [11:59:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for UTC morning backport window . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211115T1200). [12:00:04] No Gerrit patches in the queue for this window AFAICS. 
[12:00:28] (03CR) 10Urbanecm: [C: 03+2] uzwiki: Enable VisualEditor by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738296 (https://phabricator.wikimedia.org/T294245) (owner: 10Urbanecm) [12:01:18] (03Merged) 10jenkins-bot: uzwiki: Enable VisualEditor by default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738296 (https://phabricator.wikimedia.org/T294245) (owner: 10Urbanecm) [12:01:20] (03PS15) 10Jbond: Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738261 [12:02:54] !log urbanecm@deploy1002 Synchronized dblists/visualeditor-nondefault.dblist: 6b3bacd986ab041a5e3aee06c6de04e344dd8015: uzwiki: Enable VisualEditor by default (T294245) (duration: 00m 56s) [12:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:02:57] T294245: Activation of the visual editor and Growth features by default on Uzbek Wikipedia - https://phabricator.wikimedia.org/T294245 [12:08:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [12:08:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:09:49] (03PS16) 10Jbond: Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738261 [12:12:13] (03CR) 10Jbond: "updated thanks" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738261 (owner: 10Jbond) [12:12:20] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[12:12:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:26] PROBLEM - etcd request latencies on ml-serve-ctrl2002 is CRITICAL: instance=10.192.48.41 operation={get,listWithCount} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:14:32] RECOVERY - etcd request latencies on ml-serve-ctrl2002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [12:14:36] (03CR) 10Klausman: [C: 03+1] Import new ROCm version 4.5 [puppet] - 10https://gerrit.wikimedia.org/r/738615 (https://phabricator.wikimedia.org/T295661) (owner: 10Elukey) [12:15:31] (03CR) 10jerkins-bot: [V: 04-1] Localisation updates from https://translatewiki.net. [software/mailman-templates] - 10https://gerrit.wikimedia.org/r/738894 (owner: 10L10n-bot) [12:17:22] (03PS1) 10AOkoth: gitlab: disable restore timer to perform upgrade [puppet] - 10https://gerrit.wikimedia.org/r/738898 (https://phabricator.wikimedia.org/T294580) [12:20:42] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:20:58] (03CR) 10AOkoth: "https://puppet-compiler.wmflabs.org/compiler1001/32402/" [puppet] - 10https://gerrit.wikimedia.org/r/738898 (https://phabricator.wikimedia.org/T294580) (owner: 10AOkoth) [12:22:52] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [12:24:20] 10SRE, 10Infrastructure-Foundations, 10netops: cr1-eqiad -> Charter/AS7843 connectivity is broken - https://phabricator.wikimedia.org/T295650 (10cmooney) a:03cmooney [12:27:09] (03PS1) 10Cathal Mooney: Add policy-statement to CRs which sets next-hop self in iBGP. [homer/public] - 10https://gerrit.wikimedia.org/r/738899 (https://phabricator.wikimedia.org/T295672) [12:27:59] 10SRE-tools, 10Data-Persistence-Backup, 10Infrastructure-Foundations, 10media-backups, 10observability: minio monitoring broken due to TLS certificate marked as insecure - https://phabricator.wikimedia.org/T295594 (10jbond) @jcrespo To be clear there was nothing wrong with your config, this is something... [12:30:13] (03PS4) 10Jbond: f-strings: convert strings to f-strings [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738437 [12:31:26] (03CR) 10Ayounsi: [C: 03+1] Add policy-statement to CRs which sets next-hop self in iBGP. [homer/public] - 10https://gerrit.wikimedia.org/r/738899 (https://phabricator.wikimedia.org/T295672) (owner: 10Cathal Mooney) [12:32:52] (03CR) 10Nikerabbit: [V: 03+2] Localisation updates from https://translatewiki.net. 
[software/mailman-templates] - 10https://gerrit.wikimedia.org/r/738894 (owner: 10L10n-bot) [12:37:36] (03PS1) 10JMeybohm: dragonfly::dfdaemon: Serve the chained cert to clients [puppet] - 10https://gerrit.wikimedia.org/r/738900 [12:38:40] (03CR) 10JMeybohm: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32403/console" [puppet] - 10https://gerrit.wikimedia.org/r/738900 (owner: 10JMeybohm) [12:38:47] 10SRE, 10Infrastructure-Foundations, 10netops: cr1-eqiad -> Charter/AS7843 connectivity is broken - https://phabricator.wikimedia.org/T295650 (10cmooney) > My guess would be that this is Charter filtering traffic on their IXP port to only routers they have peerings with, for security/anti-DDoS reasons. > >... [12:40:04] (03PS2) 10JMeybohm: dragonfly::dfdaemon: Serve the chained cert to clients [puppet] - 10https://gerrit.wikimedia.org/r/738900 [12:41:48] (03CR) 10Jbond: [C: 03+1] dragonfly::dfdaemon: Serve the chained cert to clients [puppet] - 10https://gerrit.wikimedia.org/r/738900 (owner: 10JMeybohm) [12:42:24] (03CR) 10JMeybohm: [C: 03+2] dragonfly::dfdaemon: Serve the chained cert to clients [puppet] - 10https://gerrit.wikimedia.org/r/738900 (owner: 10JMeybohm) [12:42:59] (03PS1) 10Jbond: P:multirootca: dont include the root CA [puppet] - 10https://gerrit.wikimedia.org/r/738901 [12:43:21] topranks: fine to merge your MAC change? 
[12:43:23] (03CR) 10Jbond: [V: 03+2 C: 03+2] P:multirootca: dont include the root CA [puppet] - 10https://gerrit.wikimedia.org/r/738901 (owner: 10Jbond) [12:44:06] jayme: apologies I thought I'd done so [12:44:18] yes by all means please proceed [12:44:23] topranks: no problem :) [12:44:35] (03CR) 10David Caro: "Just one small issue, I'll +1 so there's no need to bounce :)" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738261 (owner: 10Jbond) [12:44:59] (03CR) 10AOkoth: [C: 03+2] gitlab: disable restore timer to perform upgrade [puppet] - 10https://gerrit.wikimedia.org/r/738898 (https://phabricator.wikimedia.org/T294580) (owner: 10AOkoth) [12:45:28] !log btullis@cumin1001 END (PASS) - Cookbook sre.druid.roll-restart-workers (exit_code=0) for Druid analytics cluster: Roll restart of Druid jvm daemons. [12:45:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [12:46:33] !log btullis@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade. 
[12:46:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:38] (03PS17) 10Jbond: Pathlib: switch to pathlib vs os.path everywhere [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738261 [12:48:51] (03PS5) 10Jbond: f-strings: convert strings to f-strings [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738437 [12:49:48] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for SCherukuwada - https://phabricator.wikimedia.org/T295550 (10Jelto) [12:50:04] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SCherukuwada - https://phabricator.wikimedia.org/T295552 (10Jelto) [12:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [12:58:31] (03PS6) 10Jbond: f-strings: convert strings to f-strings [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738437 [13:00:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: eqiad1: Remove cinder key generation from cloudcontrols [puppet] - 10https://gerrit.wikimedia.org/r/738413 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [13:02:10] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for SCherukuwada - https://phabricator.wikimedia.org/T295550 (10Jelto) [13:02:23] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SCherukuwada - https://phabricator.wikimedia.org/T295552 (10Jelto) [13:05:19] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] openstack: eqiad1: enable cinder keyring generation on control nodes [puppet] - 10https://gerrit.wikimedia.org/r/738414 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [13:06:16] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:07:58] (03PS2) 10Muehlenhoff: Update approver for gitlab-roots/vrts-roots [puppet] - 10https://gerrit.wikimedia.org/r/738836 [13:11:42] (03CR) 10Ladsgroup: [C: 03+2] "This change is ready for review." [core] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/738641 (owner: 10Ladsgroup) [13:17:53] (03CR) 10Cathal Mooney: [C: 03+2] Add policy-statement to CRs which sets next-hop self in iBGP. [homer/public] - 10https://gerrit.wikimedia.org/r/738899 (https://phabricator.wikimedia.org/T295672) (owner: 10Cathal Mooney) [13:18:33] (03Merged) 10jenkins-bot: Add policy-statement to CRs which sets next-hop self in iBGP. [homer/public] - 10https://gerrit.wikimedia.org/r/738899 (https://phabricator.wikimedia.org/T295672) (owner: 10Cathal Mooney) [13:21:05] (03CR) 10Muehlenhoff: Add ownership annotations for additional Data Persistence services (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738265 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [13:23:01] (03PS2) 10Muehlenhoff: Add ownership annotations for additional Data Persistence services [puppet] - 10https://gerrit.wikimedia.org/r/738265 (https://phabricator.wikimedia.org/T216088) [13:24:19] (03CR) 10Jcrespo: [C: 03+1] "+1 for backup owner" [puppet] - 10https://gerrit.wikimedia.org/r/738265 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [13:25:17] !log Adding new policy-statement to CR routers via homer to set next-hop self on iBGP sessions (not yet configured for any peers). 
[13:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:26:31] (03CR) 10Jgiannelos: [C: 03+2] tegola-vector-tiles: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/738210 (owner: 10Jgiannelos) [13:29:10] (03CR) 10Alexandros Kosiaris: [C: 03+1] Update approver for gitlab-roots/vrts-roots [puppet] - 10https://gerrit.wikimedia.org/r/738836 (owner: 10Muehlenhoff) [13:30:05] (03CR) 10Kormat: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/738265 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [13:30:24] PROBLEM - SSH on contint1001.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [13:30:41] (03Merged) 10jenkins-bot: tegola-vector-tiles: Bump image to latest version [deployment-charts] - 10https://gerrit.wikimedia.org/r/738210 (owner: 10Jgiannelos) [13:34:39] !log jgiannelos@deploy1002 helmfile [staging] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [13:34:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:42] (03Merged) 10jenkins-bot: Revert "media: Port DjVuImage::retrieveMetaData() to use BoxedCommand" [core] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/738641 (owner: 10Ladsgroup) [13:36:11] !log jgiannelos@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . [13:36:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:04] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.7/includes: Backport: [[gerrit:738641|Revert "media: Port DjVuImage::retrieveMetaData() to use BoxedCommand"]] (duration: 01m 01s) [13:40:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:39] !log start of djvu clean up in commons in a screen. 
Gonna take a couple of days (T275268) [13:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:40:42] T275268: Address "image" table capacity problems by storing pdf/djvu text outside file metadata - https://phabricator.wikimedia.org/T275268 [13:40:57] (03CR) 10LSobanski: [C: 03+1] Add ownership annotations for additional Data Persistence services [puppet] - 10https://gerrit.wikimedia.org/r/738265 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [13:41:54] (03CR) 10LSobanski: [C: 03+1] Update approver for gitlab-roots/vrts-roots (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738836 (owner: 10Muehlenhoff) [13:42:16] (03PS1) 10David Caro: ceph::auth::keyring: allow passing the full client name [puppet] - 10https://gerrit.wikimedia.org/r/738903 (https://phabricator.wikimedia.org/T293752) [13:43:10] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [13:43:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:44:36] (03PS1) 10Arturo Borrero Gonzalez: cloud: ceph: migrate admin keyring to new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/738904 (https://phabricator.wikimedia.org/T293752) [13:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [13:46:14] (03CR) 10jerkins-bot: [V: 04-1] cloud: ceph: migrate admin keyring to new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/738904 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:46:46] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[13:46:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:00] (03PS2) 10Arturo Borrero Gonzalez: cloud: ceph: migrate admin keyring to new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/738904 (https://phabricator.wikimedia.org/T293752) [13:48:54] (03CR) 10jerkins-bot: [V: 04-1] cloud: ceph: migrate admin keyring to new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/738904 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:50:38] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:50:52] (03PS1) 10Ayounsi: _get_junos_router_interfaces: ignore VCP interfaces [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/738905 [13:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [13:51:36] !log jgiannelos@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'tegola-vector-tiles' for release 'main' . 
[13:51:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:21] (03CR) 10Arturo Borrero Gonzalez: [V: 04-1] "PCC fails: https://puppet-compiler.wmflabs.org/compiler1001/32404/" [puppet] - 10https://gerrit.wikimedia.org/r/738904 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [13:52:48] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for SCherukuwada - https://phabricator.wikimedia.org/T295550 (10Jelto) [13:53:32] (03CR) 10jerkins-bot: [V: 04-1] Add drmrs switches to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/738372 (https://phabricator.wikimedia.org/T283050) (owner: 10Ayounsi) [13:54:15] (03PS5) 10Ayounsi: Add drmrs switches to Homer [homer/public] - 10https://gerrit.wikimedia.org/r/738372 (https://phabricator.wikimedia.org/T283050) [13:55:06] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for SCherukuwada - https://phabricator.wikimedia.org/T295552 (10Jelto) [13:55:20] !log installing java-atk-wrapper bugfix updates [13:55:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:46] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [13:58:53] (03CR) 10Muehlenhoff: [C: 03+2] Add ownership annotations for additional Data Persistence services [puppet] - 10https://gerrit.wikimedia.org/r/738265 (https://phabricator.wikimedia.org/T216088) (owner: 10Muehlenhoff) [13:59:33] (03PS1) 10David Caro: ceph::auth::keyring: Generate keyring_path if not passed [puppet] - 10https://gerrit.wikimedia.org/r/738908 (https://phabricator.wikimedia.org/T293752) [14:00:37] (03CR) 10Ottomata: Configure stat servers to use /srv/spark-tmp as spark.local.dir (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346) (owner: 10Btullis) [14:04:16] 10SRE, 10Discovery-Search (Current work): Upgrade Cirrus Elasticsearch clusters to Debian Bullseye - https://phabricator.wikimedia.org/T289135 (10MPhamWMF) [14:05:32] (03PS1) 10Ema: varnish: add varnishmtail-wrapper [puppet] - 10https://gerrit.wikimedia.org/r/738910 (https://phabricator.wikimedia.org/T293879) [14:06:03] (03PS3) 10Arturo Borrero Gonzalez: cloud: ceph: migrate admin keyring to new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/738904 (https://phabricator.wikimedia.org/T293752) [14:06:21] (03CR) 10jerkins-bot: [V: 04-1] varnish: add varnishmtail-wrapper [puppet] - 10https://gerrit.wikimedia.org/r/738910 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [14:06:29] (03PS2) 10Jgiannelos: tile-pregeneration: Fix wording about envoy [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/730149 [14:07:23] (03CR) 10Kormat: [C: 03+2] mariadb: Set important db host monitoring to critical. 
[puppet] - 10https://gerrit.wikimedia.org/r/736415 (https://phabricator.wikimedia.org/T233684) (owner: 10Kormat) [14:07:43] (03CR) 10jerkins-bot: [V: 04-1] cloud: ceph: migrate admin keyring to new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/738904 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [14:08:52] RECOVERY - Maps tiles generation on alert1001 is OK: OK: Less than 90.00% under the threshold [10.0] https://wikitech.wikimedia.org/wiki/Maps/Runbook https://grafana.wikimedia.org/dashboard/db/maps-performances?panelId=8&fullscreen&orgId=1 [14:11:52] (03PS2) 10Vgutierrez: cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) [14:12:12] (03CR) 10Ottomata: "Will this affect any prometheus labels? I think it will, but the ones it affects probably won't matter." [deployment-charts] - 10https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [14:12:49] (03PS2) 10Ladsgroup: Disable DPL on Wikiquotes where not in use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734423 (https://phabricator.wikimedia.org/T287916) (owner: 10Legoktm) [14:13:11] jouncebot: nowandnext [14:13:11] No deployments scheduled for the next 2 hour(s) and 16 minute(s) [14:13:11] In 2 hour(s) and 16 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211115T1630) [14:13:19] (03CR) 10Ladsgroup: [C: 03+2] Disable DPL on Wikiquotes where not in use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734423 (https://phabricator.wikimedia.org/T287916) (owner: 10Legoktm) [14:14:04] (03Merged) 10jenkins-bot: Disable DPL on Wikiquotes where not in use [mediawiki-config] - 10https://gerrit.wikimedia.org/r/734423 (https://phabricator.wikimedia.org/T287916) (owner: 10Legoktm) [14:15:34] !log ladsgroup@deploy1002 Synchronized wmf-config/InitialiseSettings.php: Config: 
[[gerrit:734423|Disable DPL on Wikiquotes where not in use (T287916)]] (duration: 00m 56s) [14:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:39] T287916: Disable DPL on wikis that aren't using it - https://phabricator.wikimedia.org/T287916 [14:17:10] (03CR) 10MSantos: [C: 03+2] tile-pregeneration: Fix wording about envoy [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/730149 (owner: 10Jgiannelos) [14:17:16] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:10] (03Merged) 10jenkins-bot: tile-pregeneration: Fix wording about envoy [software/tegola] (wmf/v0.14.x) - 10https://gerrit.wikimedia.org/r/730149 (owner: 10Jgiannelos) [14:21:02] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:21:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:13] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:24:13] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:24:55] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:25:02] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32405/console" [puppet] - 10https://gerrit.wikimedia.org/r/738615 (https://phabricator.wikimedia.org/T295661) (owner: 10Elukey) [14:26:21] (03PS1) 10Majavah: aptrepo: add k8s 1.21 to stretch too [puppet] - 10https://gerrit.wikimedia.org/r/738912 [14:30:27] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:30:55] (03PS1) 10Ayounsi: test_interface_termination_names: add breakout cables support [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/738913 [14:34:37] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for SCherukuwada - https://phabricator.wikimedia.org/T295550 (10Jelto) @thcipriani are you able to approve adding SCherukuwada to the deployment group ? 
[14:36:00] (03PS4) 10David Caro: cloud: ceph: migrate admin keyring to new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/738904 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [14:38:44] jouncebot: nowandnext [14:38:44] No deployments scheduled for the next 1 hour(s) and 51 minute(s) [14:38:44] In 1 hour(s) and 51 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211115T1630) [14:38:51] * urbanecm stages at mwdebug1001 [14:38:55] (03CR) 10David Caro: [V: 03+1 C: 03+2] "PCC SUCCESS (NOOP 8): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32406/console" [puppet] - 10https://gerrit.wikimedia.org/r/737063 (https://phabricator.wikimedia.org/T293752) (owner: 10David Caro) [14:43:19] (03PS2) 10Urbanecm: GrowthExperiments: Disable link recommendation frontend on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738586 (owner: 10Kosta Harlan) [14:43:24] (03CR) 10Urbanecm: [C: 03+2] GrowthExperiments: Disable link recommendation frontend on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738586 (owner: 10Kosta Harlan) [14:43:50] (03CR) 10David Caro: [V: 03+1] "PCC SUCCESS (DIFF 5 NOOP 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32407/console" [puppet] - 10https://gerrit.wikimedia.org/r/738904 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [14:44:35] (03Merged) 10jenkins-bot: GrowthExperiments: Disable link recommendation frontend on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738586 (owner: 10Kosta Harlan) [14:44:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [14:45:37] !log btullis@cumin1001 END (PASS) - Cookbook 
sre.hadoop.roll-restart-workers (exit_code=0) restart workers for Hadoop analytics cluster: Roll restart of jvm daemons for openjdk upgrade. [14:45:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:46:51] (03PS2) 10Majavah: aptrepo: add k8s 1.21 to stretch too [puppet] - 10https://gerrit.wikimedia.org/r/738912 (https://phabricator.wikimedia.org/T282942) [14:47:35] (03PS1) 10Giuseppe Lavagetto: mediawiki: allow mtail to match all handlers [puppet] - 10https://gerrit.wikimedia.org/r/738918 [14:47:37] (03PS1) 10Giuseppe Lavagetto: mediawiki::php::restarts: support multiple versions of php [puppet] - 10https://gerrit.wikimedia.org/r/738919 [14:49:06] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 4f17e85d4708b52fc98c34b489d7504d5e94523c: GrowthExperiments: Disable link recommendation frontend on dewiki (duration: 00m 56s) [14:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [14:50:53] Amir1: there's a lot of `[2eef8570-1315-414f-bda6-24e8f8450e59] /wiki/Fichier:Vivien_-_heure_mains_jointes_1906.djvu PHP Notice: Undefined index: data`. Dunno if that's expected. [14:51:23] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:51:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:15] (03PS2) 10Giuseppe Lavagetto: mediawiki: allow mtail to match all handlers [puppet] - 10https://gerrit.wikimedia.org/r/738918 [14:52:17] (03PS2) 10Giuseppe Lavagetto: mediawiki::php::restarts: support multiple versions of php [puppet] - 10https://gerrit.wikimedia.org/r/738919 [14:53:02] urbanecm: thanks. 
I look at it [14:53:14] appreciated [14:55:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [14:55:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:58] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:58:14] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [14:58:58] (03PS3) 10Vgutierrez: cache:haproxy: Gather TTFB metrics using mtail [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) [15:00:06] 10SRE, 10ops-eqiad, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase103[123].eqiad.wmnet - https://phabricator.wikimedia.org/T294372 (10hnowlan) [15:00:14] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10hnowlan) [15:00:41] (03CR) 10Vgutierrez: [V: 03+1] "PCC SUCCESS (DIFF 1): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32408/console" [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [15:02:21] (03PS5) 10Elukey: profile::base::certificates: vary trusted_certs on realm [puppet] - 10https://gerrit.wikimedia.org/r/737983 (https://phabricator.wikimedia.org/T291905) [15:02:23] (03PS7) 10Elukey: Move coal, navtiming and statsv to the new CA bundle [puppet] - 10https://gerrit.wikimedia.org/r/737970 (https://phabricator.wikimedia.org/T291905) 
[15:05:42] urbanecm: https://gerrit.wikimedia.org/r/c/mediawiki/core/+/738923 :D [15:07:33] (03PS3) 10Btullis: Configure stat servers to use /srv/spark-tmp as spark.local.dir [puppet] - 10https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346) [15:07:41] (03CR) 10Elukey: [V: 03+1 C: 03+2] Import new ROCm version 4.5 [puppet] - 10https://gerrit.wikimedia.org/r/738615 (https://phabricator.wikimedia.org/T295661) (owner: 10Elukey) [15:08:12] (03CR) 10Btullis: Configure stat servers to use /srv/spark-tmp as spark.local.dir (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346) (owner: 10Btullis) [15:08:43] (03PS1) 10Hnowlan: partmon: add reuse partmon profile for cassandra hosts [puppet] - 10https://gerrit.wikimedia.org/r/738924 (https://phabricator.wikimedia.org/T295375) [15:09:56] 10SRE-Access-Requests, 10WMF-NDA-Requests: Add EJoseph to #wmf-nda - https://phabricator.wikimedia.org/T293326 (10Dzahn) [15:11:17] Amir1: this...looks like it should fix it. Thanks :). Is/was this related to your deployments from earlier today? [15:11:32] yup [15:12:30] (03PS4) 10Hnowlan: cassandra: move cluster:user relation from 1:1 relation to a 1:many [puppet] - 10https://gerrit.wikimedia.org/r/738270 (https://phabricator.wikimedia.org/T235299) [15:12:41] (03CR) 10Hnowlan: cassandra: move cluster:user relation from 1:1 relation to a 1:many (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738270 (https://phabricator.wikimedia.org/T235299) (owner: 10Hnowlan) [15:15:16] !log `reprepro --delete clearvanished` on apt1001 to clean-up thirdparty/amd-rocm38 (buster and stretch) - T295661 [15:15:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:15:20] T295661: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 [15:15:59] Amir1: +2'ed, looks very safe to me :) [15:16:02] cmjohnson1: just realised i missed the shutdown time for db1112, sorry. 
shutting it down now [15:16:03] Thanks [15:16:41] (03PS2) 10Ema: varnish: add varnishmtail-wrapper [puppet] - 10https://gerrit.wikimedia.org/r/738910 (https://phabricator.wikimedia.org/T293879) [15:17:02] jouncebot: nowandnext [15:17:02] No deployments scheduled for the next 1 hour(s) and 12 minute(s) [15:17:02] In 1 hour(s) and 12 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211115T1630) [15:17:10] (03PS2) 10Urbanecm: uzwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738297 (https://phabricator.wikimedia.org/T294245) [15:17:27] (03CR) 10Urbanecm: [C: 03+2] uzwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738297 (https://phabricator.wikimedia.org/T294245) (owner: 10Urbanecm) [15:18:31] 10SRE, 10DBA, 10cloud-services-team (Kanban): db1112 (s3 contribs/rc replica) is down - https://phabricator.wikimedia.org/T294295 (10Kormat) @Cmjohnson: db1112 powered off now. Let me know when it's ready to be put back in service. Cheers. 
[15:18:32] Thanks kormat [15:18:42] !log uzwiki: Create growthexperiments tables (T294245) [15:18:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:18:45] T294245: Activation of the visual editor and Growth features by default on Uzbek Wikipedia - https://phabricator.wikimedia.org/T294245 [15:18:49] (03PS1) 10Elukey: aptrepo: update amd-rocm45 component's suite [puppet] - 10https://gerrit.wikimedia.org/r/738947 (https://phabricator.wikimedia.org/T295661) [15:19:05] (03PS1) 10Ladsgroup: media: Avoid logspam in case of lack of 'data' in metadata [core] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/738932 [15:19:10] (03CR) 10Ladsgroup: [C: 03+2] media: Avoid logspam in case of lack of 'data' in metadata [core] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/738932 (owner: 10Ladsgroup) [15:19:44] ACKNOWLEDGEMENT - MariaDB Replica IO: s3 on db1154 is CRITICAL: CRITICAL slave_io_state Slave_IO_Running: No, Errno: 2003, Errmsg: error reconnecting to master repl@db1112.eqiad.wmnet:3306 - retry-time: 60 maximum-retries: 86400 message: Cant connect to MySQL server on db1112.eqiad.wmnet (111 Connection refused) Kormat db1112 down for hw maintenance https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [15:21:06] (03CR) 10Elukey: [C: 03+2] aptrepo: update amd-rocm45 component's suite [puppet] - 10https://gerrit.wikimedia.org/r/738947 (https://phabricator.wikimedia.org/T295661) (owner: 10Elukey) [15:22:46] (03Merged) 10jenkins-bot: uzwiki: Enable Growth features in dark mode [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738297 (https://phabricator.wikimedia.org/T294245) (owner: 10Urbanecm) [15:23:12] (03CR) 10Ottomata: [C: 03+1] "One nit but +1 otherwise!" 
[puppet] - 10https://gerrit.wikimedia.org/r/738866 (https://phabricator.wikimedia.org/T295346) (owner: 10Btullis) [15:24:29] !log import AMD ROCm 4.5 in thirdparty/amd-rocm45 for buster-wikimedia - T295661 [15:24:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:32] T295661: Upgrade ROCm to 4.5 - https://phabricator.wikimedia.org/T295661 [15:25:25] (03PS2) 10Jbond: Add Typing: And fix other minopr lint issues [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 [15:25:32] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:25:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:36] PROBLEM - Host db1112.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:25:51] (03PS2) 10Ideophagous: arywiki NS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738876 (https://phabricator.wikimedia.org/T291737) [15:26:04] (03CR) 10Jbond: "updated" [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738437 (owner: 10Jbond) [15:26:25] !log mwscript extensions/GrowthExperiments/maintenance/initWikiConfig.php --wiki=uzwiki --phab=T294245 # T294245 [15:26:26] !log urbanecm@deploy1002 Synchronized dblists/growthexperiments.dblist: 734f3b0094799007e38dea1d152f0afeb3134e1b: uzwiki: Enable Growth features in dark mode (T294245; 1/3) (duration: 00m 55s) [15:26:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:26:28] T294245: Activation of the visual editor and Growth features by default on Uzbek Wikipedia - https://phabricator.wikimedia.org/T294245 [15:26:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:08] (03CR) 10jerkins-bot: [V: 04-1] Add Typing: And fix other minopr lint issues [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/738452 (owner: 10Jbond) [15:28:15] kormat: DIMM replacement is finished, you can have it back now, thanks 
[15:28:18] !log urbanecm@deploy1002 Synchronized wmf-config/config/uzwiki.yaml: 734f3b0094799007e38dea1d152f0afeb3134e1b: uzwiki: Enable Growth features in dark mode (T294245; 2/3) (duration: 00m 55s) [15:28:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:32] cmjohnson1: fantastic, thank you! [15:29:13] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 734f3b0094799007e38dea1d152f0afeb3134e1b: uzwiki: Enable Growth features in dark mode (T294245; 3/3) (duration: 00m 55s) [15:29:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:21] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:18] 10SRE, 10DBA, 10cloud-services-team (Kanban): db1112 (s3 contribs/rc replica) is down - https://phabricator.wikimedia.org/T294295 (10Cmjohnson) 05Open→03Resolved a:03Cmjohnson DIMM replaced, cleared the log, all yours [15:30:36] 10SRE, 10ops-eqiad, 10DBA, 10DC-Ops: db1112 - DIMM replacement - https://phabricator.wikimedia.org/T294345 (10Cmjohnson) 05Open→03Resolved DIMM replaced [15:30:40] 10SRE, 10DBA, 10cloud-services-team (Kanban): db1112 (s3 contribs/rc replica) is down - https://phabricator.wikimedia.org/T294295 (10Cmjohnson) [15:31:32] RECOVERY - SSH on contint1001.mgmt is OK: SSH OK - OpenSSH_6.6 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:31:44] RECOVERY - Host db1112.mgmt is UP: PING OK - Packet loss = 0%, RTA = 0.88 ms [15:37:54] (03CR) 10Jbond: [C: 03+1] "I think this is good as it is. 
currently trusted_certs is set to an empty array by default on wmcs meaning that nothing will happen in cl" [puppet] - 10https://gerrit.wikimedia.org/r/737983 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [15:37:58] PROBLEM - k8s API server requests latencies on kubemaster2001 is CRITICAL: instance=10.192.0.56 verb=LIST https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:38:59] (03PS3) 10Ema: varnish: add varnishmtail-wrapper [puppet] - 10https://gerrit.wikimedia.org/r/738910 (https://phabricator.wikimedia.org/T293879) [15:39:01] (03PS1) 10Ema: varnish: move internal mtail scripts to another instance [puppet] - 10https://gerrit.wikimedia.org/r/738949 (https://phabricator.wikimedia.org/T293879) [15:40:04] RECOVERY - k8s API server requests latencies on kubemaster2001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [15:41:15] majavah: o/ I'd merge https://gerrit.wikimedia.org/r/c/operations/puppet/+/737983, it should in theory do the right thing in deployment-prep, lemme know if you have concerns [15:41:24] * majavah looks [15:42:09] it basically creates a single .pem file in localcerts, using the deployment-prep's puppet CA and the root PKI for cloud [15:42:25] (it currently creates the .pem but with the production CA certs) [15:43:25] looks fine from a glance [15:43:26] thanks! [15:43:40] thank you for all the time that you put in deployment-prep! 
[15:44:17] (03CR) 10Elukey: [V: 03+1] "PCC SUCCESS (DIFF 2): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32410/console" [puppet] - 10https://gerrit.wikimedia.org/r/737983 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [15:44:20] (03Merged) 10jenkins-bot: media: Avoid logspam in case of lack of 'data' in metadata [core] (wmf/1.38.0-wmf.7) - 10https://gerrit.wikimedia.org/r/738932 (owner: 10Ladsgroup) [15:44:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [15:45:15] 10SRE, 10DBA, 10cloud-services-team (Kanban): db1112 (s3 contribs/rc replica) is down - https://phabricator.wikimedia.org/T294295 (10Kormat) 05Resolved→03Open (Reopening for us) db1112 is back up and in service. Let's leave it a day or two before we repool it though. [15:45:38] PROBLEM - SSH on kubernetes1003.mgmt is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [15:45:46] 10SRE, 10DBA, 10cloud-services-team (Kanban): db1112 (s3 contribs/rc replica) is down - https://phabricator.wikimedia.org/T294295 (10Kormat) p:05High→03Medium a:05Cmjohnson→03None [15:46:21] !log ladsgroup@deploy1002 Synchronized php-1.38.0-wmf.7/includes/media/DjVuHandler.php: Backport: [[gerrit:738932|media: Avoid logspam in case of lack of 'data' in metadata]] (duration: 00m 55s) [15:46:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:42] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[15:49:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:52] (03PS1) 10Muehlenhoff: Update MAC for rpki1001 [puppet] - 10https://gerrit.wikimedia.org/r/738950 [15:49:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [15:50:20] (03PS2) 10Muehlenhoff: Update MAC for rpki1001 [puppet] - 10https://gerrit.wikimedia.org/r/738950 [15:53:13] (03CR) 10Elukey: [V: 03+1 C: 03+2] profile::base::certificates: vary trusted_certs on realm [puppet] - 10https://gerrit.wikimedia.org/r/737983 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [15:53:23] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [15:53:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:54:49] (03CR) 10Muehlenhoff: [C: 03+2] Update MAC for rpki1001 [puppet] - 10https://gerrit.wikimedia.org/r/738950 (owner: 10Muehlenhoff) [16:00:58] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10DAbad) Public Key: AAAAC3NzaC1lZDI1NTE5AAAAIEMCL89wONrqDKRSFKETmGNyQ5OCPlZWjDpYODpBXOMg [16:01:52] !log cmjohnson@cumin1001 START - Cookbook sre.dns.netbox [16:01:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:38] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [16:06:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:06:58] I managed to break deployment-prep's puppet, working on a fix :) [16:10:59] (03CR) 10Vgutierrez: [C: 03+1] "besides the nitpick, LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/738910 (https://phabricator.wikimedia.org/T293879) (owner: 10Ema) [16:14:02] (03PS1) 10MMandere: 
install_server: Add instance hardware category [puppet] - 10https://gerrit.wikimedia.org/r/738957 (https://phabricator.wikimedia.org/T282787) [16:14:23] (03PS4) 10Ema: varnish: add varnishmtail-wrapper [puppet] - 10https://gerrit.wikimedia.org/r/738910 (https://phabricator.wikimedia.org/T293879) [16:14:25] (03PS2) 10Ema: varnish: move internal mtail scripts to another instance [puppet] - 10https://gerrit.wikimedia.org/r/738949 (https://phabricator.wikimedia.org/T293879) [16:15:35] 10SRE, 10Wikimedia-Mailing-lists, 10Upstream: Add OAuth login to mailman for accessing list memberships/archive viewing - https://phabricator.wikimedia.org/T249678 (10Tgr) The appstream patch was merged, will be presumably released with 0.45 (maybe some time around the end of the year, based on their average... [16:21:25] (03CR) 10BBlack: "Looking good! Only thing I'd amend, is replace all the per-host hieradata files with a single one for the whole site at hieradata/drmrs/pr" [puppet] - 10https://gerrit.wikimedia.org/r/738957 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [16:22:21] (03CR) 10Ema: "Other than the inline comment, I wonder what happens to haproxymtail if the mtail instance is "too slow" at reading the logs, similarly to" [puppet] - 10https://gerrit.wikimedia.org/r/738422 (https://phabricator.wikimedia.org/T290005) (owner: 10Vgutierrez) [16:26:16] (03CR) 10MMandere: install_server: Add instance hardware category (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738957 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [16:28:24] (03PS1) 10Elukey: Update deployment-prep's profile::base::certificates settings [puppet] - 10https://gerrit.wikimedia.org/r/738958 (https://phabricator.wikimedia.org/T291905) [16:30:04] jan_drewniak: I, the Bot under the Fountain, call upon thee, The Deployer, to do Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211115T1630). 
[16:30:26] (03CR) 10Hnowlan: [V: 03+2 C: 03+2] cassandra: add stub values for new credentials format [labs/private] - 10https://gerrit.wikimedia.org/r/738272 (https://phabricator.wikimedia.org/T235299) (owner: 10Hnowlan) [16:31:13] (03CR) 10Elukey: [C: 03+2] Update deployment-prep's profile::base::certificates settings [puppet] - 10https://gerrit.wikimedia.org/r/738958 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [16:31:23] jan_drewniak: If it won't get in your way, I'm going to temporarily update the config on mwdebug1001 to investigate an issue with QuickSurveys that urbanecm noticed [16:31:48] (03PS1) 10Cwhite: profile: drop successful access logs for shellbox-constraints [puppet] - 10https://gerrit.wikimedia.org/r/738959 (https://phabricator.wikimedia.org/T295627) [16:31:50] * urbanecm waves to phuedx [16:32:08] Hey, urbanecm *waves* [16:33:09] 10SRE, 10Patch-For-Review: Handle sunset of stretch-backports - https://phabricator.wikimedia.org/T256877 (10aborrero) [16:34:18] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32411/console" [puppet] - 10https://gerrit.wikimedia.org/r/738270 (https://phabricator.wikimedia.org/T235299) (owner: 10Hnowlan) [16:34:41] !log [urbanecm@mwmaint1002 ~]$ mwscript extensions/CentralAuth/maintenance/createLocalAccount.php --wiki=enwiki 'MU test T244635 1' [16:34:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:34:45] T244635: New users losing login session when editing from a globally blocked IP - https://phabricator.wikimedia.org/T244635 [16:34:46] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-extras] - 10https://gerrit.wikimedia.org/r/738913 (owner: 10Ayounsi) [16:34:57] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.reimage for host an-test-coord1002.eqiad.wmnet with OS bullseye [16:35:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:05] 10SRE, 
10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS bullseye [16:35:33] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-test-coord1002.eqiad.wmnet with OS bullseye [16:35:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:35:40] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by cmjohnson@cumin1001 for host an-test-coord1002.eqiad.wmnet with OS bullseye executed wit... [16:35:55] (03CR) 10Volans: [C: 03+1] "I'm missing the context by commit message and code are coherent with each other :)" [software/homer/deploy] - 10https://gerrit.wikimedia.org/r/738905 (owner: 10Ayounsi) [16:37:56] (03PS2) 10Cwhite: profile: drop successful access logs for shellbox-constraints [puppet] - 10https://gerrit.wikimedia.org/r/738959 (https://phabricator.wikimedia.org/T295627) [16:38:21] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloud: ceph: migrate admin keyring to new abstraction [puppet] - 10https://gerrit.wikimedia.org/r/738904 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [16:40:28] (03CR) 10Cwhite: "PCC checks out: https://puppet-compiler.wmflabs.org/compiler1003/32413/" [puppet] - 10https://gerrit.wikimedia.org/r/738959 (https://phabricator.wikimedia.org/T295627) (owner: 10Cwhite) [16:41:05] (03CR) 10Legoktm: [C: 03+1] profile: drop successful access logs for shellbox-constraints [puppet] - 10https://gerrit.wikimedia.org/r/738959 (https://phabricator.wikimedia.org/T295627) (owner: 10Cwhite) [16:41:39] (03PS8) 10Elukey: Move coal, navtiming and statsv to the new CA bundle [puppet] 
- 10https://gerrit.wikimedia.org/r/737970 (https://phabricator.wikimedia.org/T291905) [16:43:16] (03CR) 10Elukey: "Dave if you want to cherry pick the change again it should work now :)" [puppet] - 10https://gerrit.wikimedia.org/r/737970 (https://phabricator.wikimedia.org/T291905) (owner: 10Elukey) [16:43:50] (03PS5) 10Hnowlan: cassandra: move cluster:user relation from 1:1 relation to a 1:many [puppet] - 10https://gerrit.wikimedia.org/r/738270 (https://phabricator.wikimedia.org/T235299) [16:44:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [16:46:38] RECOVERY - SSH on kubernetes1003.mgmt is OK: SSH OK - OpenSSH_7.0 (protocol 2.0) https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook [16:46:45] (03CR) 10Hnowlan: [V: 03+1] "PCC SUCCESS (DIFF 3): https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/32414/console" [puppet] - 10https://gerrit.wikimedia.org/r/738270 (https://phabricator.wikimedia.org/T235299) (owner: 10Hnowlan) [16:46:47] (03PS1) 10Arturo Borrero Gonzalez: cloud: ceph: auth: enable admin client [puppet] - 10https://gerrit.wikimedia.org/r/738963 (https://phabricator.wikimedia.org/T293752) [16:49:30] 10SRE, 10ops-eqiad, 10Analytics-Clusters, 10DC-Ops: (Need By: TBD) rack/setup/install an-test-coord1002 - https://phabricator.wikimedia.org/T293938 (10Cmjohnson) updated firmware, updated dns in netbox. Running into errors with the install script. 
[16:49:52] 10SRE-OnFire: 2021-10-25 s3 db recentchanges replica - https://phabricator.wikimedia.org/T295154 (10Kormat) [16:49:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [16:49:56] 10SRE, 10ops-eqiad, 10DBA, 10Wikimedia-Incident: db1075 (s3 master) crashed - BBU failure - https://phabricator.wikimedia.org/T233534 (10Kormat) [16:52:38] (03PS1) 10Arturo Borrero Gonzalez: hieradata: ceph: auth: add dummy keydata for the admin client [labs/private] - 10https://gerrit.wikimedia.org/r/738964 (https://phabricator.wikimedia.org/T293752) [16:54:16] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hieradata: ceph: auth: add dummy keydata for the admin client [labs/private] - 10https://gerrit.wikimedia.org/r/738964 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [16:55:25] urbanecm: Got it :) [16:55:30] Patch inbound [16:55:54] I'm scap pulling on mwdebug1001 to reset its state [16:56:18] Done [16:57:17] (03CR) 10AOkoth: [C: 03+2] gitlab: accept backup file argument [puppet] - 10https://gerrit.wikimedia.org/r/737064 (https://phabricator.wikimedia.org/T274463) (owner: 10AOkoth) [17:08:39] (03PS1) 10Arturo Borrero Gonzalez: hieradata: ceph: auth: add dummy keydata for the admin client [labs/private] - 10https://gerrit.wikimedia.org/r/738969 (https://phabricator.wikimedia.org/T293752) [17:08:57] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] hieradata: ceph: auth: add dummy keydata for the admin client [labs/private] - 10https://gerrit.wikimedia.org/r/738969 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [17:16:38] (03PS2) 10Arturo Borrero Gonzalez: cloud: ceph: auth: enable admin client [puppet] - 10https://gerrit.wikimedia.org/r/738963 (https://phabricator.wikimedia.org/T293752) [17:22:50] 
(03CR) 10Dzahn: "yea, this would only affect frack, I did this many years ago. let me merge it after the SRE meeting is over" [puppet] - 10https://gerrit.wikimedia.org/r/738458 (https://phabricator.wikimedia.org/T295383) (owner: 10Jgreen) [17:30:14] (03PS1) 10Phuedx: Growth IP research survey: Fix platforms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738974 (https://phabricator.wikimedia.org/T294568) [17:35:44] PROBLEM - Host cp2032 is DOWN: PING CRITICAL - Packet loss = 100% [17:38:16] RECOVERY - Host cp2032 is UP: PING OK - Packet loss = 0%, RTA = 31.52 ms [17:38:59] ema, vgutierrez: this actually rebooted ^^^ not sure if intentional [17:39:11] not at all AFAIK [17:39:22] neither according to last ;) [17:39:37] you might want to check SEL logs [17:41:22] SEL looks clean [17:45:19] anything in getsel? [17:45:39] ah sorry I misread SEL with SAL [17:45:43] will shut up :P [17:45:51] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [17:47:23] cwhite: o/ it seems that the logstash kafka consumers are lagging a lot every hour at the same time (around :44), really weird. Is there anything ongoing? [17:49:01] it seems mostly rsyslog-notice, but also others [17:49:20] I tried to check today on the hosts but didn't find much [17:49:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [17:50:01] elukey: something from k8s dumps a bunch of logs on the queue really fast. We also noticed we're producing 2x the number of logs overall as compared to 90 days ago. 
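Annotation: the "every hour at the same time (around :44)" observation can be checked directly against the alert lines in this log. The following is a minimal sketch, not anything the channel participants ran. The function names `firing_minutes` and `looks_hourly` and the two-minute tolerance are assumptions made for illustration; the sample timestamps are copied from the LogstashKafkaConsumerLag lines above.

```python
import re

# Timestamps copied from alert lines earlier in this log; message text shortened.
SAMPLE_LINES = [
    "[00:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging",
    "[00:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging",
    "[01:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging",
    "[16:44:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging",
    "[17:45:51] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging",
]

TIMESTAMP = re.compile(r"^\[(\d{2}):(\d{2}):(\d{2})\]")


def firing_minutes(lines, alert_name):
    """Return the minute-of-hour of every 'firing' line for the given alert."""
    minutes = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match and f"({alert_name}) firing" in line:
            minutes.append(int(match.group(2)))
    return minutes


def looks_hourly(minutes, tolerance=2):
    """True if all firings fall within `tolerance` minutes of each other.

    Hypothetical heuristic: it ignores wrap-around at the hour boundary
    (minute 59 vs minute 0), which is fine for a cluster around :44 to :45.
    """
    return len(minutes) >= 2 and max(minutes) - min(minutes) <= tolerance


if __name__ == "__main__":
    minutes = firing_minutes(SAMPLE_LINES, "LogstashKafkaConsumerLag")
    print(minutes, looks_hourly(minutes))
```

A tight cluster of minutes like this only suggests a scheduled producer; it does not say which service is responsible.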
[17:50:21] ouch [17:50:28] ack thanks, lemme know if I can help [17:51:33] (03PS3) 10Arturo Borrero Gonzalez: cloud: ceph: auth: enable admin client [puppet] - 10https://gerrit.wikimedia.org/r/738963 (https://phabricator.wikimedia.org/T293752) [17:51:36] (03PS1) 10MMandere: install_server: Add instance hardware category [puppet] - 10https://gerrit.wikimedia.org/r/738975 (https://phabricator.wikimedia.org/T282787) [17:52:22] (03PS3) 10Majavah: aptrepo: add k8s 1.21 to stretch too [puppet] - 10https://gerrit.wikimedia.org/r/738912 (https://phabricator.wikimedia.org/T282942) [17:55:54] (03PS4) 10Arturo Borrero Gonzalez: cloud: ceph: auth: enable admin client [puppet] - 10https://gerrit.wikimedia.org/r/738963 (https://phabricator.wikimedia.org/T293752) [17:55:56] (03PS1) 10Urbanecm: foundationwiki: Revoke 'edit' from '*' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738976 (https://phabricator.wikimedia.org/T294900) [17:56:27] (03CR) 10Urbanecm: [C: 03+2] foundationwiki: Revoke 'edit' from '*' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738976 (https://phabricator.wikimedia.org/T294900) (owner: 10Urbanecm) [17:56:34] Log spikes seem to be from production-ratelimit and api-gateway-production. [17:57:15] (03Merged) 10jenkins-bot: foundationwiki: Revoke 'edit' from '*' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738976 (https://phabricator.wikimedia.org/T294900) (owner: 10Urbanecm) [18:00:04] ryankemper: How many deployers does it take to do Wikidata Query Service weekly deploy deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211115T1800). 
[18:00:46] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 00650753d77d7a526b6751669bf3548cf81fb02a: foundationwiki: Revoke edit from * (T294900) (duration: 00m 56s) [18:00:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:50] T294900: Further setting modifications for Governance Wiki - https://phabricator.wikimedia.org/T294900 [18:04:20] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:04:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:30] !log upgrading gitlab version on gitlab2001 (T294580) [18:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:10] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:08:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:38] (03PS1) 10Ebernhardson: Add repository-swift plugin [software/elasticsearch/plugins] - 10https://gerrit.wikimedia.org/r/738979 (https://phabricator.wikimedia.org/T295705) [18:10:37] (03PS4) 10JMeybohm: charts:eventgate bump common_templates and standardize labels [deployment-charts] - 10https://gerrit.wikimedia.org/r/738578 (https://phabricator.wikimedia.org/T292390) (owner: 10Jelto) [18:10:39] (03PS5) 10Arturo Borrero Gonzalez: cloud: ceph: auth: enable admin client [puppet] - 10https://gerrit.wikimedia.org/r/738963 (https://phabricator.wikimedia.org/T293752) [18:10:50] (03PS2) 10JMeybohm: Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 [18:10:52] (03PS1) 10JMeybohm: Fix helm3 lint errors and helm dependencies [deployment-charts] - 10https://gerrit.wikimedia.org/r/738980 [18:14:03] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1001/32423/" [puppet] - 
10https://gerrit.wikimedia.org/r/738963 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [18:14:19] (03CR) 10jerkins-bot: [V: 04-1] Run helmfile commands against the local version of the chart [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm) [18:16:47] (03CR) 10JMeybohm: "For toolforge this needs an update to helm-linter docker image" [deployment-charts] - 10https://gerrit.wikimedia.org/r/738857 (owner: 10JMeybohm) [18:19:01] (03PS1) 10Urbanecm: foundationwiki: Restrict editing in more namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738981 (https://phabricator.wikimedia.org/T294900) [18:19:11] (03CR) 10Urbanecm: [C: 03+2] foundationwiki: Restrict editing in more namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738981 (https://phabricator.wikimedia.org/T294900) (owner: 10Urbanecm) [18:19:29] (03CR) 10Andrew Bogott: "*bump* Let me know if you need help getting unstuck with this!" [puppet] - 10https://gerrit.wikimedia.org/r/682259 (https://phabricator.wikimedia.org/T127717) (owner: 10Southparkfan) [18:19:51] !log aokoth@cumin1001 START - Cookbook sre.hosts.downtime for 1:00:00 on gitlab2001.wikimedia.org with reason: upgrade gitlab2001 to new version https://phabricator.wikimedia.org/T294580 [18:19:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:52] !log aokoth@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 1:00:00 on gitlab2001.wikimedia.org with reason: upgrade gitlab2001 to new version https://phabricator.wikimedia.org/T294580 [18:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:05] (03Merged) 10jenkins-bot: foundationwiki: Restrict editing in more namespaces [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738981 (https://phabricator.wikimedia.org/T294900) (owner: 10Urbanecm) [18:21:28] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 
d15948e6da61af2d1db271cb0c9d8bd9a5395d75: foundationwiki: Restrict editing in more namespaces (T294900) (duration: 00m 56s) [18:21:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:31] T294900: Further setting modifications for Governance Wiki - https://phabricator.wikimedia.org/T294900 [18:21:43] (03PS1) 10Arturo Borrero Gonzalez: cloud: ceph: reorder osd/auth profile declaration [puppet] - 10https://gerrit.wikimedia.org/r/738982 (https://phabricator.wikimedia.org/T293752) [18:23:29] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:23:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:23:47] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "PCC: https://puppet-compiler.wmflabs.org/compiler1003/32424/" [puppet] - 10https://gerrit.wikimedia.org/r/738982 (https://phabricator.wikimedia.org/T293752) (owner: 10Arturo Borrero Gonzalez) [18:26:06] (03CR) 10BBlack: [C: 03+1] "LGTM! Thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/738975 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [18:26:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={gitlab,sidekiq} site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:27:14] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:27:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:27:19] (03PS2) 10Urbanecm: Growth IP research survey: Fix platforms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738974 (https://phabricator.wikimedia.org/T294568) (owner: 10Phuedx) [18:27:27] (03CR) 10Urbanecm: [C: 03+2] "let's test this at a debug srv!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738974 (https://phabricator.wikimedia.org/T294568) (owner: 10Phuedx) [18:27:40] PROBLEM - etcd request latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 operation={delete,list} https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [18:27:50] (03CR) 10MMandere: [C: 03+2] install_server: Add instance hardware category [puppet] - 10https://gerrit.wikimedia.org/r/738975 (https://phabricator.wikimedia.org/T282787) (owner: 10MMandere) [18:28:28] PROBLEM - k8s API server requests latencies on kubemaster1002 is CRITICAL: instance=10.64.32.116 verb={DELETE,LIST} https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:28:32] (03Merged) 10jenkins-bot: Growth IP research survey: Fix platforms [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738974 (https://phabricator.wikimedia.org/T294568) (owner: 10Phuedx) [18:30:06] RECOVERY - etcd request latencies on kubemaster1002 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Etcd/Main_cluster https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=28 [18:30:54] RECOVERY - k8s API server requests latencies on kubemaster1002 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api?viewPanel=27 [18:31:19] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 1be2d3941530bbed54632dafb0b804d0ddf41299: Growth IP research survey: Fix platforms (T294568) (duration: 00m 55s) [18:31:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:23] T294568: deploy quicksurvey for editors on eswiki and arwiki (for Growth IP editors research) - https://phabricator.wikimedia.org/T294568 [18:31:31] (03CR) 10Dzahn: [V: 03+1 C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/32425/alert1001.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/738458 (https://phabricator.wikimedia.org/T295383) (owner: 10Jgreen) [18:32:53] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6001.drmrs.wmnet with OS buster [18:32:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:33:03] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster [18:35:21] jbond: rpki1001 is brandnew, is that right? [18:35:47] and rpki2001 [18:37:17] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=routinator site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:37:34] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[18:37:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:01] mutante: i dont think, so, however XioNoX/moritzm where working on rpki who should know [18:38:18] jbond: ACK, thanks [18:38:19] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:38:38] jbond, mutante: Cathal was [18:38:39] they were just added to icinga when I ran puppet there and have pending checks [18:38:46] ok, *nod* thanks [18:38:54] mutante: possibly rebuilt today [18:39:08] topranks: ^^ [18:39:24] hey [18:39:38] topranks: from 18:35 :) [18:39:38] yes it is only up now a short time [18:39:56] topranks: all is good, just letting you know they were literally just added to config now [18:40:13] I was checking because I made an unrelated change to Icinga and this happened when I ran puppet [18:40:26] ah ok. yeah it's just gone in. [18:40:38] if you dont want it to alert this would be the time where you can downtime the checks [18:40:42] while they are in "pending" [18:41:11] it seems to be working, but I'm a little confused, puppetboard shows it as "failed" [18:41:23] or you could say "but, I _do_ want to see all that alert and then recover, it's the best test" :) [18:41:25] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [18:41:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:32] topranks: reload puppetboard now and did rpki1001 disappear? [18:42:38] from the failed section [18:43:11] if so.. 
it just needed a second run and you can manually run puppet on rpki2001 as well [18:43:22] it applied some more admin group changes and restarted nagios-nrpe [18:43:31] it's still on the list, but status on the 'nodes' page is now "changed" [18:43:36] whereas a few mins ago it was "failed" [18:43:41] that was the error: [18:43:47] https://www.irccloud.com/pastebin/GGQyncjN/ [18:43:59] yeah I was trying to dig into it. [18:44:10] initial puppet run had an issue, and the following run went well [18:44:22] sometimes there are chicken-and-egg issues [18:44:25] ok I was wondering if that happened. [18:44:51] systemd-timesyncd was fine, I think it was trying to remove a lock file or something. [18:45:11] somewhere it wants to create /usr/lib/nagios/plugins/check_timedatectl but at this point the nagios-plugins package is not necessarily installed yet [18:45:28] so /usr/lib/nagios/plugins did not exist yet and that failed [18:45:42] ok thanks, that makes a lot of sense. [18:45:45] I guess it's a matter of luck to hit that or not [18:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [18:46:03] it could have an ensure_resource or so for the entire path [18:46:29] or could require the package is installed first with require => [18:46:45] ok - to control the flow so it won't try that until after the nagios-plugins package is added? [18:47:11] these kinds of races are somewhat common and can be reasons the very first puppet run fails but everything is ok after the second puppet run [18:47:20] and from there it kind of doesn't matter [18:47:44] but it sucks if you have a role already applied and then want to use a cookbook to reimage it without first going back to "insetup" [18:47:59] ok yeah so it's not crazy important to get it 100% fixed. 
[18:48:03] but yeah annoying [18:48:11] and that expects it to work on the first run without manually running it a second time [18:48:43] yea, that sums it up [18:48:46] topranks: for this specific check i would install the plugin to /usr/local/lib/nagios/plugins. this directory is created with a file resource which will be auto required by any files that exist in it [18:49:32] but in general my view is if things converge in two runs don't worry about it too much (but definitely nice to fix things as we see them) [18:49:37] but we can also fix it with: wmflib::dir::mkdir_p("....") [18:49:47] to make sure that the path exists with or without the package [18:50:25] we could also wmflib::dir::mkdir_p('/usr/lib/nagios/plugins') somewhere, however arguably /usr/local/lib/nagios/plugins is the more correct location for user provided checks [18:50:32] ack [18:50:36] mkdir_p creates the entire path? i.e. nested directories? [18:50:40] yes [18:50:45] that's why it was made [18:50:53] cool [18:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [18:50:55] to avoid having to list each part of the full path [18:52:18] jbond: or use require => Package['nagios-plugins'], on the file resource ? 
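Annotation: the first-run failure discussed above is an ordering race. The check file can be written before anything has created /usr/lib/nagios/plugins, and a second run succeeds because the directory exists by then. The following is a toy Python model of that behaviour, not Puppet code and not how Puppet schedules resources internally. The helpers `install_package`, `write_check`, `apply_in_order` and `run_twice` are hypothetical names; only the paths come from the conversation. It works inside a temporary directory, so nothing on the real filesystem is touched.

```python
import os
import tempfile

PLUGIN_DIR = "usr/lib/nagios/plugins"


def install_package(root):
    # Stand-in for the nagios-plugins package creating its plugin directory.
    os.makedirs(os.path.join(root, PLUGIN_DIR), exist_ok=True)


def write_check(root):
    # Stand-in for the file resource; open() raises FileNotFoundError
    # (an OSError) when the parent directory does not exist yet.
    path = os.path.join(root, PLUGIN_DIR, "check_timedatectl")
    with open(path, "w") as handle:
        handle.write("#!/bin/sh\n")


def apply_in_order(steps, root):
    """Apply every step in list order and return the names of failed steps."""
    failed = []
    for name, action in steps:
        try:
            action(root)
        except OSError:
            failed.append(name)
    return failed


def run_twice(steps):
    """Model two consecutive runs against the same fresh host."""
    with tempfile.TemporaryDirectory() as root:
        first = apply_in_order(steps, root)
        second = apply_in_order(steps, root)
    return first, second


# Without an explicit dependency the file may be applied before the package.
UNORDERED = [("check", write_check), ("package", install_package)]
# With the dependency declared, the package is always applied first.
ORDERED = [("package", install_package), ("check", write_check)]

if __name__ == "__main__":
    print(run_twice(UNORDERED))
    print(run_twice(ORDERED))
```

In this model the unordered case fails only on its first run and converges on the second, which matches the "converges in two runs" behaviour described in the channel. Declaring the dependency, or creating the directory path independently of the package, removes the first-run failure.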
[18:52:19] !log volans@cumin2002 START - Cookbook sre.hosts.downtime for 40 days, 0:00:00 on ps1-d1-codfw with reason: Testing new PDU devices T265435 [18:52:21] !log volans@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 40 days, 0:00:00 on ps1-d1-codfw with reason: Testing new PDU devices T265435 [18:52:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:23] T265435: codfw: Testing Out Sample PDUs - https://phabricator.wikimedia.org/T265435 [18:52:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:43] mutante: adding the requrie would also fix this. however from a LFH PoV putting the file in /usr/local is still more correct [18:54:00] LFH ?? [18:54:25] s/LFH/FHS/ https://en.wikipedia.org/wiki/Filesystem_Hierarchy_Standard [18:54:48] thanks :) [18:54:51] true, local modification [18:58:30] !log mmandere@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp6001.drmrs.wmnet with OS buster [18:58:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:58:39] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster executed with errors: - cp6001... [18:58:59] (03CR) 10Dzahn: "@Jeff, deployed on alert1001, confirmed this only affects frack hosts with passive checks via NSCA. the config file was edited by puppet. " [puppet] - 10https://gerrit.wikimedia.org/r/738458 (https://phabricator.wikimedia.org/T295383) (owner: 10Jgreen) [18:59:44] topranks: mutante: FTR this is me being super picky re FHS but if we are fixing it ... :) [19:00:04] RoanKattouw and Urbanecm: Your horoscope predicts another unfortunate UTC evening backport window deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211115T1900). [19:00:04] Juan_90264 and nn1l2: A patch you scheduled for UTC evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [19:00:08] no absolutely, "while your messing about down there" kind of thing. [19:00:18] exactly :) [19:00:57] nn1l2 [19:01:27] Hello [19:01:35] Urbanecm: ? [19:02:00] If you wait 20 mins :) [19:02:26] Urbanecm: Why? [19:02:34] (03CR) 10Dzahn: "I am not sure if the deploy user should really be in a group with www-data. At least not sure enough to merge this without wider review." [puppet] - 10https://gerrit.wikimedia.org/r/738461 (https://phabricator.wikimedia.org/T295304) (owner: 10Ahmon Dancy) [19:02:36] Because I'm a tram :) [19:02:56] It might come as a surprise, but I'm a volunteer, just as you are :)) [19:03:07] *I'm in a tram [19:03:30] Urbanecm: So okay [19:03:47] I thought tram was a new word for a train with fewer changes or something [19:04:16] I meant https://en.m.wikipedia.org/wiki/Tram :) [19:05:04] ack, A streetcar named deployer [19:05:21] hehehe [19:05:33] (03CR) 10Dzahn: [C: 03+2] mediawiki: remove font packages from API appservers [puppet] - 10https://gerrit.wikimedia.org/r/738031 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [19:06:35] !log removing font packages from MW API appservers T294378 [19:06:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:06:39] T294378: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 [19:07:56] jouncebot: nowandnext [19:07:57] For the next 0 hour(s) and 52 minute(s): UTC evening backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211115T1900) [19:07:57] In 1 hour(s) and 52 minute(s): Services – Graphoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211115T2100) 
[19:11:15] (03CR) 10Ahmon Dancy: mediawiki: Ensure mwdeploy user is a member of the www-data group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738461 (https://phabricator.wikimedia.org/T295304) (owner: 10Ahmon Dancy) [19:12:41] https://bash.toolforge.org/quip/77QDJX0B1jz_IcWu9ePD [19:13:31] (03PS3) 10Ahmon Dancy: mediawiki: Ensure mwdeploy user is a member of the www-data group [puppet] - 10https://gerrit.wikimedia.org/r/738461 (https://phabricator.wikimedia.org/T295304) [19:19:31] hi all - is there a backport window underway atm? [19:21:39] Yes, but you should wait as Urbancem was in a tram [19:22:11] gtk - thanks [19:22:13] *urbanecm [19:23:20] (03CR) 10Dzahn: "long time ago I once put this snippet from Tim Starling here, the "permission/security hierarchy" section at the bottom: https://wikitech." [puppet] - 10https://gerrit.wikimedia.org/r/738461 (https://phabricator.wikimedia.org/T295304) (owner: 10Ahmon Dancy) [19:23:41] hey everyone [19:23:53] cjming: if you wish to practice deployment, feel free to lead the window again! [19:24:04] * urbanecm reached his laptop now [19:24:38] (fwiw: currently puppet is undeploying mediawiki font packages from API servers) [19:24:53] mutante: does that mean I/cjming should wait with deployment? [19:25:33] Hello again Urbanecm [19:25:46] urbanecm: no, it does not, you can still go ahead as you like [19:26:01] excellent [19:26:05] just double checking :)) [19:26:17] Juan_90264: Annir: I can deploy today! [19:26:31] (Annir I assume you're 4nn1l2 -- the votewiki patch?) [19:26:40] yes [19:26:47] urbanec: please go ahead - i was just checking on the status of patches in the queue [19:26:50] (03PS2) 10Urbanecm: Change votewiki language back to English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738222 (https://phabricator.wikimedia.org/T292685) (owner: 104nn1l2) [19:26:53] Urbanecm: Perfect, let's start? 
[19:26:54] (03CR) 10Urbanecm: [C: 03+2] Change votewiki language back to English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738222 (https://phabricator.wikimedia.org/T292685) (owner: 104nn1l2) [19:27:03] (03PS6) 10Juan90264: Enable talk for mobile users on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732705 (https://phabricator.wikimedia.org/T293946) [19:27:05] cjming: okay, doing :) [19:27:07] I am just mentioning it in case you see some crazy "omg, fonts not found" errors in the output of some extension.. but if there was we expect we would have heard by now [19:27:16] via canaries [19:27:44] (03Merged) 10jenkins-bot: Change votewiki language back to English [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738222 (https://phabricator.wikimedia.org/T292685) (owner: 104nn1l2) [19:27:51] mutante: gotit [19:28:20] Annir: your patch is available at mwdebug1001, can you have a look please? [19:28:45] confirmed [19:28:52] that was quikc [19:28:53] *quick [19:29:54] (03PS7) 10Urbanecm: Enable talk for mobile users on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732705 (https://phabricator.wikimedia.org/T293946) (owner: 10Juan90264) [19:30:11] (03CR) 10Urbanecm: [C: 03+2] "explicit PM approval for deployment: https://phabricator.wikimedia.org/T293946#7504713" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732705 (https://phabricator.wikimedia.org/T293946) (owner: 10Juan90264) [19:30:19] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: cdac608e84250207efeac9ea489a7e5be908ec70: Change votewiki language back to English (T292685) (duration: 00m 56s) [19:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:30:23] T292685: Carry out the 2021 fawiki elections on votewiki - https://phabricator.wikimedia.org/T292685 [19:30:26] Annir: should be live! [19:30:28] anything else? 
[19:30:28] (03PS4) 10Herron: role::elasticsearch::cirrus: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/736859 (https://phabricator.wikimedia.org/T288620) [19:30:58] (03Merged) 10jenkins-bot: Enable talk for mobile users on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/732705 (https://phabricator.wikimedia.org/T293946) (owner: 10Juan90264) [19:31:20] Thanks, [19:31:26] Great merged [19:31:29] np [19:31:30] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6001.drmrs.wmnet with OS buster [19:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:31:38] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster [19:31:38] if you type '/last duration' in irssi you get all the MW deploys ordered by timestamp and how long they took :) [19:31:50] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:17] Juan_90264: your patch is at mwdebug1001, can you test please? [19:32:30] Urbanecm: Yes, I can [19:32:34] go ahead then :) [19:35:39] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:55] Urbanecm: I tested and approved [19:35:59] great [19:36:04] syncing [19:37:18] !log urbanecm@deploy1002 Synchronized wmf-config/InitialiseSettings.php: 898ebb1e8400a759ffc5553794f6a7200c97bf49: Enable talk for mobile users on enwiki (T293946) (duration: 00m 57s) [19:37:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:37:22] T293946: Enable talk for mobile users on enwiki - https://phabricator.wikimedia.org/T293946 [19:37:23] should be live now [19:37:26] anything else, anyone? [19:45:00] !log UTC evening B&C window done [19:45:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [19:46:01] !log mwdebug-deploy@deploy1002 helmfile [eqiad] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . [19:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:37] !log revoked all grants from wikiadmins and gave back explicit list on db2101:3315 (T249683) [19:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:41] T249683: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 [19:49:48] !log mwdebug-deploy@deploy1002 helmfile [codfw] Ran 'sync' command on namespace 'mwdebug' for release 'pinkunicorn' . 
[19:49:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [19:52:40] (03PS1) 10Ppchelko: Demo: how group permissions could look like [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738992 [19:54:19] (03CR) 10Ryan Kemper: [C: 03+1] "Testing on elastic1049 went well, let's ship it" [puppet] - 10https://gerrit.wikimedia.org/r/736859 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [19:54:23] (03PS1) 10RLazarus: Add .eggs/ to flake8 exclude [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/738993 [19:56:02] (03PS1) 10BBlack: Add drmrs networks to webproxy squid config [puppet] - 10https://gerrit.wikimedia.org/r/738994 (https://phabricator.wikimedia.org/T282787) [19:56:10] (03CR) 10Ppchelko: [V: 03+2 C: 04-2] "For demo purposes only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738992 (owner: 10Ppchelko) [19:57:20] !log mmandere@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host cp6001.drmrs.wmnet with OS buster [19:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:31] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster executed with errors: - cp6001... [19:57:53] Thanks Urbanecm! 
[19:59:53] (03CR) 10RLazarus: [C: 03+2] Add .eggs/ to flake8 exclude [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/738993 (owner: 10RLazarus) [20:01:54] (03Merged) 10jenkins-bot: Add .eggs/ to flake8 exclude [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/738993 (owner: 10RLazarus) [20:02:44] (03CR) 10BBlack: [C: 03+2] Add drmrs networks to webproxy squid config [puppet] - 10https://gerrit.wikimedia.org/r/738994 (https://phabricator.wikimedia.org/T282787) (owner: 10BBlack) [20:03:24] !log revoked all grants from wikiadmin and gave back an explicit list on db1102:3312 (T249683) [20:03:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:28] T249683: Redefine mysql GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 [20:04:46] (03PS1) 10RLazarus: Initial deb package [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/738996 [20:05:47] (03Abandoned) 10RLazarus: Initial deb package [docker-images/imagecatalog] - 10https://gerrit.wikimedia.org/r/738996 (owner: 10RLazarus) [20:07:31] !log mmandere@cumin1001 START - Cookbook sre.hosts.reimage for host cp6001.drmrs.wmnet with OS buster [20:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:07:40] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster [20:08:07] !log revoked all grants from wikiadmin and gave back an explicit list on clouddb1021:3311 (T249683) [20:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:09] !log revoked all grants from wikiadmin and gave back an explicit list on clouddb1013:3311 (T249683) [20:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:09:13] T249683: Redefine mysql 
GRANTs for wikiadmin - https://phabricator.wikimedia.org/T249683 [20:11:25] 10SRE, 10ops-codfw, 10DC-Ops, 10RESTBase, 10Platform Team Workboards (Platform Engineering Reliability): Q2:(Need By: TBD) rack/setup/install restbase202[456].codfw.wmnet - https://phabricator.wikimedia.org/T294377 (10Papaul) @eevans sorry was on vacation just got back today. Thank you for the clarificat... [20:14:13] (03PS1) 10Kosta Harlan: labs: Setup GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738998 (https://phabricator.wikimedia.org/T294737) [20:14:15] (03PS1) 10Kosta Harlan: GrowthExperiments: Set up GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738999 (https://phabricator.wikimedia.org/T294737) [20:14:43] (03CR) 10Dzahn: "nobody is getting emails for that though, it relies only on IRC" [puppet] - 10https://gerrit.wikimedia.org/r/732968 (https://phabricator.wikimedia.org/T151642) (owner: 10Hashar) [20:14:58] (03PS2) 104nn1l2: Disable local file upload on the Chinese Wikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738550 (https://phabricator.wikimedia.org/T295265) [20:15:11] (03CR) 10jerkins-bot: [V: 04-1] GrowthExperiments: Set up GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738999 (https://phabricator.wikimedia.org/T294737) (owner: 10Kosta Harlan) [20:16:56] (03PS1) 10Kosta Harlan: GrowthExperiments: Start imagerecommendation variant experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739000 (https://phabricator.wikimedia.org/T294737) [20:17:52] (03CR) 10jerkins-bot: [V: 04-1] GrowthExperiments: Start imagerecommendation variant experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739000 (https://phabricator.wikimedia.org/T294737) (owner: 10Kosta Harlan) [20:18:09] (03CR) 10Kosta Harlan: [C: 04-2] "Scheduled for November 29, tentatively." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/739000 (https://phabricator.wikimedia.org/T294737) (owner: 10Kosta Harlan) [20:21:54] (03PS2) 10Kosta Harlan: labs: Setup GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738998 (https://phabricator.wikimedia.org/T294737) [20:21:56] (03PS2) 10Kosta Harlan: GrowthExperiments: Set up GEHomepageNewAccountVariantsByPlatform [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738999 (https://phabricator.wikimedia.org/T294737) [20:22:05] (03PS2) 10Kosta Harlan: GrowthExperiments: Start imagerecommendation variant experiment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739000 (https://phabricator.wikimedia.org/T294737) [20:36:37] (03CR) 10Dzahn: cloud/devtools: fix resolv.conf search path (wmflabs->wikimedia.cloud) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738005 (https://phabricator.wikimedia.org/T294174) (owner: 10Dzahn) [20:38:52] (03CR) 10Dzahn: cloud/devtools: fix resolv.conf search path (wmflabs->wikimedia.cloud) (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738005 (https://phabricator.wikimedia.org/T294174) (owner: 10Dzahn) [20:39:04] (03PS1) 10Dzahn: mediawiki: remove font packages from appservers [puppet] - 10https://gerrit.wikimedia.org/r/739002 (https://phabricator.wikimedia.org/T294378) [20:44:33] (03CR) 10Herron: profile: drop successful access logs for shellbox-constraints (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/738959 (https://phabricator.wikimedia.org/T295627) (owner: 10Cwhite) [20:45:29] (03PS3) 10Dzahn: trafficserver: remove scholarships.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/737979 (https://phabricator.wikimedia.org/T243037) [20:45:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - 
https://alerts.wikimedia.org [20:47:06] (03CR) 10Dzahn: [C: 03+2] "announced in today's SRE meeting that this is going away, just doing it" [puppet] - 10https://gerrit.wikimedia.org/r/737979 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [20:47:11] (03CR) 10Herron: [C: 03+2] role::elasticsearch::cirrus: ship ES logs via gelf_relay [puppet] - 10https://gerrit.wikimedia.org/r/736859 (https://phabricator.wikimedia.org/T288620) (owner: 10Herron) [20:49:06] !log mmandere@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host cp6001.drmrs.wmnet with OS buster [20:49:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:15] 10SRE, 10Traffic, 10Patch-For-Review: Configure dns and puppet repositories for new drmrs datacenter - https://phabricator.wikimedia.org/T282787 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by mmandere@cumin1001 for host cp6001.drmrs.wmnet with OS buster completed: - cp6001 (**WARN**)... [20:49:51] !log retiring https://scholarships.wikimedia.org - removing from ATS (T243037) [20:49:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:49:54] T243037: Shutdown scholarships.wikimedia.org and archive project - https://phabricator.wikimedia.org/T243037 [20:50:22] (03CR) 10Dzahn: "ok on cp1079 - -map http://scholarships.wikimedia.org https://webserver-misc-apps.discovery.wmnet" [puppet] - 10https://gerrit.wikimedia.org/r/737979 (https://phabricator.wikimedia.org/T243037) (owner: 10Dzahn) [20:50:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [21:00:04] chrisalbon and accraze: Your horoscope predicts another unfortunate Services – Graphoid / ORES deploy. May Zuul be (nice) with you. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211115T2100). [21:02:11] (03PS3) 10Cwhite: profile: drop successful access logs for shellbox-constraints [puppet] - 10https://gerrit.wikimedia.org/r/738959 (https://phabricator.wikimedia.org/T295627) [21:02:45] (03CR) 10Cwhite: profile: drop successful access logs for shellbox-constraints (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/738959 (https://phabricator.wikimedia.org/T295627) (owner: 10Cwhite) [21:12:50] RECOVERY - BGP status on cr4-ulsfo is OK: BGP OK - up: 98, down: 0, shutdown: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [21:13:18] (03CR) 10Herron: [C: 03+1] profile: drop successful access logs for shellbox-constraints (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/738959 (https://phabricator.wikimedia.org/T295627) (owner: 10Cwhite) [21:17:49] (03CR) 10Dzahn: [C: 03+2] mediawiki: remove font packages from appservers [puppet] - 10https://gerrit.wikimedia.org/r/739002 (https://phabricator.wikimedia.org/T294378) (owner: 10Dzahn) [21:19:46] !log removing mediawiki font packages from remaining regular appservers globally (T294378) [21:19:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:19:50] T294378: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 [21:21:57] !log uploaded php7.4_7.4.25-1+wmf2+buster1_amd64.changes to apt.wm.o with patch for T293568 [21:22:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:22:01] T293568: PHP Notice: Undefined offset in wikimedia/remex-html when rendering rest.php error page - https://phabricator.wikimedia.org/T293568 [21:32:38] !log dns6001 - reboot for another round of bios fixups [21:32:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:39:08] (03PS1) 10Nray: We need some way to distinguish namespaces [extensions/WikimediaEvents] (wmf/1.38.0-wmf.7) - 
10https://gerrit.wikimedia.org/r/739004 (https://phabricator.wikimedia.org/T294738) [21:41:08] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10Dzahn) a:05DAbad→03None [21:43:44] (03PS1) 10Legoktm: Rebuild PHP 7.4 images for T293568 patch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/739005 (https://phabricator.wikimedia.org/T293568) [21:44:04] 10SRE, 10SRE-Access-Requests: Requesting access to analytics-privatedata-users for DAbad - https://phabricator.wikimedia.org/T293253 (10Dzahn) a:03Jelto thanks for pasting the key @DAbad ! assigning over to Jelto based on our rotating clinic duty for access requests [21:44:28] (03CR) 10Legoktm: [V: 03+2 C: 03+2] Rebuild PHP 7.4 images for T293568 patch [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/739005 (https://phabricator.wikimedia.org/T293568) (owner: 10Legoktm) [21:44:55] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [21:45:46] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:46:03] (03PS1) 10Legoktm: Remove PHP 7.3 images [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/739006 [21:46:04] !log dns6002 - reboot for another round of bios fixups [21:46:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:47:38] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [21:49:55] (LogstashKafkaConsumerLag) resolved: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad - https://alerts.wikimedia.org [21:54:59] 10SRE, 10vm-requests, 10Patch-For-Review: eqiad: 2 VMs for cloudbackup-dev - https://phabricator.wikimedia.org/T295584 (10Dzahn) sounds reasonable to me. As @MoritzMuehlenhoff pointed out on previous requests we should not go under a certain size for the OS partition. Assuming you don't mind if it's 10G or... [21:55:21] (03CR) 10Dzahn: "the file under hieradata/ should have a .yaml extension instead of .pp" [puppet] - 10https://gerrit.wikimedia.org/r/738376 (https://phabricator.wikimedia.org/T295584) (owner: 10Arturo Borrero Gonzalez) [21:57:29] (03CR) 10Cwhite: [C: 03+2] profile: drop successful access logs for shellbox-constraints [puppet] - 10https://gerrit.wikimedia.org/r/738959 (https://phabricator.wikimedia.org/T295627) (owner: 10Cwhite) [21:57:38] 10SRE, 10serviceops, 10Patch-For-Review: Remove mediawiki::packages::fonts from non thumbor servers - https://phabricator.wikimedia.org/T294378 (10Dzahn) 05Open→03Resolved a:03Dzahn This has been completed just now: https://debmonitor.wikimedia.org/packages/fonts-vlgothic [21:59:12] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=atlas_exporter site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:00:04] Reedy and sbassett: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20211115T2200). 
[22:00:24] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:00:37] (03CR) 10BryanDavis: python39: Use shell reimplementation of webservice-runner (034 comments) [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/738503 (https://phabricator.wikimedia.org/T293552) (owner: 10Legoktm) [22:04:24] (03PS1) 10Dzahn: mediawiki/parsoid/wikitech: flip default for font install [puppet] - 10https://gerrit.wikimedia.org/r/739012 (https://phabricator.wikimedia.org/T294378) [22:06:31] (03PS2) 10Dzahn: mediawiki/parsoid/wikitech: flip default for font install [puppet] - 10https://gerrit.wikimedia.org/r/739012 (https://phabricator.wikimedia.org/T294378) [22:15:31] (03PS1) 10Legoktm: Move thumbor1005 from insetup to thumbor::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/739014 (https://phabricator.wikimedia.org/T285477) [22:16:26] (03CR) 10Legoktm: [C: 03+2] Move thumbor1005 from insetup to thumbor::mediawiki [puppet] - 10https://gerrit.wikimedia.org/r/739014 (https://phabricator.wikimedia.org/T285477) (owner: 10Legoktm) [22:33:57] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=sidekiq site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:35:23] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [22:36:26] 10SRE, 10ops-eqiad, 10DC-Ops, 10Machine-Learning-Team: Q1:(Need By: TBD) rack/setup/install ml-train100[1-4] - https://phabricator.wikimedia.org/T291579 (10Jclark-ctr) [22:46:40] 10SRE, 10Scap, 10Release-Engineering-Team (Seen): find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470 (10Legoktm) Hit this today when setting up new thumbor servers. What I don't really understand is where it's getting deploy1001 these d... [22:52:44] 10SRE, 10Scap, 10Release-Engineering-Team (Seen): find a way to systematically update the deployment server name across all repos - https://phabricator.wikimedia.org/T197470 (10bd808) >>! In T197470#7505335, @Legoktm wrote: > Hit this today when setting up new thumbor servers. What I don't really understand... [22:54:39] PROBLEM - Memcached on thumbor1005 is CRITICAL: connect to address 10.64.0.161 and port 11211: Connection refused https://wikitech.wikimedia.org/wiki/Memcached [22:56:36] (03PS3) 10Ideophagous: arywiki NS [mediawiki-config] - 10https://gerrit.wikimedia.org/r/738876 (https://phabricator.wikimedia.org/T291737) [22:58:56] !log legoktm@cumin1001 START - Cookbook sre.hosts.downtime for 4:00:00 on thumbor1005.eqiad.wmnet with reason: reboot after first puppet run [22:58:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:58:57] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 4:00:00 on thumbor1005.eqiad.wmnet with reason: reboot after first puppet run [22:58:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:59:36] !log legoktm@cumin1001 START - Cookbook sre.hosts.reboot-single for host thumbor1005.eqiad.wmnet [22:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:00:31] (03PS1) 
10Cwhite: logstash: amend gitlab sidekiq log mutation to use new format [puppet] - 10https://gerrit.wikimedia.org/r/739018 (https://phabricator.wikimedia.org/T295731) [23:03:07] RECOVERY - Memcached on thumbor1005 is OK: TCP OK - 0.000 second response time on 10.64.0.161 port 11211 https://wikitech.wikimedia.org/wiki/Memcached [23:10:16] !log legoktm@cumin1001 END (PASS) - Cookbook sre.hosts.reboot-single (exit_code=0) for host thumbor1005.eqiad.wmnet [23:10:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:35] (03PS2) 10Cwhite: logstash: reconstruct gitlab sidekiq message field [puppet] - 10https://gerrit.wikimedia.org/r/739018 (https://phabricator.wikimedia.org/T295731) [23:51:37] 10SRE, 10LDAP-Access-Requests: Grant Access to LDAP/WMF for JKieserman - https://phabricator.wikimedia.org/T295693 (10Dzahn) Hello Julia (@JKieserman,) welcome to Wikimedia! Could you let us know what your role at WMF will be? That would help us get a better understanding which logins you might need. So far... [23:53:39] (03PS1) 10Gergő Tisza: [beta] Disable GrowthExperiments Add Link on all but enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739023 [23:58:34] (03CR) 10Urbanecm: [C: 03+1] "lgtm" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/739023 (owner: 10Gergő Tisza)