[00:18:30] (Not accepting/receiving prefixes from anycast BGP peer) firing: (2) Alert for device cr1-eqiad.wikimedia.org - Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [00:37:22] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [00:38:48] (03PS1) 10TrainBranchBot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/933683 [00:38:54] (03CR) 10TrainBranchBot: [C: 03+2] Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/933683 (owner: 10TrainBranchBot) [00:42:00] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:00:23] (03Merged) 10jenkins-bot: Branch commit for wmf/branch_cut_pretest [core] (wmf/branch_cut_pretest) - 10https://gerrit.wikimedia.org/r/933683 (owner: 10TrainBranchBot) [01:07:42] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:12:18] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [01:17:36] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - 
https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [01:28:27] (03PS1) 10Andrew Bogott: Designate/pdns: allow designate hosts to access the pdns rest api [puppet] - 10https://gerrit.wikimedia.org/r/934427 (https://phabricator.wikimedia.org/T338779) [01:30:38] (03CR) 10Andrew Bogott: [C: 03+2] Designate/pdns: allow designate hosts to access the pdns rest api [puppet] - 10https://gerrit.wikimedia.org/r/934427 (https://phabricator.wikimedia.org/T338779) (owner: 10Andrew Bogott) [01:58:30] (Not accepting/receiving prefixes from anycast BGP peer) resolved: (2) Device cr1-eqiad.wikimedia.org recovered from Not accepting/receiving prefixes from anycast BGP peer - https://alerts.wikimedia.org/?q=alertname%3DNot+accepting%2Freceiving+prefixes+from+anycast+BGP+peer [02:00:48] RECOVERY - Check systemd state on mwlog2002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:01:32] RECOVERY - Check systemd state on mwlog1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:07:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:27:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [02:29:02] (03PS1) 10Andrew Bogott: Designate/pdns: allow designate and pdns hosts to access mdns for axfr [puppet] - 
10https://gerrit.wikimedia.org/r/934430 (https://phabricator.wikimedia.org/T338779) [02:31:22] (03CR) 10Andrew Bogott: [C: 03+2] Designate/pdns: allow designate and pdns hosts to access mdns for axfr [puppet] - 10https://gerrit.wikimedia.org/r/934430 (https://phabricator.wikimedia.org/T338779) (owner: 10Andrew Bogott) [02:50:16] (MediaWikiLatencyExceeded) firing: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [02:55:16] (MediaWikiLatencyExceeded) resolved: Average latency high: eqiad parsoid GET/200 - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook#Average_latency_exceeded - https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus/ops&var-cluster=parsoid&var-method=GET - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiLatencyExceeded [04:04:08] (03PS2) 10KartikMistry: Update MinT to 2023-06-29-061037-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/933698 (https://phabricator.wikimedia.org/T340709) [05:37:48] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [05:42:26] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [06:00:05] Deploy window MediaWiki infrastucture (UTC early) 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230630T0600) [06:27:36] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [06:37:13] (03CR) 10Muehlenhoff: Drop deploy-service group (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/934411 (https://phabricator.wikimedia.org/T340165) (owner: 10Majavah) [06:40:27] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on urldownloader[1001-1002].wikimedia.org with reason: Setup in progress [06:40:46] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on urldownloader[1001-1002].wikimedia.org with reason: Setup in progress [06:41:17] !log jmm@cumin2002 START - Cookbook sre.hosts.downtime for 7 days, 0:00:00 on urldownloader[2001-2002].wikimedia.org with reason: pending decom [06:41:34] !log jmm@cumin2002 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 7 days, 0:00:00 on urldownloader[2001-2002].wikimedia.org with reason: pending decom [06:41:40] 10SRE, 10Infrastructure-Foundations: Migrate the URL downloaders to Bullseye - https://phabricator.wikimedia.org/T329945 (10ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=04703ca5-7468-4229-a4bf-5a47b58763e0) set by jmm@cumin2002 for 7 days, 0:00:00 on 2 host(s) and their services with reason... 
[06:50:58] PROBLEM - Check systemd state on netbox1002 is CRITICAL: CRITICAL - degraded: The following units failed: netbox_report_accounting_run.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [06:52:06] (03CR) 10Muehlenhoff: "Adding Jesse as well, who recently refactored the ordering within the apt classes" [puppet] - 10https://gerrit.wikimedia.org/r/934409 (owner: 10Majavah) [06:54:40] (03Abandoned) 10Muehlenhoff: Add a ferm module to Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/658972 (owner: 10Muehlenhoff) [06:55:33] 10SRE, 10Infrastructure-Foundations: Integrate Bullseye 11.7 point update - https://phabricator.wikimedia.org/T335575 (10MoritzMuehlenhoff) [07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230630T0700) [07:02:32] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:02:34] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:08:36] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8571 bytes in 0.279 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:08:38] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50134 bytes in 0.107 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [07:14:14] RECOVERY - Check systemd state on netbox1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:17:49] (03CR) 10JMeybohm: [C: 03+2] Add README, enhance changelog and switch to source format 3 [debs/envoyproxy] (v1.26) - 10https://gerrit.wikimedia.org/r/922837 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [07:20:19] 
10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) [07:22:38] PROBLEM - WDQS SPARQL on wdqs2009 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [07:27:31] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10MatthewVernon) I have no problem with that (but I was just the clinician when this ticket came in), if the other people you tagged are also... [07:29:47] (03CR) 10MVernon: [C: 03+2] swift: roll object_expirer into cluster_info (remove profile) [puppet] - 10https://gerrit.wikimedia.org/r/933471 (https://phabricator.wikimedia.org/T229584) (owner: 10MVernon) [07:36:28] RECOVERY - WDQS SPARQL on wdqs2009 is OK: HTTP OK: HTTP/1.1 200 OK - 688 bytes in 1.181 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [07:42:36] (03CR) 10Hashar: apt: Ensure sources.list is updated before apt-get update (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/934409 (owner: 10Majavah) [07:43:23] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10SLyngshede-WMF) No problem here either. 
[07:52:32] !log removed docker-registry.discovery.wmnet/envoy-future:1.26.1-1 - T300324 [07:52:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:37] T300324: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 [07:55:53] 10SRE, 10Traffic, 10envoy, 10serviceops, 10Patch-For-Review: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 (10JMeybohm) > #wikimedia-operations: removed docker-registry.discovery.wmnet/envoy-future:1.26.1-1 - T300324 Since 1.24, envoy required libc 2.29 and buste... [08:00:56] !log rolled back envoyproxy package in buster-wikimedia component/envoy-future to 1.18.3-1 - T300324 [08:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:02] T300324: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 [08:02:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:07:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [08:27:58] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2020 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [08:32:34] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war 
https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [08:56:57] 10SRE, 10Dumps-Generation, 10Wikidata, 10observability, and 2 others: various weekly and daily dumps run from systemd timers are broken - https://phabricator.wikimedia.org/T281267 (10fgiunchedi) >>! In T281267#8954763, @ArielGlenn wrote: > @fgiunchedi I notice that in some cases phab tasks are autocreated... [09:04:29] (03CR) 10Jbond: "lgtm" [puppet-lint/wmf_styleguide-check] - 10https://gerrit.wikimedia.org/r/933920 (owner: 10JHathaway) [09:18:41] (03CR) 10Jcrespo: [C: 03+1] mariadb: Reenable db1145 notifications [puppet] - 10https://gerrit.wikimedia.org/r/934341 (https://phabricator.wikimedia.org/T340610) (owner: 10Jcrespo) [09:19:44] (03PS2) 10Jcrespo: mariadb: Reenable db1145 notifications [puppet] - 10https://gerrit.wikimedia.org/r/934341 (https://phabricator.wikimedia.org/T340610) [09:21:11] (03CR) 10Jcrespo: [C: 03+2] mariadb: Reenable db1145 notifications [puppet] - 10https://gerrit.wikimedia.org/r/934341 (https://phabricator.wikimedia.org/T340610) (owner: 10Jcrespo) [09:21:25] 10Puppet, 10SRE, 10Infrastructure-Foundations, 10Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908 (10jbond) @Dzahn this is what i wrote on T95377#8435033 >>! In T95377#8433934, @Dzahn wrote: > @jbond and all. I wonder what you would think about thi... 
[09:21:47] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908 (10jbond) [09:27:52] (03Abandoned) 10Jbond: icinga: move client_auth_puppet_post to use wmf_check_http [puppet] - 10https://gerrit.wikimedia.org/r/773279 (https://phabricator.wikimedia.org/T304321) (owner: 10Jbond) [09:28:05] (03PS1) 10JMeybohm: New upstream version 1.23.10 [debs/envoyproxy] (v1.23) - 10https://gerrit.wikimedia.org/r/934490 (https://phabricator.wikimedia.org/T300324) [09:28:43] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] New upstream version 1.23.10 [debs/envoyproxy] (v1.23) - 10https://gerrit.wikimedia.org/r/934490 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [09:33:32] (03PS1) 10JMeybohm: Switch back to source format 1.0 [debs/envoyproxy] (v1.23) - 10https://gerrit.wikimedia.org/r/934491 [09:40:07] (03PS1) 10Arturo Borrero Gonzalez: wmcs: cloud-private: store FQDN in hiera [puppet] - 10https://gerrit.wikimedia.org/r/934492 [09:42:51] (03PS1) 10Arturo Borrero Gonzalez: galera: allow to set a different local node name / address [puppet] - 10https://gerrit.wikimedia.org/r/934493 (https://phabricator.wikimedia.org/T340791) [09:44:52] (03CR) 10CI reject: [V: 04-1] galera: allow to set a different local node name / address [puppet] - 10https://gerrit.wikimedia.org/r/934493 (https://phabricator.wikimedia.org/T340791) (owner: 10Arturo Borrero Gonzalez) [09:45:13] (03CR) 10Alexandros Kosiaris: [C: 04-1] Switch back to source format 1.0 (031 comment) [debs/envoyproxy] (v1.23) - 10https://gerrit.wikimedia.org/r/934491 (owner: 10JMeybohm) [09:45:44] (03PS1) 10JMeybohm: Downgrade envoy-future from 1.26 to 1.23 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/934494 (https://phabricator.wikimedia.org/T300324) [09:45:54] (03CR) 10Alexandros Kosiaris: [C: 03+2] "And quilt requires a source package, let's move back to 1.0 instead and not delve on this one too 
long" [debs/envoyproxy] (v1.23) - 10https://gerrit.wikimedia.org/r/934491 (owner: 10JMeybohm) [09:48:23] (03CR) 10Alexandros Kosiaris: [C: 03+1] "😞" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/934494 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [09:50:00] (03PS2) 10Arturo Borrero Gonzalez: galera: allow to set a different local node name / address [puppet] - 10https://gerrit.wikimedia.org/r/934493 (https://phabricator.wikimedia.org/T340791) [09:52:03] (03CR) 10CI reject: [V: 04-1] galera: allow to set a different local node name / address [puppet] - 10https://gerrit.wikimedia.org/r/934493 (https://phabricator.wikimedia.org/T340791) (owner: 10Arturo Borrero Gonzalez) [09:56:21] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/934493/42154/" [puppet] - 10https://gerrit.wikimedia.org/r/934492 (owner: 10Arturo Borrero Gonzalez) [09:59:32] (03CR) 10Jbond: [C: 03+1] apt: Ensure sources.list is updated before apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/934409 (owner: 10Majavah) [09:59:52] (03PS3) 10Arturo Borrero Gonzalez: galera: allow to set a different local node name / address [puppet] - 10https://gerrit.wikimedia.org/r/934493 (https://phabricator.wikimedia.org/T340791) [10:04:21] (03PS2) 10Arturo Borrero Gonzalez: wmcs: cloud-private: store FQDN in hiera [puppet] - 10https://gerrit.wikimedia.org/r/934492 [10:04:23] (03PS4) 10Arturo Borrero Gonzalez: galera: allow to set a different local node name / address [puppet] - 10https://gerrit.wikimedia.org/r/934493 (https://phabricator.wikimedia.org/T340791) [10:04:45] (03PS2) 10Jbond: sre.hardware: Add support for adding csrf-token [cookbooks] - 10https://gerrit.wikimedia.org/r/933990 [10:07:16] 10SRE, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10serviceops, 10Growth-Team (Current Sprint): linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - 
https://phabricator.wikimedia.org/T340780 (10Urbanecm_WMF) [10:07:22] (03CR) 10CI reject: [V: 04-1] sre.hardware: Add support for adding csrf-token [cookbooks] - 10https://gerrit.wikimedia.org/r/933990 (owner: 10Jbond) [10:08:35] (03PS5) 10Arturo Borrero Gonzalez: galera: allow to set a different local node name / address [puppet] - 10https://gerrit.wikimedia.org/r/934493 (https://phabricator.wikimedia.org/T340791) [10:08:45] (03PS3) 10Jbond: sre.hardware: Add support for adding csrf-token [cookbooks] - 10https://gerrit.wikimedia.org/r/933990 [10:08:51] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/934493/42156/" [puppet] - 10https://gerrit.wikimedia.org/r/934493 (https://phabricator.wikimedia.org/T340791) (owner: 10Arturo Borrero Gonzalez) [10:09:01] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC as expected: https://puppet-compiler.wmflabs.org/output/934493/42156/" [puppet] - 10https://gerrit.wikimedia.org/r/934492 (owner: 10Arturo Borrero Gonzalez) [10:10:25] 10SRE, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10serviceops, 10Growth-Team (Current Sprint): linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (10akosiaris) Since this is fixed, should we resolve this? Do we need a f... 
[10:11:08] (03CR) 10CI reject: [V: 04-1] sre.hardware: Add support for adding csrf-token [cookbooks] - 10https://gerrit.wikimedia.org/r/933990 (owner: 10Jbond) [10:12:00] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003'] [10:12:18] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] Downgrade envoy-future from 1.26 to 1.23 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/934494 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [10:13:42] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['sretest1003'] [10:14:53] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003'] [10:15:13] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['sretest1003'] [10:15:20] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003'] [10:15:22] (03PS1) 10Elukey: ml-services: add base autoscaling to revertrisk-multilingual [deployment-charts] - 10https://gerrit.wikimedia.org/r/934499 (https://phabricator.wikimedia.org/T340822) [10:15:46] PROBLEM - Host sretest1003 is DOWN: PING CRITICAL - Packet loss = 100% [10:17:04] (03CR) 10Elukey: [C: 03+2] ml-services: add base autoscaling to revertrisk-multilingual [deployment-charts] - 10https://gerrit.wikimedia.org/r/934499 (https://phabricator.wikimedia.org/T340822) (owner: 10Elukey) [10:18:10] 10SRE, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10serviceops, 10Growth-Team (Current Sprint): linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (10Urbanecm_WMF) >>! In T340780#8980138, @akosiaris wrote: > Since this i... 
[10:18:24] RECOVERY - Host sretest1003 is UP: PING OK - Packet loss = 0%, RTA = 0.31 ms [10:20:11] (03CR) 10Clément Goubert: [C: 03+1] opentelemetry-collector: Switch off unused default receivers and ports [deployment-charts] - 10https://gerrit.wikimedia.org/r/934420 (https://phabricator.wikimedia.org/T320564) (owner: 10RLazarus) [10:20:20] (03CR) 10David Caro: [C: 03+1] "LGTM, I have some questions though, not blockers in any case" [puppet] - 10https://gerrit.wikimedia.org/r/934493 (https://phabricator.wikimedia.org/T340791) (owner: 10Arturo Borrero Gonzalez) [10:20:36] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment shell group and nda LDAP for Superpes15 - https://phabricator.wikimedia.org/T338468 (10MatthewVernon) 05Open→03Resolved [10:20:50] !log elukey@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [10:22:28] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [10:22:39] (03PS4) 10Jbond: sre.hardware: Add support for adding csrf-token [cookbooks] - 10https://gerrit.wikimedia.org/r/933990 [10:24:50] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[10:24:58] (03CR) 10CI reject: [V: 04-1] sre.hardware: Add support for adding csrf-token [cookbooks] - 10https://gerrit.wikimedia.org/r/933990 (owner: 10Jbond) [10:26:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-codfw&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:27:36] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [10:30:59] (03PS6) 10Arturo Borrero Gonzalez: galera: allow to set a different local node name / address [puppet] - 10https://gerrit.wikimedia.org/r/934493 (https://phabricator.wikimedia.org/T340791) [10:31:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [10:45:14] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest1003'] [10:45:26] (03PS1) 10Effie Mouzeli: site.pp: Add kubestagemaster[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/934503 (https://phabricator.wikimedia.org/T329827) [10:48:50] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1] "PCC: https://puppet-compiler.wmflabs.org/output/934493/42157/" [puppet] - 10https://gerrit.wikimedia.org/r/934493 (https://phabricator.wikimedia.org/T340791) (owner: 10Arturo Borrero Gonzalez) [10:57:00] (03PS1) 10Hashar: contint: parameterize the docker lvm disk size [puppet] - 10https://gerrit.wikimedia.org/r/934505 (https://phabricator.wikimedia.org/T340070) [10:57:15] (03CR) 10JMeybohm: [C: 03+1] 
site.pp: Add kubestagemaster[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/934503 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [10:57:20] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] wmcs: cloud-private: store FQDN in hiera [puppet] - 10https://gerrit.wikimedia.org/r/934492 (owner: 10Arturo Borrero Gonzalez) [10:57:44] (03PS1) 10JMeybohm: envoy*: Fix envoy-basic-config, add tests [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/934506 (https://phabricator.wikimedia.org/T300324) [10:57:46] (03CR) 10Arturo Borrero Gonzalez: [V: 03+1 C: 03+2] galera: allow to set a different local node name / address [puppet] - 10https://gerrit.wikimedia.org/r/934493 (https://phabricator.wikimedia.org/T340791) (owner: 10Arturo Borrero Gonzalez) [10:58:58] (03CR) 10CI reject: [V: 04-1] contint: parameterize the docker lvm disk size [puppet] - 10https://gerrit.wikimedia.org/r/934505 (https://phabricator.wikimedia.org/T340070) (owner: 10Hashar) [11:01:39] (03PS2) 10Hashar: contint: parameterize the docker lvm disk size [puppet] - 10https://gerrit.wikimedia.org/r/934505 (https://phabricator.wikimedia.org/T340070) [11:02:22] (03PS1) 10Hnowlan: poolcounter: emit metrics to display the type of throttling [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/934508 [11:02:57] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [11:04:06] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003'] [11:04:42] (03CR) 10Clément Goubert: [C: 03+1] envoy*: Fix envoy-basic-config, add tests [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/934506 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [11:05:31] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['sretest1003'] 
[11:05:41] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003'] [11:05:45] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest1003'] [11:05:49] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003'] [11:07:14] (03CR) 10Alexandros Kosiaris: [C: 03+1] envoy*: Fix envoy-basic-config, add tests [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/934506 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [11:11:53] (03CR) 10JMeybohm: [V: 03+2 C: 03+2] envoy*: Fix envoy-basic-config, add tests [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/934506 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [11:14:16] !log imported envoyproxy 1.23.10 to component/envoy-future in buster-wikimedia - T300324 [11:14:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:14:21] T300324: Upgrade Envoy to supported version - https://phabricator.wikimedia.org/T300324 [11:14:47] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['sretest1003'] [11:15:20] !log published image docker-registry.discovery.wmnet/envoy:1.18.3-2-s3 and docker-registry.discovery.wmnet/envoy-future:1.23.10-1-s1 - T300324 [11:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:16:56] PROBLEM - Host sretest1003 is DOWN: PING CRITICAL - Packet loss = 100% [11:19:34] RECOVERY - Host sretest1003 is UP: PING OK - Packet loss = 0%, RTA = 0.26 ms [11:20:17] (03CR) 10Effie Mouzeli: [C: 03+2] site.pp: Add kubestagemaster[12]002 [puppet] - 10https://gerrit.wikimedia.org/r/934503 (https://phabricator.wikimedia.org/T329827) (owner: 10Effie Mouzeli) [11:22:38] PROBLEM - Host sretest1003 is DOWN: PING CRITICAL - Packet loss = 100% [11:22:48] RECOVERY - Host 
sretest1003 is UP: PING OK - Packet loss = 0%, RTA = 1.23 ms [11:22:50] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003'] [11:23:26] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=1) upgrade firmware for hosts ['sretest1003'] [11:23:33] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003'] [11:26:20] PROBLEM - Host sretest1003 is DOWN: PING CRITICAL - Packet loss = 100% [11:28:40] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubestagemaster1002.eqiad.wmnet with OS bullseye [11:28:44] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest1003'] [11:28:52] RECOVERY - Host sretest1003 is UP: PING OK - Packet loss = 0%, RTA = 0.32 ms [11:28:55] !log jiji@cumin1001 START - Cookbook sre.hosts.reimage for host kubestagemaster2002.codfw.wmnet with OS bullseye [11:31:53] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003'] [11:36:26] PROBLEM - Host sretest1003 is DOWN: PING CRITICAL - Packet loss = 100% [11:36:31] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster1002.eqiad.wmnet with reason: host reimage [11:37:34] (03PS1) 10JMeybohm: Add new mesh.deployment version 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934512 (https://phabricator.wikimedia.org/T300324) [11:37:36] (03PS1) 10JMeybohm: mesh.deployment: Allow to configure the envoy image name [deployment-charts] - 10https://gerrit.wikimedia.org/r/934513 (https://phabricator.wikimedia.org/T300324) [11:37:39] (03PS1) 10JMeybohm: mathoid: Update to mesh.deployment:1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934514 (https://phabricator.wikimedia.org/T300324) [11:38:26] (03CR) 10CI reject: [V: 04-1] mathoid: Update to mesh.deployment:1.2 [deployment-charts] - 
10https://gerrit.wikimedia.org/r/934514 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [11:38:28] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest1003'] [11:38:56] RECOVERY - Host sretest1003 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [11:39:06] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster1002.eqiad.wmnet with reason: host reimage [11:39:34] (03PS2) 10AOkoth: admin: add gengh to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/934389 (https://phabricator.wikimedia.org/T340614) [11:41:23] (03PS2) 10JMeybohm: Add new mesh.deployment version 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934512 (https://phabricator.wikimedia.org/T300324) [11:41:25] (03PS2) 10JMeybohm: mesh.deployment: Allow to configure the envoy image name [deployment-charts] - 10https://gerrit.wikimedia.org/r/934513 (https://phabricator.wikimedia.org/T300324) [11:41:27] (03PS2) 10JMeybohm: mathoid: Update to mesh.deployment:1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934514 (https://phabricator.wikimedia.org/T300324) [11:42:05] (03CR) 10jenkins-bot: mathoid: Update to mesh.deployment:1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934514 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [11:42:37] (03CR) 10Clément Goubert: [C: 03+1] Add new mesh.deployment version 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934512 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [11:43:23] (03CR) 10Clément Goubert: [C: 03+1] mesh.deployment: Allow to configure the envoy image name [deployment-charts] - 10https://gerrit.wikimedia.org/r/934513 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [11:44:44] 10SRE, 10Machine-Learning-Team, 10MinT, 10serviceops, and 2 others: New Service Deployment Request: NNLB-200 for machine translation - 
https://phabricator.wikimedia.org/T329971 (10Pginer-WMF) [11:48:56] (03PS1) 10Slyngshede: Allow users to update their email address. [software/bitu] - 10https://gerrit.wikimedia.org/r/934519 (https://phabricator.wikimedia.org/T340637) [11:51:25] !log jiji@cumin1001 START - Cookbook sre.hosts.downtime for 2:00:00 on kubestagemaster2002.codfw.wmnet with reason: host reimage [11:54:43] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) for 2:00:00 on kubestagemaster2002.codfw.wmnet with reason: host reimage [11:55:08] (03CR) 10Kosta Harlan: [C: 03+2] ipoid: Use date/time image version name [deployment-charts] - 10https://gerrit.wikimedia.org/r/933096 (https://phabricator.wikimedia.org/T336163) (owner: 10Kosta Harlan) [11:55:10] (03PS3) 10JMeybohm: mesh.deployment: Allow to configure the envoy image name [deployment-charts] - 10https://gerrit.wikimedia.org/r/934513 (https://phabricator.wikimedia.org/T300324) [11:55:12] (03PS3) 10JMeybohm: mathoid: Update to mesh.deployment:1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934514 (https://phabricator.wikimedia.org/T300324) [11:55:14] (03PS1) 10JMeybohm: mediawiki: Update to mesh.deployment 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934520 (https://phabricator.wikimedia.org/T300324) [11:56:08] (03Merged) 10jenkins-bot: ipoid: Use date/time image version name [deployment-charts] - 10https://gerrit.wikimedia.org/r/933096 (https://phabricator.wikimedia.org/T336163) (owner: 10Kosta Harlan) [11:57:22] (03CR) 10Clément Goubert: [C: 03+1] Add new mesh.deployment version 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934512 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [11:58:09] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2020 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [11:58:23] (03CR) 10Clément 
Goubert: [C: 03+1] mesh.deployment: Allow to configure the envoy image name [deployment-charts] - 10https://gerrit.wikimedia.org/r/934513 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [11:59:01] (03CR) 10Clément Goubert: [C: 03+1] mathoid: Update to mesh.deployment:1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934514 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [11:59:20] (03PS1) 10JMeybohm: mathoid: Use envoy-future in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/934521 (https://phabricator.wikimedia.org/T300324) [11:59:38] !log jiji@cumin1001 END (PASS) - Cookbook sre.hosts.reimage (exit_code=0) for host kubestagemaster1002.eqiad.wmnet with OS bullseye [12:01:29] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [12:02:02] (03CR) 10Clément Goubert: [C: 03+1] mediawiki: Update to mesh.deployment 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934520 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:03:32] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [12:04:00] (03CR) 10Clément Goubert: [C: 03+1] mathoid: Use envoy-future in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/934521 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:04:23] (03CR) 10JMeybohm: [C: 03+2] mathoid: Use envoy-future in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/934521 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:04:27] (03CR) 10JMeybohm: [C: 03+2] mediawiki: Update to mesh.deployment 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934520 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:04:29] (03CR) 10JMeybohm: [C: 03+2] mathoid: Update to 
mesh.deployment:1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934514 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:04:31] (03CR) 10JMeybohm: [C: 03+2] mesh.deployment: Allow to configure the envoy image name [deployment-charts] - 10https://gerrit.wikimedia.org/r/934513 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:04:33] (03CR) 10JMeybohm: [C: 03+2] Add new mesh.deployment version 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934512 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:05:52] (03Merged) 10jenkins-bot: Add new mesh.deployment version 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934512 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:05:54] (03Merged) 10jenkins-bot: mesh.deployment: Allow to configure the envoy image name [deployment-charts] - 10https://gerrit.wikimedia.org/r/934513 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:05:56] (03Merged) 10jenkins-bot: mathoid: Update to mesh.deployment:1.2 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934514 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:05:59] (03Merged) 10jenkins-bot: mediawiki: Update to mesh.deployment 1.2.1 [deployment-charts] - 10https://gerrit.wikimedia.org/r/934520 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:06:01] (03Merged) 10jenkins-bot: mathoid: Use envoy-future in staging [deployment-charts] - 10https://gerrit.wikimedia.org/r/934521 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [12:09:52] !log kharlan@deploy1002 helmfile [staging] START helmfile.d/services/ipoid: apply [12:10:06] !log kharlan@deploy1002 helmfile [staging] DONE helmfile.d/services/ipoid: apply [12:16:16] (03PS5) 10Jbond: sre.hardware: Add support for adding csrf-token [cookbooks] - 10https://gerrit.wikimedia.org/r/933990 [12:16:25] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware 
upgrade firmware for hosts ['sretest1003'] [12:17:04] !log jbond@cumin1001 END (FAIL) - Cookbook sre.hardware.upgrade-firmware (exit_code=99) upgrade firmware for hosts ['sretest1003'] [12:17:08] !log jbond@cumin1001 START - Cookbook sre.hardware.upgrade-firmware upgrade firmware for hosts ['sretest1003'] [12:18:33] PROBLEM - Host sretest1003 is DOWN: PING CRITICAL - Packet loss = 100% [12:18:42] (03CR) 10CI reject: [V: 04-1] sre.hardware: Add support for adding csrf-token [cookbooks] - 10https://gerrit.wikimedia.org/r/933990 (owner: 10Jbond) [12:20:02] !log jiji@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host kubestagemaster2002.codfw.wmnet with OS bullseye [12:22:33] RECOVERY - Host sretest1003 is UP: PING OK - Packet loss = 0%, RTA = 0.28 ms [12:22:33] !log jbond@cumin1001 END (PASS) - Cookbook sre.hardware.upgrade-firmware (exit_code=0) upgrade firmware for hosts ['sretest1003'] [12:24:56] (03PS6) 10Jbond: sre.hardware: Add support for adding csrf-token [cookbooks] - 10https://gerrit.wikimedia.org/r/933990 [12:27:08] (03CR) 10CI reject: [V: 04-1] sre.hardware: Add support for adding csrf-token [cookbooks] - 10https://gerrit.wikimedia.org/r/933990 (owner: 10Jbond) [12:30:02] !log bking@cumin1001 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host wdqs2021.codfw.wmnet with OS bullseye [12:39:24] !log kharlan@deploy1002 helmfile [eqiad] START helmfile.d/services/ipoid: apply [13:11:27] (03PS1) 10Jforrester: Fix bug in opening dialog [extensions/TimedMediaHandler] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934476 (https://phabricator.wikimedia.org/T340816) [13:15:08] (03PS1) 10Jforrester: wikifunctions: Drop all the comments and default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/934536 (https://phabricator.wikimedia.org/T297314) [13:15:10] (03PS1) 10Jforrester: wikifunctions: Add initial ENV values for orchestrator to talk to wiki and evaluator [deployment-charts] - 
10https://gerrit.wikimedia.org/r/934537 (https://phabricator.wikimedia.org/T297314) [13:15:43] (03CR) 10CI reject: [V: 04-1] wikifunctions: Drop all the comments and default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/934536 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [13:16:05] (03CR) 10CI reject: [V: 04-1] wikifunctions: Add initial ENV values for orchestrator to talk to wiki and evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/934537 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [13:17:38] (03CR) 10Elukey: [C: 03+1] ml-services: remove nsfw model [deployment-charts] - 10https://gerrit.wikimedia.org/r/934380 (https://phabricator.wikimedia.org/T331416) (owner: 10Ilias Sarantopoulos) [13:18:02] (03CR) 10Kamila Součková: [C: 03+1] "Is it obvious from the context that this is about throttling or should the word "throttle" be in the metric names? Otherwise LGTM." [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/934508 (owner: 10Hnowlan) [13:19:02] 10SRE, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10serviceops, 10Growth-Team (Current Sprint): linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (10akosiaris) 05Open→03Resolved a:03akosiaris >>! In T340780#898014... 
[13:19:28] (03PS2) 10Jforrester: wikifunctions: Drop all the comments and default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/934536 (https://phabricator.wikimedia.org/T297314) [13:19:30] (03PS2) 10Jforrester: wikifunctions: Add initial ENV values for orchestrator to talk to wiki and evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/934537 (https://phabricator.wikimedia.org/T297314) [13:20:04] (03CR) 10CI reject: [V: 04-1] wikifunctions: Drop all the comments and default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/934536 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [13:20:27] (03CR) 10CI reject: [V: 04-1] wikifunctions: Add initial ENV values for orchestrator to talk to wiki and evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/934537 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [13:20:39] (03CR) 10AOkoth: [C: 03+2] admin: add gengh to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/934389 (https://phabricator.wikimedia.org/T340614) (owner: 10AOkoth) [13:22:15] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for gengh - https://phabricator.wikimedia.org/T340614 (10Arnoldokoth) @gengh This should be all good now. Merged in gerrit and puppet too. 
[13:23:09] (03PS1) 10Andrew Bogott: wmcs-image-create: allow dhcp setting of resolvers on first boot [puppet] - 10https://gerrit.wikimedia.org/r/934539 [13:23:15] (03PS3) 10Jforrester: wikifunctions: Drop all the comments and default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/934536 (https://phabricator.wikimedia.org/T297314) [13:23:17] (03PS3) 10Jforrester: wikifunctions: Add initial ENV values for orchestrator to talk to wiki and evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/934537 (https://phabricator.wikimedia.org/T297314) [13:23:26] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply [13:23:47] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [13:25:31] (03Abandoned) 10Jforrester: blubberoid: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/842395 (owner: 10PipelineBot) [13:25:39] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: remove nsfw model [deployment-charts] - 10https://gerrit.wikimedia.org/r/934380 (https://phabricator.wikimedia.org/T331416) (owner: 10Ilias Sarantopoulos) [13:25:46] (03Abandoned) 10Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/924135 (https://phabricator.wikimedia.org/T337464) (owner: 10PipelineBot) [13:25:50] (03Abandoned) 10Jforrester: wikifeeds: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/927782 (owner: 10PipelineBot) [13:25:53] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/928921 (owner: 10PipelineBot) [13:25:56] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/928922 (owner: 10PipelineBot) [13:26:00] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/929751 (owner: 10PipelineBot) [13:26:04] (03Abandoned) 10Jforrester: 
mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/930586 (owner: 10PipelineBot) [13:26:08] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/931913 (owner: 10PipelineBot) [13:26:11] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/932808 (owner: 10PipelineBot) [13:26:14] (03Abandoned) 10Jforrester: mobileapps: pipeline bot promote [deployment-charts] - 10https://gerrit.wikimedia.org/r/933675 (owner: 10PipelineBot) [13:26:40] (03Merged) 10jenkins-bot: ml-services: remove nsfw model [deployment-charts] - 10https://gerrit.wikimedia.org/r/934380 (https://phabricator.wikimedia.org/T331416) (owner: 10Ilias Sarantopoulos) [13:30:38] 10SRE, 10Add-Link, 10GrowthExperiments-NewcomerTasks, 10serviceops, 10Growth-Team (Current Sprint): linkrecommendation kubernetes service is down with HTTP 504: "upstream request timeout" - https://phabricator.wikimedia.org/T340780 (10Urbanecm_WMF) a:05akosiaris→03Urbanecm_WMF [13:31:36] (03PS1) 10Lucas Werkmeister (WMDE): static-codereview: Link docs for finding links [puppet] - 10https://gerrit.wikimedia.org/r/934543 [13:32:22] (03CR) 10Reedy: [C: 03+1] static-codereview: Link docs for finding links [puppet] - 10https://gerrit.wikimedia.org/r/934543 (owner: 10Lucas Werkmeister (WMDE)) [13:33:03] (ProbeDown) firing: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:38:03] (ProbeDown) resolved: Service centrallog1002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog1002:6514 - 
https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [13:43:54] (03CR) 10Alexandros Kosiaris: [C: 04-1] "I wouldn't remove the service, I see the following" [puppet] - 10https://gerrit.wikimedia.org/r/934411 (https://phabricator.wikimedia.org/T340165) (owner: 10Majavah) [13:47:31] 10SRE, 10SRE-Access-Requests, 10serviceops, 10Patch-For-Review: Drop the `deploy-service` right, move three included users to `deployment` (or drop access)? - https://phabricator.wikimedia.org/T340165 (10akosiaris) `deployment` is the group to be used for deploying to k8s. Initially we had targetted `wikid... [13:49:29] (03CR) 10Ssingh: "recheck" [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934377 (owner: 10Ssingh) [13:50:27] (03CR) 10CI reject: [V: 04-1] Release dnsdist 1.8.0-1+wmf11u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934377 (owner: 10Ssingh) [13:51:36] (03PS2) 10Ssingh: Release dnsdist 1.8.0-1+wmf11u1 [debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934377 [13:55:03] (03CR) 10JHathaway: site.pp: Drop wmnet domain and always use regexes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [13:58:35] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2020 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:03:11] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:04:36] (03CR) 10Ssingh: "Ready for review." 
[debs/dnsdist] - 10https://gerrit.wikimedia.org/r/934377 (owner: 10Ssingh) [14:05:03] (03CR) 10Andrew Bogott: [C: 03+2] apt: Ensure sources.list is updated before apt-get update [puppet] - 10https://gerrit.wikimedia.org/r/934409 (owner: 10Majavah) [14:07:19] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:07:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:11:57] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:14:23] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . 
[14:17:36] (JobUnavailable) firing: (4) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [14:20:33] (03PS2) 10Andrew Bogott: wmcs-image-create: allow dhcp setting of resolvers on first boot [puppet] - 10https://gerrit.wikimedia.org/r/934539 [14:20:35] (03PS1) 10Andrew Bogott: cloud-init: apt_preserve_sources_list=True [puppet] - 10https://gerrit.wikimedia.org/r/934550 (https://phabricator.wikimedia.org/T340814) [14:22:14] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-image-create: allow dhcp setting of resolvers on first boot [puppet] - 10https://gerrit.wikimedia.org/r/934539 (owner: 10Andrew Bogott) [14:22:21] (03CR) 10Andrew Bogott: [C: 03+2] cloud-init: apt_preserve_sources_list=True [puppet] - 10https://gerrit.wikimedia.org/r/934550 (https://phabricator.wikimedia.org/T340814) (owner: 10Andrew Bogott) [14:30:03] !log jhancock@cumin2002 START - Cookbook sre.hosts.reimage for host an-worker1149.eqiad.wmnet with OS bullseye [14:30:10] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage was started by jhancock@cumin2002 for host an-worker1149.eqiad.wmnet with OS bullseye [14:31:03] (03PS1) 10Effie Mouzeli: (WIP) service: Add kubestagemaster service (#1) [puppet] - 10https://gerrit.wikimedia.org/r/934552 (https://phabricator.wikimedia.org/T329827) [14:31:41] Hey all - I know it’s Friday, but I’d like to get a quick update to a mitigation in PS.php for T337593 deployed if I can.  Just adding a new stewards-confirmed IP range.  Should be pretty low-risk. 
[14:31:42] T337593: Create a cli tool for reporting on various health and security metrics of a given Wikimedia repository - https://phabricator.wikimedia.org/T337593 [14:34:25] (03CR) 10JHathaway: apt: Ensure sources.list is updated before apt-get update (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/934409 (owner: 10Majavah) [14:42:42] !log Deployed updated mitigation for T337593 [14:42:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:47] T337593: Create a cli tool for reporting on various health and security metrics of a given Wikimedia repository - https://phabricator.wikimedia.org/T337593 [14:43:52] !log jiji@cumin1001 conftool action : γετ; selector: service=kube-apiserver [14:49:36] is that a new conftool feature? [14:53:09] anything conftool is greek to me [14:54:56] effie: ^ [14:55:24] as claim said, it is actually greek [14:56:10] but I am not sure why this was announced here [14:57:59] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2020 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [14:59:56] (03PS1) 10Andrew Bogott: wmcs-image-create: allow specifying an upstream image file [puppet] - 10https://gerrit.wikimedia.org/r/934556 [15:01:49] (03PS1) 10Effie Mouzeli: (WIP) hieradata: Add profile::lvs::realserver to kubestagemaster (#2) [puppet] - 10https://gerrit.wikimedia.org/r/934557 (https://phabricator.wikimedia.org/T329827) [15:02:43] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:03:25] 10SRE, 10CFSSL-PKI, 10Infrastructure-Foundations, 10Patch-For-Review, 10Puppet (Puppet 7.0): PKI: add the new puppet CA to the pki infrastructre - 
https://phabricator.wikimedia.org/T340557 (10jbond) > Looking at the certificates the only difference (i could spot) is that the node the following Authority... [15:04:29] (03PS1) 10Effie Mouzeli: (WIP) hieradata: add control plane nodes for kubestagemaster (#3) [puppet] - 10https://gerrit.wikimedia.org/r/934558 (https://phabricator.wikimedia.org/T329827) [15:09:52] (03PS1) 10Ilias Sarantopoulos: ml-services: Deploy langid service [deployment-charts] - 10https://gerrit.wikimedia.org/r/934559 (https://phabricator.wikimedia.org/T340507) [15:15:22] (03PS2) 10Ilias Sarantopoulos: ml-services: Deploy langid service [deployment-charts] - 10https://gerrit.wikimedia.org/r/934559 (https://phabricator.wikimedia.org/T340507) [15:16:30] (03CR) 10Elukey: [C: 03+1] ml-services: Deploy langid service [deployment-charts] - 10https://gerrit.wikimedia.org/r/934559 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [15:18:22] (03CR) 10Ilias Sarantopoulos: [C: 03+2] ml-services: Deploy langid service [deployment-charts] - 10https://gerrit.wikimedia.org/r/934559 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [15:18:46] (03CR) 10Dzahn: [C: 03+2] "thanks for this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/934543 (owner: 10Lucas Werkmeister (WMDE)) [15:19:24] (03Merged) 10jenkins-bot: ml-services: Deploy langid service [deployment-charts] - 10https://gerrit.wikimedia.org/r/934559 (https://phabricator.wikimedia.org/T340507) (owner: 10Ilias Sarantopoulos) [15:19:37] (03PS1) 10Elukey: ml-services: update Docker images for ReverRisk model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/934560 [15:20:15] (03PS2) 10Elukey: ml-services: update Docker images for ReverRisk model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/934560 [15:20:38] (03CR) 10David Caro: Allow cloudcumin hosts to connect to wm-bot (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [15:20:42] (03CR) 10David Caro: [C: 03+1] Allow cloudcumin hosts to connect to wm-bot [puppet] - 10https://gerrit.wikimedia.org/r/934309 (https://phabricator.wikimedia.org/T325756) (owner: 10FNegri) [15:21:05] !log isaranto@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [15:21:51] (03CR) 10Dzahn: [C: 03+2] "deployed on prod miscweb ganeti VMs. page is cached but it should show up soon'ish." 
[puppet] - 10https://gerrit.wikimedia.org/r/934543 (owner: 10Lucas Werkmeister (WMDE)) [15:21:53] (03CR) 10AikoChou: [C: 03+1] ml-services: update Docker images for ReverRisk model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/934560 (owner: 10Elukey) [15:22:34] (03CR) 10Elukey: [C: 03+2] ml-services: update Docker images for ReverRisk model servers [deployment-charts] - 10https://gerrit.wikimedia.org/r/934560 (owner: 10Elukey) [15:23:20] (03CR) 10Andrew Bogott: [C: 03+2] wmcs-image-create: allow specifying an upstream image file [puppet] - 10https://gerrit.wikimedia.org/r/934556 (owner: 10Andrew Bogott) [15:28:18] (03PS1) 10AikoChou: ml-services: add readability isvc to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/934562 (https://phabricator.wikimedia.org/T334182) [15:28:39] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2020 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:28:51] PROBLEM - Check systemd state on kubestagemaster1002 is CRITICAL: CRITICAL - degraded: The following units failed: ferm.service https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:30:13] PROBLEM - Check whether ferm is active by checking the default input chain on kubestagemaster1002 is CRITICAL: ERROR ferm input drop default policy not set, ferm might not have been started correctly https://wikitech.wikimedia.org/wiki/Monitoring/check_ferm [15:33:19] PROBLEM - Blazegraph process -wdqs-blazegraph- on wdqs2020 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [15:35:08] !log elukey@deploy1002 helmfile [ml-serve-eqiad] Ran 'sync' command on namespace 'revertrisk' for release 'main' . 
[15:35:33] !log elukey@deploy1002 helmfile [ml-serve-codfw] Ran 'sync' command on namespace 'revertrisk' for release 'main' . [15:38:37] (LogstashKafkaConsumerLag) firing: Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://grafana.wikimedia.org/d/000000484/kafka-consumer-lag?var-cluster=logging-eqiad&var-datasource=eqiad%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:39:55] (03CR) 10Elukey: [C: 03+1] ml-services: add readability isvc to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/934562 (https://phabricator.wikimedia.org/T334182) (owner: 10AikoChou) [15:43:37] (LogstashKafkaConsumerLag) firing: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:50:03] (ProbeDown) firing: (2) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://wikitech.wikimedia.org/wiki/TLS/Runbook#centrallog2002:6514 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/custom&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:50:23] !log jhancock@cumin2002 END (FAIL) - Cookbook sre.hosts.reimage (exit_code=99) for host an-worker1149.eqiad.wmnet with OS bullseye [15:50:29] 10SRE, 10ops-eqiad, 10DC-Ops, 10Data-Platform-SRE: Q3:rack/setup/install an-worker11[49-56] - https://phabricator.wikimedia.org/T327295 (10ops-monitoring-bot) Cookbook cookbooks.sre.hosts.reimage started by jhancock@cumin2002 for host an-worker1149.eqiad.wmnet with OS bullseye executed with errors: - an-wo... 
[15:50:50] (03PS1) 10Cwhite: Logstash: implement availability SLO [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/934453 (https://phabricator.wikimedia.org/T331461) [15:51:18] (03PS2) 10Cwhite: Logstash: implement availability SLO [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/934453 (https://phabricator.wikimedia.org/T331461) [15:51:22] (03CR) 10AikoChou: [C: 03+2] ml-services: add readability isvc to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/934562 (https://phabricator.wikimedia.org/T334182) (owner: 10AikoChou) [15:52:48] (03Merged) 10jenkins-bot: ml-services: add readability isvc to experimental ns [deployment-charts] - 10https://gerrit.wikimedia.org/r/934562 (https://phabricator.wikimedia.org/T334182) (owner: 10AikoChou) [15:52:52] (03PS3) 10Cwhite: Logstash: implement availability SLO [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/934453 (https://phabricator.wikimedia.org/T331461) [15:53:18] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [15:53:32] (03PS4) 10Cwhite: Logstash: implement availability SLO [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/934453 (https://phabricator.wikimedia.org/T331461) [15:53:37] (LogstashKafkaConsumerLag) resolved: (2) Too many messages in kafka logging - https://wikitech.wikimedia.org/wiki/Logstash#Kafka_consumer_lag - https://alerts.wikimedia.org/?q=alertname%3DLogstashKafkaConsumerLag [15:55:03] (ProbeDown) resolved: (3) Service centrallog2002:6514 has failed probes (tcp_rsyslog_receiver_ip4) - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [15:58:18] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST secrets) on k8s-staging@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-staging - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [16:07:46] (03CR) 10Lucas Werkmeister (WMDE): static-codereview: Link docs for finding links (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/934543 (owner: 10Lucas Werkmeister (WMDE)) [16:08:53] (03CR) 10Cwhite: Logstash: implement availability SLO (031 comment) [grafana-grizzly] - 10https://gerrit.wikimedia.org/r/934453 (https://phabricator.wikimedia.org/T331461) (owner: 10Cwhite) [16:09:46] !log aikochou@deploy1002 helmfile [ml-staging-codfw] Ran 'sync' command on namespace 'experimental' for release 'main' . [16:14:17] (03PS1) 10Dzahn: aptrepo: update gitlab-ce & gitlab-runner versions to between 15.11 and 15.12 [puppet] - 10https://gerrit.wikimedia.org/r/934579 (https://phabricator.wikimedia.org/T340839) [16:22:13] (03PS1) 10JMeybohm: mathoid: Enable telemetry [deployment-charts] - 10https://gerrit.wikimedia.org/r/934580 (https://phabricator.wikimedia.org/T300324) [16:23:49] (03CR) 10JMeybohm: [C: 03+2] mathoid: Enable telemetry [deployment-charts] - 10https://gerrit.wikimedia.org/r/934580 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [16:24:34] (03Merged) 10jenkins-bot: mathoid: Enable telemetry [deployment-charts] - 10https://gerrit.wikimedia.org/r/934580 (https://phabricator.wikimedia.org/T300324) (owner: 10JMeybohm) [16:25:19] (03PS1) 10AikoChou: ml-services: increase memory resources for readability isvc [deployment-charts] - 10https://gerrit.wikimedia.org/r/934582 (https://phabricator.wikimedia.org/T334182) [16:25:23] !log jayme@deploy1002 helmfile [staging] START helmfile.d/services/mathoid: apply [16:25:42] !log jayme@deploy1002 helmfile [staging] DONE helmfile.d/services/mathoid: apply [16:25:49] !log jayme@deploy1002 helmfile [codfw] START helmfile.d/services/mathoid: apply [16:26:39] !log jayme@deploy1002 helmfile [codfw] DONE helmfile.d/services/mathoid: apply [16:27:15] !log 
jayme@deploy1002 helmfile [eqiad] START helmfile.d/services/mathoid: apply [16:27:48] !log jayme@deploy1002 helmfile [eqiad] DONE helmfile.d/services/mathoid: apply [16:28:19] (03CR) 10Dzahn: [C: 03+2] aptrepo: update gitlab-ce & gitlab-runner versions to between 15.11 and 15.12 [puppet] - 10https://gerrit.wikimedia.org/r/934579 (https://phabricator.wikimedia.org/T340839) (owner: 10Dzahn) [16:28:29] (03CR) 10LSobanski: [C: 03+1] aptrepo: update gitlab-ce & gitlab-runner versions to between 15.11 and 15.12 [puppet] - 10https://gerrit.wikimedia.org/r/934579 (https://phabricator.wikimedia.org/T340839) (owner: 10Dzahn) [16:30:31] 10SRE-OnFire, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10serviceops, 10Event-Platform: Incident: 2022-12-09 api appserver worker starvation - https://phabricator.wikimedia.org/T324994 (10JArguello-WMF) [16:32:39] 10SRE-OnFire, 10SRE-Sprint-Week-Sustainability-March2023, 10Data Engineering and Event Platform Team, 10Data-Engineering, and 3 others: Uneven CPU throttling of eventgate-analytics under load - https://phabricator.wikimedia.org/T325068 (10JArguello-WMF) [16:32:55] 10SRE, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10Observability-Logging, and 2 others: Integrate Event Platform and ECS logs - https://phabricator.wikimedia.org/T291645 (10JArguello-WMF) [16:33:01] 10SRE, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10observability, and 3 others: Upgrade Kafka to 2.x or 3.x - https://phabricator.wikimedia.org/T300102 (10JArguello-WMF) [16:33:45] 10SRE, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10serviceops, and 2 others: DRY kafka broker declaration in helmfiles - https://phabricator.wikimedia.org/T253058 (10JArguello-WMF) [16:34:29] 10SRE, 10Data Engineering and Event Platform Team, 10Data-Engineering, 10serviceops-radar, 10Event-Platform: Consider Julie for managing Kafka settings, perhaps even integrating with Event Stream Config - 
https://phabricator.wikimedia.org/T276088 (10JArguello-WMF) [16:35:41] 10SRE-tools, 10Infrastructure-Foundations, 10Patch-For-Review, 10cloud-services-team (FY2022/2023-Q4): Allow wmcs cookbooks running on cloudcuminXXXX to write to the SAL - https://phabricator.wikimedia.org/T325756 (10fnegri) The patch https://gerrit.wikimedia.org/r/934555 is a proof-of-concept that only wo... [16:52:35] jouncebot: nowandnext [16:52:35] For the next 14 hour(s) and 7 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230630T0700) [16:52:35] In 14 hour(s) and 7 minute(s): No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20230701T0700) [16:59:23] !log dzahn@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1003.wikimedia.org with reason: security release [17:15:44] (ThanosQueryInstantLatencyHigh) firing: Thanos Query Frontend has high latency for queries. - https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [17:20:44] (ThanosQueryInstantLatencyHigh) resolved: Thanos Query Frontend has high latency for queries. 
- https://wikitech.wikimedia.org/wiki/Thanos#Alerts - https://grafana.wikimedia.org/d/aa7Rx0oMk/thanos-query-frontend - https://alerts.wikimedia.org/?q=alertname%3DThanosQueryInstantLatencyHigh [17:37:11] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:41:55] PROBLEM - Blazegraph process -wdqs-categories- on wdqs2021 is CRITICAL: PROCS CRITICAL: 0 processes with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [17:44:56] 10SRE, 10API Platform, 10Anti-Harassment, 10Cloud-Services, and 18 others: Migrate PipelineLib repos to GitLab - https://phabricator.wikimedia.org/T332953 (10JArguello-WMF) [17:45:02] 10SRE, 10Data Engineering and Event Platform Team, 10Traffic: Add a rolled-up cache_status field to druid webrequest_sampled_128 - https://phabricator.wikimedia.org/T319344 (10JArguello-WMF) [17:46:07] (03PS1) 10Bking: wdqs.data-transfer: reformat using black [cookbooks] - 10https://gerrit.wikimedia.org/r/934595 (https://phabricator.wikimedia.org/T340793) [17:56:45] James_F: any objection to my deploying https://gerrit.wikimedia.org/r/c/mediawiki/extensions/TimedMediaHandler/+/934476/ ? [17:57:30] brennen: Go for it. [17:59:42] 10SRE, 10Data Engineering and Event Platform Team, 10Data Pipelines, 10Traffic-Icebox: Mobile redirects drop provenance parameters - https://phabricator.wikimedia.org/T252227 (10JArguello-WMF) [18:00:22] (03CR) 10TrainBranchBot: [C: 03+2] "Approved by brennen@deploy1002 using scap backport" [extensions/TimedMediaHandler] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934476 (https://phabricator.wikimedia.org/T340816) (owner: 10Jforrester) [18:00:42] (03CR) 10Cory Massaro: [C: 03+1] "I can't +2, but this looks good to me!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/934537 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [18:01:38] Huh, Cory should have C+2. [18:01:47] (03CR) 10Cory Massaro: wikifunctions: Drop all the comments and default values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/934536 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [18:01:52] Did we forget to put him in the gerrit group? [18:04:03] heh [18:05:36] More Phab fun. [18:06:02] I think you should be able to update that yourself [18:06:38] !log dzahn@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1003.wikimedia.org with reason: security release [18:06:57] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:11:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:13:40] (03CR) 10Jforrester: wikifunctions: Drop all the comments and default values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/934536 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [18:15:05] (03PS4) 10Jforrester: wikifunctions: Drop all the comments and default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/934536 (https://phabricator.wikimedia.org/T297314) [18:15:06] (03PS4) 10Jforrester: wikifunctions: Add initial ENV values for orchestrator to talk to wiki and evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/934537 (https://phabricator.wikimedia.org/T297314) [18:15:33] taavi: Aha, apparently yes? Not sure that's right, but hey. 
[18:15:43] PROBLEM - WDQS SPARQL on wdqs1012 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:15:44] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team (Phase λ – Launch): Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (10Arnoldokoth) [18:16:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [18:16:39] (03Merged) 10jenkins-bot: Fix bug in opening dialog [extensions/TimedMediaHandler] (wmf/1.41.0-wmf.15) - 10https://gerrit.wikimedia.org/r/934476 (https://phabricator.wikimedia.org/T340816) (owner: 10Jforrester) [18:16:58] !log brennen@deploy1002 Started scap: Backport for [[gerrit:934476|Fix bug in opening dialog (T340816)]] [18:17:02] T340816: Unable to load media player on mobile devices - https://phabricator.wikimedia.org/T340816 [18:17:13] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team (Phase λ – Launch): Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (10Jdforrester-WMF) Have also added all to https://gerrit.wikimedia.org/r/admin/groups/wmf-deployment,members (except fo... 
[18:17:36] (JobUnavailable) firing: (2) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [18:17:53] (03CR) 10AOkoth: [C: 03+2] admin: Add SSH key for urbanecm [puppet] - 10https://gerrit.wikimedia.org/r/934366 (https://phabricator.wikimedia.org/T340752) (owner: 10Urbanecm) [18:18:36] !log brennen@deploy1002 brennen and jforrester: Backport for [[gerrit:934476|Fix bug in opening dialog (T340816)]] synced to the testservers: mwdebug2002.codfw.wmnet, mwdebug2001.codfw.wmnet, mwdebug1002.eqiad.wmnet, mwdebug1001.eqiad.wmnet [18:18:39] RECOVERY - WDQS SPARQL on wdqs1012 is OK: HTTP OK: HTTP/1.1 200 OK - 689 bytes in 1.082 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:19:41] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Add a production SSH key for urbanecm - https://phabricator.wikimedia.org/T340752 (10Arnoldokoth) @Urbanecm Should be good to go now. [18:19:54] !log dzahn@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab2002.wikimedia.org with reason: security release [18:20:04] i lack an ios device to confirm fix, but at any rate it doesn't seem to break normal media playing. going ahead. [18:20:06] !log upgrading gitlab on gitlab-replica.wikimedia.org [18:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:30] I am not supposed to do this while other deployments are going on [18:20:35] and to check with jouncebot [18:21:11] but it had told me nothing on Friday, so that's going on now but you probably just care about gerrit [18:21:14] mutante: this should be finished momentarily, although i would not expect any conflicts. 
[18:21:25] ack [18:21:42] for quite a while it will be busy making a backup anyways [18:21:45] before it does it [18:21:52] it's a cookbook [18:25:35] !log brennen@deploy1002 Finished scap: Backport for [[gerrit:934476|Fix bug in opening dialog (T340816)]] (duration: 08m 37s) [18:25:40] T340816: Unable to load media player on mobile devices - https://phabricator.wikimedia.org/T340816 [18:34:39] (03CR) 10Jforrester: wikifunctions: Drop all the comments and default values (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/934536 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [18:37:51] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2021 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [18:39:22] 10SRE, 10SRE-Access-Requests, 10User-Urbanecm: Add a production SSH key for urbanecm - https://phabricator.wikimedia.org/T340752 (10Dzahn) a:05Arnoldokoth→03Urbanecm Wanna confirm and resolve? [18:40:53] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for gengh - https://phabricator.wikimedia.org/T340614 (10Dzahn) a:05Arnoldokoth→03gengh Wanna try it out? [18:41:29] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team (Phase λ – Launch): Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (10Dzahn) Access for Geno should work now. Feel free to try it out and let us know. [18:49:44] 10SRE, 10SRE-Access-Requests, 10User-Urbanecm: Add a production SSH key for urbanecm - https://phabricator.wikimedia.org/T340752 (10Urbanecm) 05Open→03Resolved a:05Urbanecm→03Arnoldokoth It works well, thanks! 
[18:51:58] (03CR) 10Cory Massaro: [C: 03+2] wikifunctions: Drop all the comments and default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/934536 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [18:52:02] (03CR) 10Cory Massaro: [C: 03+2] wikifunctions: Add initial ENV values for orchestrator to talk to wiki and evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/934537 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [18:52:59] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to ops (or wmcs-roots) for TheresNoTime - https://phabricator.wikimedia.org/T337829 (10Arnoldokoth) Is this good to go? [18:53:05] (03Merged) 10jenkins-bot: wikifunctions: Drop all the comments and default values [deployment-charts] - 10https://gerrit.wikimedia.org/r/934536 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [18:53:07] (03Merged) 10jenkins-bot: wikifunctions: Add initial ENV values for orchestrator to talk to wiki and evaluator [deployment-charts] - 10https://gerrit.wikimedia.org/r/934537 (https://phabricator.wikimedia.org/T297314) (owner: 10Jforrester) [19:07:38] (03CR) 10AOkoth: [C: 03+1] vrts: replace OTRS in Wikitech monitoring notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/932320 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn) [19:09:01] (03CR) 10AOkoth: [C: 03+1] vrts: replace OTRS in Wikitech monitoring notes URLs (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/932320 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn) [19:09:46] (03CR) 10Dzahn: [C: 03+2] "alright, thanks!" 
[puppet] - 10https://gerrit.wikimedia.org/r/932320 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn) [19:13:01] (03PS1) 10Bking: wdqs.data-transfer: Add more pool options [cookbooks] - 10https://gerrit.wikimedia.org/r/934602 (https://phabricator.wikimedia.org/T340793) [19:13:40] (03CR) 10Dzahn: "thanks, sounds good" [puppet] - 10https://gerrit.wikimedia.org/r/932466 (https://phabricator.wikimedia.org/T337972) (owner: 10JHathaway) [19:14:26] (03CR) 10Dzahn: [C: 03+1] "not deploying on Friday, want to actually test it too, but other than that, good to go" [puppet] - 10https://gerrit.wikimedia.org/r/932440 (https://phabricator.wikimedia.org/T338071) (owner: 10Dzahn) [19:15:19] (03CR) 10CI reject: [V: 04-1] wdqs.data-transfer: Add more pool options [cookbooks] - 10https://gerrit.wikimedia.org/r/934602 (https://phabricator.wikimedia.org/T340793) (owner: 10Bking) [19:25:38] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [19:25:43] !log dzahn@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab2002.wikimedia.org with reason: security release [19:25:58] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [19:26:49] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [19:37:36] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [19:38:16] PROBLEM - mailman list info on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:38:46] PROBLEM - mailman archives on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:42:11] (03CR) 10AOkoth: [C: 03+1] vrts: rename otrs_aliases to vrts_aliases (031 comment) [puppet] 
- 10https://gerrit.wikimedia.org/r/932316 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn) [19:46:08] PROBLEM - mailman list info ssl expiry on lists1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:52:30] RECOVERY - mailman archives on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 50135 bytes in 0.116 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:52:42] RECOVERY - mailman list info ssl expiry on lists1001 is OK: OK - Certificate lists.wikimedia.org will expire on Sat 19 Aug 2023 04:23:22 AM GMT +0000. https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:53:18] RECOVERY - mailman list info on lists1001 is OK: HTTP OK: HTTP/1.1 200 OK - 8646 bytes in 0.308 second response time https://wikitech.wikimedia.org/wiki/Mailman/Monitoring [19:53:26] !log dzahn@cumin1001 START - Cookbook sre.gitlab.upgrade on GitLab host gitlab1004.wikimedia.org with reason: security release [19:55:43] !log please hold code changes and deploys if using gitlab - upgrade in progress [19:55:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:17] (03CR) 10Dzahn: "this does not apply to vrts machines! (/usr/local/bin/otrs_aliases does not exist there), this is a change on the MX server" [puppet] - 10https://gerrit.wikimedia.org/r/932316 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn) [20:04:47] (03CR) 10Dzahn: "yea, this will write to the file /etc/exim4/otrs_emails on mx*. 
not deploying right now :)" [puppet] - 10https://gerrit.wikimedia.org/r/932316 (https://phabricator.wikimedia.org/T280392) (owner: 10Dzahn) [20:18:06] (03PS3) 10Jforrester: Add wikifunctions.org to wgCentralNoticeContentSecurityPolicy [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771622 (https://phabricator.wikimedia.org/T275945) [20:18:08] (03PS3) 10Jforrester: [DNM] Add wikifunctions.org to prod wgLocalVirtualHosts [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771623 (https://phabricator.wikimedia.org/T275945) [20:18:10] (03PS3) 10Jforrester: Add wikifunctions.org to foundationwiki's custom CSP [mediawiki-config] - 10https://gerrit.wikimedia.org/r/771624 [20:18:12] (03PS7) 10Jforrester: Let wikifunctions.org use the Graph system [mediawiki-config] - 10https://gerrit.wikimedia.org/r/740795 [20:18:14] (03PS1) 10Jforrester: Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934630 (https://phabricator.wikimedia.org/T147219) [20:18:16] (03PS1) 10Jforrester: [DNM][WIP] Initial configuration for Wikifunctions.org [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934631 (https://phabricator.wikimedia.org/T275945) [20:18:19] (03PS1) 10Jforrester: [Beta Cluster] Drop duplicate settings now Wikifunctions.org exists [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934632 [20:29:32] !log jforrester@deploy1002 helmfile [staging] START helmfile.d/services/wikifunctions: apply [20:29:34] !log jforrester@deploy1002 helmfile [staging] DONE helmfile.d/services/wikifunctions: apply [20:30:16] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Dzahn) https://wikitech.wikimedia.org/wiki/Contint2001 and https://wikitech.wikimedia.org/wiki/Contint2002 with fingerprint p... 
[20:32:43] (03PS1) 10Dzahn: admin: remove old ssh key from user dzahn [puppet] - 10https://gerrit.wikimedia.org/r/934634 [20:33:38] 10SRE, 10SRE-Access-Requests: Requesting access to RESOURCE for USER[S] - https://phabricator.wikimedia.org/T340890 (10CCoxwell-WMF) [20:35:07] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for cec - https://phabricator.wikimedia.org/T340890 (10Jdforrester-WMF) [20:36:43] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for cec - https://phabricator.wikimedia.org/T340890 (10Jdforrester-WMF) [20:36:58] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for cec - https://phabricator.wikimedia.org/T340890 (10Jdforrester-WMF) Approved from my end. [20:38:06] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Dzahn) https://www.mediawiki.org/wiki/Continuous_integration/Data_center_switch exists but at least one thing is outdated. it... 
[20:45:16] (MediaWikiHighErrorRate) firing: Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://grafana.wikimedia.org/d/000000438/mediawiki-exceptions-alerts?panelId=18&fullscreen&orgId=1&var-datasource=codfw%20prometheus/ops - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:50:16] (MediaWikiHighErrorRate) resolved: (2) Elevated rate of MediaWiki errors - jobrunner - https://wikitech.wikimedia.org/wiki/Application_servers/Runbook - https://alerts.wikimedia.org/?q=alertname%3DMediaWikiHighErrorRate [20:53:15] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team (Phase λ – Launch): Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (10Arnoldokoth) [20:53:17] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for cec - https://phabricator.wikimedia.org/T340890 (10Arnoldokoth) 05Open→03In progress [20:57:30] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [20:57:36] 10SRE, 10SRE-Access-Requests: Requesting access to deployment for cec - https://phabricator.wikimedia.org/T340890 (10Arnoldokoth) [20:57:38] RECOVERY - Blazegraph process -wdqs-blazegraph- on wdqs2020 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9999 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [20:59:06] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [20:59:14] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [21:00:18] (03CR) 10Catrope: [C: 03+1] Follow-up ca3aa70754: Drop 30x30px Notifications icons, unused for 7 years [mediawiki-config] - 10https://gerrit.wikimedia.org/r/934630 (https://phabricator.wikimedia.org/T147219) (owner: 10Jforrester) [21:00:22] (03PS1) 10AOkoth: admin: add cec to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/934635 
(https://phabricator.wikimedia.org/T340890) [21:00:24] !log bking@cumin1001 START - Cookbook sre.wdqs.data-transfer [21:02:00] (03CR) 10Dzahn: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/934635 (https://phabricator.wikimedia.org/T340890) (owner: 10AOkoth) [21:02:36] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:03:12] !log dzahn@cumin1001 END (PASS) - Cookbook sre.gitlab.upgrade (exit_code=0) on GitLab host gitlab1004.wikimedia.org with reason: security release [21:03:32] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for cec - https://phabricator.wikimedia.org/T340890 (10Arnoldokoth) [21:03:44] (03CR) 10AOkoth: [C: 03+2] admin: add cec to deployment group [puppet] - 10https://gerrit.wikimedia.org/r/934635 (https://phabricator.wikimedia.org/T340890) (owner: 10AOkoth) [21:07:29] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for cec - https://phabricator.wikimedia.org/T340890 (10Arnoldokoth) @CCoxwell-WMF This is good now. Feel free to test and resolve. 
[21:07:36] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team (Phase λ – Launch): Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (10Arnoldokoth) [21:07:37] !log debugging a cert issue on pki1001.eqiad [21:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:50] 10SRE, 10SRE-Access-Requests, 10Patch-For-Review: Requesting access to deployment for cec - https://phabricator.wikimedia.org/T340890 (10Arnoldokoth) a:03CCoxwell-WMF [21:12:36] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [21:14:47] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team (Phase λ – Launch): Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (10Dzahn) @Jdforrester-WMF This should be resolved:) [21:25:07] (ProbeDown) firing: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:30:07] (ProbeDown) resolved: Service thumbor:8800 has failed probes (http_thumbor_ip4) - https://wikitech.wikimedia.org/wiki/Runbook#thumbor:8800 - https://grafana.wikimedia.org/d/O0nHhdhnz/network-probes-overview?var-job=probes/service&var-module=All - https://alerts.wikimedia.org/?q=alertname%3DProbeDown [21:39:21] (03PS1) 10Dzahn: microsites: fix quoting style for ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/934637 [21:43:51] (03PS1) 10Dzahn: gerrit: fix quoting for ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/934638 [21:49:13] (03PS1) 10Dzahn: phabricator: 
fix quoting for ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/934639 [21:51:54] (03PS1) 10Dzahn: wikistats: fix quoting for ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/934640 [21:54:55] (03PS1) 10Dzahn: vrts: fix quoting of ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/934641 [21:56:33] (03PS1) 10Dzahn: releases: fix quoting of ensure parameter [puppet] - 10https://gerrit.wikimedia.org/r/934642 [21:58:18] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [21:58:24] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 05s) [22:00:26] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [22:00:31] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 05s) [22:01:10] 10SRE, 10Infrastructure-Foundations, 10Puppet-Core, 10Patch-For-Review: Resource attributes are quoted inconsistently - https://phabricator.wikimedia.org/T91908 (10Dzahn) > If a string is a value from an enumerable set of options, such as present and absent, it SHOULD NOT be enclosed in quotes at all. Tha... [22:02:12] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team (Phase λ – Launch): Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (10Jdforrester-WMF) >>! In T339936#8982927, @Dzahn wrote: > @Jdforrester-WMF This should be resolved:) Thank you! 
[22:02:44] 10SRE, 10SRE-Access-Requests, 10Abstract Wikipedia team (Phase λ – Launch): Please add Abstract Wiki team members to `deployment` prod SRE group - https://phabricator.wikimedia.org/T339936 (10Jdforrester-WMF) 05In progress→03Resolved [22:08:48] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [22:08:49] !log bking@deploy1002 deploy aborted: 0.3.124 (duration: 00m 00s) [22:08:50] !log bking@deploy1002 Started deploy [wdqs/wdqs@dff41b7]: 0.3.124 [22:09:38] !log bking@deploy1002 Finished deploy [wdqs/wdqs@dff41b7]: 0.3.124 (duration: 00m 47s) [22:09:38] RECOVERY - Blazegraph Port for wdqs-blazegraph on wdqs2020 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9999 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:10:08] RECOVERY - Query Service HTTP Port on wdqs2020 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [22:10:12] RECOVERY - WDQS SPARQL on wdqs2020 is OK: HTTP OK: HTTP/1.1 200 OK - 690 bytes in 1.253 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:10:50] RECOVERY - Blazegraph process -wdqs-categories- on wdqs2020 is OK: PROCS OK: 1 process with UID = 498 (blazegraph), regex args ^java .* --port 9990 .* blazegraph-service-.*war https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:11:10] RECOVERY - Blazegraph Port for wdqs-categories on wdqs2020 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 9990 https://wikitech.wikimedia.org/wiki/Wikidata_query_service/Runbook [22:12:36] (JobUnavailable) firing: (3) Reduced availability for job cloud_dev_pdns in ops@codfw - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets - https://alerts.wikimedia.org/?q=alertname%3DJobUnavailable [22:15:27] (03PS2) 10Bking: wdqs.data-transfer: Add more pool options [cookbooks] - 
10https://gerrit.wikimedia.org/r/934602 (https://phabricator.wikimedia.org/T340793) [22:18:37] (03CR) 10CI reject: [V: 04-1] wdqs.data-transfer: Add more pool options [cookbooks] - 10https://gerrit.wikimedia.org/r/934602 (https://phabricator.wikimedia.org/T340793) (owner: 10Bking) [22:19:10] !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2022.* [22:19:14] !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2021.* [22:19:19] !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2020.* [22:19:25] !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2019.* [22:19:29] !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2018.* [22:19:33] !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2017.* [22:19:38] !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2016.* [22:20:01] !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2015.* [22:20:05] !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2014.* [22:20:20] !log ryankemper@puppetmaster1001 conftool action : set/weight=0:pooled=inactive; selector: name=wdqs2013.* [22:21:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:26:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - 
https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:27:48] (03CR) 10Dzahn: planet: restrict firewall source range for port 443 to envoy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/924604 (owner: 10Dzahn) [22:28:20] (03Abandoned) 10Dzahn: planet: restrict firewall source range for port 443 to envoy [puppet] - 10https://gerrit.wikimedia.org/r/924604 (owner: 10Dzahn) [22:34:19] !log bking@cumin1001 END (FAIL) - Cookbook sre.wdqs.data-transfer (exit_code=99) [22:36:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [22:37:20] 10SRE, 10Continuous-Integration-Infrastructure, 10serviceops-collab, 10Patch-For-Review: contint2002 service implementation tracking - https://phabricator.wikimedia.org/T324659 (10Dzahn) a:05Dzahn→03None [22:41:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:21:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:26:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - 
https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:36:38] (KubernetesAPILatency) firing: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency [23:41:38] (KubernetesAPILatency) resolved: High Kubernetes API latency (LIST services) on k8s-mlserve@codfw - https://wikitech.wikimedia.org/wiki/Kubernetes - https://grafana.wikimedia.org/d/000000435?var-site=codfw&var-cluster=k8s-mlserve - https://alerts.wikimedia.org/?q=alertname%3DKubernetesAPILatency