[00:25:40] RESOLVED: VarnishPrometheusExporterDown: Varnish Exporter on instance cp3073:9331 is unreachable - https://wikitech.wikimedia.org/wiki/Prometheus#Prometheus_job_unavailable - https://grafana.wikimedia.org/d/000000304/varnish-dc-stats?viewPanel=17 - https://alerts.wikimedia.org/?q=alertname%3DVarnishPrometheusExporterDown
[00:26:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 185.15.59.224:443 @ cp3073 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=esams&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[00:28:25] FIRING: [3x] SystemdUnitFailed: haproxy.service on cp3073:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[00:28:43] RESOLVED: [2x] HaproxyKafkaExporterDown: HaproxyKafka on cp3073 is down - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaExporterDown - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=esams&var-instance=cp3073 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaExporterDown
[00:31:38] RESOLVED: [4x] LVSRealserverMSS: Unexpected MSS value on 185.15.59.224:443 @ cp3073 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=esams&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[00:32:00] FIRING: PurgedHighEventLag: High event process lag with purged on cp3073:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=esams%20prometheus/ops&var-instance=cp3073 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighEventLag
[00:32:01] FIRING: PurgedHighBacklogQueue: Large backlog queue for purged on cp3073:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://grafana.wikimedia.org/d/RvscY1CZk/purged?var-datasource=esams%20prometheus/ops&var-instance=cp3073 - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[00:32:43] FIRING: HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp3073 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=esams&var-instance=cp3073 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[00:38:25] RESOLVED: SystemdUnitFailed: wmf_auto_restart_prometheus-trafficserver-backend-exporter.service on cp4052:9100 - https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state - https://grafana.wikimedia.org/d/g-AaZRFWk/systemd-status - https://alerts.wikimedia.org/?q=alertname%3DSystemdUnitFailed
[03:36:20] netops, Traffic, Infrastructure-Foundations: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11296819 (Papaul) @ssingh @Vgutierrez hello just checking in to see if you have a day and time for this for drmrs. Thanks
[04:18:24] Traffic, serviceops: X-Request-Id response header off by 5000 - https://phabricator.wikimedia.org/T407826#11296868 (tstarling) In progress→Resolved Thanks everyone. I guess it's resolved? Please reopen if there's something left to do.
[04:32:43] FIRING: HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp3073 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=esams&var-instance=cp3073 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[04:42:43] RESOLVED: [2x] HaproxyKafkaNoMessages: Unexpected rate of produced HaproxyKafka messages by cp3073 - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaNoMessages - https://grafana.wikimedia.org/d/d3e4e37c-c1d9-47af-9aad-a08dae2b3fd5/haproxykafka?orgId=1&var-site=esams&var-instance=cp3073 - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaNoMessages
[06:34:40] FIRING: VarnishHighThreadCount: Varnish's thread count on cp5028:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://grafana.wikimedia.org/d/wiU3SdEWk/cache-host-drilldown?viewPanel=99&var-site=eqsin&var-instance=cp5028 - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[06:38:00] FIRING: [5x] PurgedHighBacklogQueue: Large backlog queue for purged on cp5026:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[06:39:40] FIRING: [6x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[06:43:00] FIRING: [6x] PurgedHighBacklogQueue: Large backlog queue for purged on cp5026:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[06:44:40] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[06:48:00] FIRING: [6x] PurgedHighBacklogQueue: Large backlog queue for purged on cp5026:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[06:49:40] FIRING: [8x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[06:53:00] FIRING: [7x] PurgedHighBacklogQueue: Large backlog queue for purged on cp5026:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[07:03:00] RESOLVED: [7x] PurgedHighBacklogQueue: Large backlog queue for purged on cp5027:2112 - https://wikitech.wikimedia.org/wiki/Purged#Alerts - https://alerts.wikimedia.org/?q=alertname%3DPurgedHighBacklogQueue
[07:30:28] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: PeeringBGPDown (instance cr2-drmrs:9804) - https://phabricator.wikimedia.org/T407945 (LSobanski) NEW
[07:30:49] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T407946 (LSobanski) NEW
[07:34:40] FIRING: [9x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[07:39:40] FIRING: [10x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[07:44:08] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: PeeringBGPDown (instance cr2-drmrs:9804) - https://phabricator.wikimedia.org/T407945#11297243 (cmooney) Open→Resolved a:cmooney There are other peers to that ASN, these not establishing. Removed.
[07:44:55] netops, Infrastructure-Foundations, sre-alert-triage: Alert in need of triage: PeeringBGPDown (instance cr3-eqsin:9804) - https://phabricator.wikimedia.org/T407946#11297248 (cmooney) Open→Resolved a:cmooney There are other sessions to that ASN but they have not configured these two....
[07:49:40] FIRING: [12x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[07:54:40] FIRING: [14x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[07:59:40] FIRING: [13x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[08:09:40] FIRING: [9x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[08:14:40] RESOLVED: [7x] VarnishHighThreadCount: Varnish's thread count on cp5025:0 is high - https://wikitech.wikimedia.org/wiki/Varnish - https://alerts.wikimedia.org/?q=alertname%3DVarnishHighThreadCount
[12:52:38] FIRING: [4x] LVSRealserverMSS: Unexpected MSS value on 208.80.153.224:443 @ cp2027 - https://wikitech.wikimedia.org/wiki/LVS#LVSRealserverMSS_alert - https://grafana.wikimedia.org/d/Y9-MQxNSk/ipip-encapsulated-services?orgId=1&viewPanel=2&var-site=codfw&var-cluster=cache_text - https://alerts.wikimedia.org/?q=alertname%3DLVSRealserverMSS
[12:52:44] ok silencing
[13:27:25] netops, Infrastructure-Foundations, SRE, Patch-For-Review: Nokia: add new switches in eqiad/codfw to monitoring and make 'active' - https://phabricator.wikimedia.org/T405558#11298210 (ops-monitoring-bot) Icinga downtime and Alertmanager silence (ID=3ced65be-cbbb-4ba9-91b3-b0f2c626ba79) set by cmo...
[14:44:11] Traffic, DC-Ops, ops-eqiad, SRE: eqiad row C/D Traffic host migrations - https://phabricator.wikimedia.org/T405623#11298598 (RobH) Please note this migration has shifted from Oct 15th start date to Nov 1 start date.
[14:45:27] netops, Traffic, Infrastructure-Foundations: Upgrade End Of Support Junos - https://phabricator.wikimedia.org/T390813#11298623 (ssingh) >>! In T390813#11296819, @Papaul wrote: > @ssingh @Vgutierrez hello just checking in to see if you have a day and time for this for drmrs. > Thanks Hi @Papaul. Wha...
[14:46:52] Traffic, Beta-Cluster-Infrastructure: Puppet agent failure detected on instance deployment-cache-upload08 in project deployment-prep - https://phabricator.wikimedia.org/T407966#11298627 (dancy)
[14:53:22] I'm working on an alert for cp* hosts that have been depooled for an extended period of time. (Phab task for reference: https://phabricator.wikimedia.org/T406641) However, I don't know where this should run, and searching Wikitech hasn't helped with that. Does anyone have any suggestions as to where this should run or any docs that I can read to help with this decision? I'd appreciate any help y'all can provide
[14:55:57] Traffic, Beta-Cluster-Infrastructure: Puppet agent failure detected on instance deployment-cache-upload08 in project deployment-prep - https://phabricator.wikimedia.org/T407966#11298664 (ssingh) This is because of: ` + # lint:ignore:puppet_url_without_modules + file { '/etc/varnish/browser-detecti...
[14:59:56] ChrisDobbins901_: it's basically going to be a python script, or tantamount to that? an existing host with conftool installed is probably easiest -- like maybe the puppetservers
[15:01:29] Traffic, Beta-Cluster-Infrastructure: Puppet agent failure detected on instance deployment-cache-upload08 in project deployment-prep - https://phabricator.wikimedia.org/T407966#11298701 (dancy) Probably introduced via https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197986
[15:01:52] Traffic, Beta-Cluster-Infrastructure: Puppet agent failure detected on instance deployment-cache-upload08 in project deployment-prep - https://phabricator.wikimedia.org/T407966#11298703 (dancy)
[15:01:54] Traffic, Hiddenparma, SRE: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11298704 (dancy)
[15:02:58] Traffic, Hiddenparma, SRE: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11298723 (dancy) https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197986 has caused puppet to break on `deployment-cache-upload08.deployment-prep`. Please help!
[15:03:28] Traffic, Beta-Cluster-Infrastructure: Puppet agent failure detected on instance deployment-cache-upload08 in project deployment-prep - https://phabricator.wikimedia.org/T407966#11298728 (dancy) I asked for help in T404826
[15:08:25] Traffic, Hiddenparma, SRE: Integrate code from the private repository into the CDN - https://phabricator.wikimedia.org/T404826#11298750 (ssingh) >>! In T404826#11298704, @dancy wrote: > https://gerrit.wikimedia.org/r/c/operations/puppet/+/1197986 has caused puppet to break on `deployment-cache-upload...
[15:53:59] cdanis: yep, a python script. I considered the puppetservers, but wasn't sure if that would be appropriate
[15:54:21] seems fine to me :)
[15:54:36] awesome. thank you!
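Editor's sketch of the kind of check discussed above (an alert for cp* hosts that have been depooled for an extended period). This is not the script from T406641. It assumes the depool timestamps have already been collected somehow and deliberately does not call conftool. The function name, the input shape, the 7-day threshold, and the host names are all hypothetical.

```python
from datetime import datetime, timedelta, timezone


def long_depooled(depooled_since, now, max_age=timedelta(days=7)):
    """Return hosts depooled for longer than max_age, longest-depooled first.

    depooled_since maps a host name to the (timezone-aware) time it was
    depooled. How that mapping gets built is out of scope for this sketch.
    """
    stale = [
        (host, now - since)
        for host, since in depooled_since.items()
        if now - since > max_age
    ]
    # Sort by how long each host has been out of rotation, descending.
    stale.sort(key=lambda item: item[1], reverse=True)
    return [host for host, _ in stale]


if __name__ == "__main__":
    # Made-up sample data with an arbitrary reference time.
    now = datetime(2025, 1, 31, tzinfo=timezone.utc)
    sample = {
        "cp-example-a": now - timedelta(days=10),
        "cp-example-b": now - timedelta(hours=3),
    }
    print(long_depooled(sample, now))
```

Keeping the comparison in a pure function means the threshold logic can be tested without access to any production host; the part that actually reads pool state would live wherever conftool is available, as suggested in the conversation.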
[15:55:30] ChrisDobbins901_: oh actually, probably the right profile to do it in is the same as this patch: https://gerrit.wikimedia.org/r/c/operations/puppet/+/1191483
[15:56:26] 💙cdanis@cumin1003.eqiad.wmnet ~ 🕚☕ sudo cumin P:conftool::state
[15:56:28] 12 hosts will be targeted:
[15:56:30] config-master2001.codfw.wmnet,config-master1001.eqiad.wmnet,puppetmaster[2001-2002].codfw.wmnet,puppetmaster[1001,1003].eqiad.wmnet,puppetserver[2001-2002,2004].codfw.wmnet,puppetserver[1001-1003].eqiad.wmnet
[15:57:07] wow, that's super helpful. I really appreciate that :D
[16:08:38] cdanis: Your PS1 has emoji? :D
[16:09:16] does the coffee or heart change? Maybe a broken heart when you're root? :P
[16:09:36] no, it's a broken heart when the exit code of the previous process was nonzero
[16:09:44] oh and the beverage and the clock change with the time of day
[16:10:13] ha, that's fun
[16:12:00] oh and I forgot I did this for root
[16:12:03] 💙root@cumin1003.eqiad.wmnet ~ 🕛🙃
[16:12:07] background is red as well
[16:15:19] Love it!
[16:16:27] <3
[16:45:58] that's really cool!
[16:49:54] mine is pretty boring - just changes the dollar to a red hash
[16:57:23] Traffic: [Update DNS Record Request] - wikimedia.org - https://phabricator.wikimedia.org/T408003 (JKelsoteel-WMF) NEW
[17:37:17] Traffic: [Update DNS Record Request] - wikimedia.org - https://phabricator.wikimedia.org/T408003#11299592 (ssingh) a:BCornwall
[17:56:27] Traffic, Hiddenparma, Patch-For-Review: Introduce known-client identity objects and integrate with requestctl - https://phabricator.wikimedia.org/T403220#11299688 (Scott_French)
[18:03:27] Domains, Traffic: URL can use another script - https://phabricator.wikimedia.org/T32766#11299727 (BCornwall) WMF utilizes Markmonitor's trademark protection offerings, including GlobalBlock, which provides protections against homograph attacks. For instance: replacing the first `i` in wikipedia.org with...
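Editor's sketch of the prompt behaviour described in the PS1 exchange above: a heart that breaks when the previous exit code is nonzero, plus a clock and a beverage that follow the time of day. This is not cdanis's actual shell configuration. The function names, the noon cutoff, and the afternoon tea emoji are assumptions made for illustration, and the root variant (red background, different trailing emoji) is not modelled.

```python
def clock_emoji(hour):
    """Map an hour (0-23) to the matching on-the-hour clock-face emoji."""
    # U+1F550 is one o'clock; the twelve faces run contiguously to U+1F55B.
    twelve_hour = hour % 12 or 12
    return chr(0x1F550 + twelve_hour - 1)


def build_prompt(user, host, cwd, last_exit, hour):
    """Build a prompt string from the previous exit code and the local hour."""
    # Blue heart on success, broken heart on a nonzero exit code (per the log).
    heart = "\U0001F499" if last_exit == 0 else "\U0001F494"
    # Assumption: the log only shows coffee; the cutoff and the tea are invented.
    drink = "\u2615" if hour < 12 else "\U0001F375"
    return f"{heart}{user}@{host} {cwd} {clock_emoji(hour)}{drink} "


if __name__ == "__main__":
    print(build_prompt("cdanis", "cumin1003.eqiad.wmnet", "~", 0, 11))
    print(build_prompt("cdanis", "cumin1003.eqiad.wmnet", "~", 1, 16))
```

In a real shell the same idea would typically be wired up through a hook that runs before each prompt and captures the last exit status first; the Python form is used here only to make the mapping easy to read and test.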
[18:37:26] Domains, Traffic: URL can use another script - https://phabricator.wikimedia.org/T32766#11299854 (Dzahn) @BCornwall I feel like this ticket is partially a MediaWiki feature request. Might disagree with Pppery removing the tag indicating that. But not entirely sure. It seems likely that this is rejected...
[18:52:38] Traffic, Hiddenparma, Patch-For-Review: Introduce known-client identity objects and integrate with requestctl - https://phabricator.wikimedia.org/T403220#11299896 (Scott_French)
[19:10:19] Traffic, Beta-Cluster-Infrastructure, Patch-For-Review: Puppet agent failure detected on instance deployment-cache-upload08 in project deployment-prep - https://phabricator.wikimedia.org/T407966#11299949 (ssingh) Open→Resolved a:ssingh Sorry about this, this should now be fixed. And g...
[19:27:28] Traffic, Beta-Cluster-Infrastructure, Patch-For-Review: Puppet agent failure detected on instance deployment-cache-upload08 in project deployment-prep - https://phabricator.wikimedia.org/T407966#11300004 (dancy) Thanks @ssingh !
[20:55:43] FIRING: [2x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[21:05:43] RESOLVED: [2x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[22:53:43] FIRING: [2x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[22:58:43] FIRING: [20x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[23:03:43] FIRING: [23x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[23:08:43] FIRING: [23x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages
[23:13:43] RESOLVED: [23x] HaproxyKafkaSocketDroppedMessages: Sustained high rate of dropped messages from HaproxyKafka - https://wikitech.wikimedia.org/wiki/HAProxyKafka#HaproxyKafkaSocketDroppedMessages - https://alerts.wikimedia.org/?q=alertname%3DHaproxyKafkaSocketDroppedMessages