User Details
- User Since
- Aug 21 2018, 6:05 PM (352 w, 19 h)
- Availability
- Available
- LDAP User
- Cwhite
- MediaWiki User
- CWhite (WMF)
Tue, May 13
Calling this one done!
Dashboard looks migrated thanks to @andrea.denisse!
I'm going to call this resolved as we have a few methods available now.
Checking back on this, it seems the situation hasn't improved much.
I imported and modified a dashboard to help us see what's going on. It shows that up until a couple hours ago, we were still at 46k unique timeseries.
Thu, May 8
The "generated" field is is now calculated from meta.dt - age and stored.
Fri, May 2
Now that Graphite is fully deprecated, I don't think we need this library anymore.
Thu, Apr 24
The fields explosion event that prompted this task substantially slowed ingest across all topics. Given that, I do not think splitting indexes on the OpenSearch side would have isolated the problem logs from the rest of the log streams.
Another option is partitioning the data into more indexes to reduce index size.
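As a hedged illustration of that option (the routing scheme, index naming, and meta.topic field here are hypothetical, not what is deployed), partitioning could route each log to a smaller per-topic, per-day index:

```
from datetime import date

def target_index(event: dict) -> str:
    # Keep a fields explosion in one topic contained by giving each topic
    # its own daily index. "logstash-<topic>-<date>" is illustrative only.
    topic = event.get("meta", {}).get("topic", "default")
    return f"logstash-{topic}-{date.today().isoformat()}"
```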
Apr 17 2025
The feature has landed. Thanks, all!
Apr 16 2025
Update: we saw an immediate ingest improvement after removing the out_request and outRequest fields, which were generated by mobileapps. We should watch for more inexplicable dips in throughput over the coming days in case this was an incomplete mitigation.
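For context, the mitigation amounts to dropping those two fields before events are indexed. The real filtering happens in the Logstash/OpenSearch pipeline, so this Python sketch is conceptual only:

```
PROBLEM_FIELDS = ("out_request", "outRequest")

def strip_problem_fields(event: dict) -> dict:
    # Drop the mobileapps-generated fields named above before the event
    # reaches the indexer, so they can't contribute new field mappings.
    for field in PROBLEM_FIELDS:
        event.pop(field, None)
    return event
```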
Apr 15 2025
@fgiunchedi noted
Apr 14 2025
Affected dashboards: https://grafana.wikimedia.org/d/K6DEOo5Ik/grafana-graphite-datasource-utilization
Apr 9 2025
The final step in the migration is to update the dashboards that use these metrics so they read the Prometheus metrics provided by the Thanos datasource instead. I found these dashboards still referencing Graphite for this data (a rough audit sketch follows the list):
- Wikidata Quality Constraints (WBQC) https://grafana.wikimedia.org/d/000000344
- Ladsgroup-test https://grafana.wikimedia.org/d/000000378
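For anyone repeating this audit, here is a rough sketch using the Grafana HTTP API; the API token is a placeholder, and the panel traversal is simplified (it ignores nested rows), so treat it as a starting point rather than the exact method used:

```
import requests

GRAFANA = "https://grafana.wikimedia.org"
HEADERS = {"Authorization": "Bearer <api-token>"}  # placeholder, not a real token

def dashboards_referencing_graphite(query=""):
    # Search dashboards, then fetch each one and flag any panel whose
    # datasource mentions Graphite.
    hits = requests.get(f"{GRAFANA}/api/search",
                        params={"query": query, "type": "dash-db"},
                        headers=HEADERS, timeout=30).json()
    for hit in hits:
        dash = requests.get(f"{GRAFANA}/api/dashboards/uid/{hit['uid']}",
                            headers=HEADERS, timeout=30).json()["dashboard"]
        if any("graphite" in str(p.get("datasource", "")).lower()
               for p in dash.get("panels", [])):
            yield hit["uid"], dash.get("title")
```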
A couple points for awareness as you plan the project:
Apr 3 2025
We have alerting now and we know a simple restart of statsv brings it back. Optimistically closing.
Apr 2 2025
We do a lot of filtering on the legacy pipeline for fields that cause problems. All filtering we do there is manual and changes regularly.
Mar 28 2025
We've rolled out a Logstash filter to check for the name KafkaSSE and cast the assignments field to a string. This can be undone when it is no longer needed.
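A conceptual sketch of what that filter does (the actual implementation is a Logstash filter; the field names come from the comment above, and the use of JSON serialization and the mapping-conflict rationale are assumptions):

```
import json

def normalize_kafkasse(event: dict) -> dict:
    # If the log's name is KafkaSSE, cast its assignments field to a string
    # (presumably so it can't be mapped as an object and conflict with
    # other documents in the same index).
    if event.get("name") == "KafkaSSE" and "assignments" in event:
        event["assignments"] = json.dumps(event["assignments"])
    return event
```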
Checking back today, istio-ingressgateway is logging about 1000 logs/sec.
Mar 27 2025
I think it's because the object is the second argument rather than the measurement.
Log volume is down quite a bit since raising the production-ratelimit log level filter. Will check back in on this in a few days to re-evaluate in case there are other applications doing the same thing.
First suspect is kubernetes.container_name:"production-ratelimit" generating debug logs at around 1500/sec.
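For context, "raising the log level filter" amounts to dropping sub-threshold events for that container. A minimal Python sketch of the idea (the real filter lives in Logstash; the level ordering and field names are assumptions):

```
LEVELS = {"trace": 0, "debug": 1, "info": 2, "warn": 3, "error": 4}

def keep_event(event: dict, min_level: str = "info") -> bool:
    # Drop sub-threshold (e.g. debug) logs from the production-ratelimit
    # container; everything else passes through untouched.
    container = event.get("kubernetes", {}).get("container_name")
    if container != "production-ratelimit":
        return True
    level = str(event.get("level", "info")).lower()
    return LEVELS.get(level, LEVELS["info"]) >= LEVELS[min_level]
```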
Mar 21 2025
Dashboards migrated
Mar 20 2025
From around the same time:
Mar 19 11:45:21 webperf1003 python3[3604232]: Process Process-1:
Mar 19 11:45:21 webperf1003 python3[3604232]: Traceback (most recent call last):
Mar 19 11:45:21 webperf1003 python3[3604232]:   File "/usr/lib/python3.9/multiprocessing/process.py", line 315, in _bootstrap
Mar 19 11:45:21 webperf1003 python3[3604232]:     self.run()
Mar 19 11:45:21 webperf1003 python3[3604232]:   File "/usr/lib/python3.9/multiprocessing/process.py", line 108, in run
Mar 19 11:45:21 webperf1003 python3[3604232]:     self._target(*self._args, **self._kwargs)
Mar 19 11:45:21 webperf1003 python3[3604232]:   File "/srv/deployment/statsv/statsv/statsv.py", line 268, in process_queue
Mar 19 11:45:21 webperf1003 python3[3604232]:     emit(sock, statsd_addr, statsd_message)
Mar 19 11:45:21 webperf1003 python3[3604232]:   File "/srv/deployment/statsv/statsv/statsv.py", line 195, in emit
Mar 19 11:45:21 webperf1003 python3[3604232]:     sock.sendto(payload.encode('utf-8'), addr)
Mar 19 11:45:21 webperf1003 python3[3604232]: socket.gaierror: [Errno -2] Name or service not known
It seems the statsv process wedged itself. After restarting the process, metrics are now flowing again.
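The traceback suggests a transient name-resolution failure inside emit() took the worker down. As a hedged sketch only (the argument order mirrors the traceback, but the drop-and-log behaviour is an assumption, not what statsv currently does), the send could be made to survive that error:

```
import logging
import socket

def emit_safe(sock, addr, payload: str) -> bool:
    # Send a statsd datagram but survive transient name-resolution failures
    # (socket.gaierror) instead of letting the worker process die.
    try:
        sock.sendto(payload.encode("utf-8"), addr)
        return True
    except socket.gaierror as exc:
        logging.warning("statsd emit failed (%s); dropping %r", exc, payload)
        return False
```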
Mar 18 2025
Config deployed!
Mar 14 2025
Dashboards migrated
Mar 13 2025
Dashboards migrated
Dashboards migrated.
Mar 12 2025
Linking this task here in case it helps with the investigation: T385058: logstash.rb uses deprecated Socket.gethostbyname
Mar 5 2025
Restbase metrics are now being ingested by Prometheus.
Feb 25 2025
The change is deployed. I'm not sure there is anything left to do, but optimistically resolving.
Feb 10 2025
The rest of the metrics on the dashboard appear to be generated by Airflow.