A robust product requires a robust telemetry system. The more information we have, the better, whether it's about service status, performance, traffic, or the data we are moving between our services.
This post is about client telemetry. I'm going to cover some common problems we can face if we don't have a robust telemetry system, or if we don't have one at all.
As with any architecture design, there's no single solution here: the right telemetry setup depends on your current architecture and the size of your product. Keep that in mind to avoid overengineering when a simpler, still effective solution would do.
Common problems
- No telemetry system at all: this is the worst-case scenario, where we are completely blind, with no way to identify client issues or get control over behavioural problems. We may also want to monitor new features to understand whether they are working as designed. Not having a telemetry system at all leads to wasted time: adding logs we can only visualize locally, and waiting for build times whenever we want to add or mute logs. And all of that only covers a local testing environment, because we still won't see what end users are really doing.
- No control over reported data: in this scenario we already have a telemetry system, but it's a static implementation where we cannot modify the flow of data unless we deploy a new version to production. It can be effective enough, but when an unexpected issue appears we won't be able to solve it quickly, since the only fix is another production deploy (something that can take several days if we are talking about mobile apps).
Solutions
For an open-source solution I recommend the ELK + Filebeat integration: both do a great job, and Kibana provides amazing monitoring features for free.
For paid solutions, Datadog and Loggly are good alternatives.
Below you can find some useful information and considerations for the ELK + Filebeat integration. I was able to run the stack locally very quickly with Docker containers, and from there it can be scaled up easily.
Design for ELK + Filebeat
If you are not familiar with ELK, it's a group of three open-source projects: Elasticsearch, Logstash and Kibana.
It's a great combination for log management. Logstash is the first layer: it ingests data from multiple sources and provides a data processing pipeline where we can filter and transform data before sending it to Elasticsearch.
Elasticsearch is one of the most popular search engines available at the moment, with features like scalable and near real-time search, among other advantages for big data processing and scalability.
Finally, Kibana is the visualization layer on top of Elasticsearch, providing dashboards, graphs and charts.
Since Logstash runs on the JVM, deploying it on every server for log collection and data processing leads to significant memory consumption.
Filebeat is a more lightweight alternative for data extraction: it's written in Go and has a low memory footprint. It also provides features like back-pressure handling and detecting JSON nested in a message.
Filebeat is one of the several Beats available; in this case it will collect logs from files, but there are other alternatives like Packetbeat (network metrics) or Metricbeat (server metrics). More info here: https://www.elastic.co/guide/en/beats/libbeat/current/beats-reference.html
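As an illustration of the nested-JSON feature, the log input can be told to decode JSON lines directly. The sketch below is based on the Filebeat 6.x configuration options; check the reference for your version before relying on it:

```yaml
filebeat.inputs:
- type: log
  paths:
    - /logs/*.log
  # Decode each log line as a JSON object and place its keys at the
  # root of the event instead of under a "json" key.
  json.keys_under_root: true
  # Add an error key when JSON decoding fails, so bad lines stay visible.
  json.add_error_key: true
```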
Setup ELK
I recommend going through the ELK documentation for all the details and considerations required for a production environment: https://www.elastic.co/start
I found this GitHub project very useful for a development environment: https://github.com/deviantony/docker-elk
It's an ELK stack ready for Docker. Just run the Docker Compose file on your server and it will be up and running with no extra effort. You'll only need some extra work for scalable solutions like Kubernetes or Rancher.
Filebeat integration:
1. Modify `logstash/pipeline/logstash.conf` and include the following filter:

```conf
filter {
  mutate {
    rename => ["host", "hostname"]
    convert => {"hostname" => "string"}
  }
}
```

Otherwise, Logstash won't be able to process data coming from Filebeat.
2. The Logstash service exposes port 5044 for incoming Filebeat data; make sure you are sending Filebeat data to the right server and port.
For local testing purposes you can add the following services to the existing Docker Compose file:

```yaml
my-service:
  image: my-service:1.0
  ports:
    - "8080:8080"
  volumes:
    - ./logs/:/var/log/my-service/
  networks:
    - elk

filebeat:
  image: docker.elastic.co/beats/filebeat:6.5.1
  depends_on:
    - elasticsearch
  volumes:
    - ./config/filebeat.yml:/usr/share/filebeat/filebeat.yml
    - ./logs/:/logs/
  networks:
    - elk
```
Ensure Filebeat is able to read the logs from your service (in this example, both share the same folder from the host).
3. Specify your Filebeat input and output data in `filebeat.yml`:

```yaml
filebeat.inputs:
- type: log
  paths:
    - /logs/*.log

output.logstash:
  hosts: ["logstash:5044"]
```
In the example above, Filebeat reads all `.log` files under the `/logs/` folder and sends the data to `logstash:5044`, considering both Logstash and Filebeat are running on the same network.
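On the Logstash side, the pieces fit together in a single pipeline file. Here is a sketch of what the complete `logstash/pipeline/logstash.conf` could look like; the docker-elk project ships its own defaults for the input and output sections, so treat this as an illustration rather than a drop-in file:

```conf
input {
  # Listen for incoming Filebeat data on port 5044.
  beats {
    port => 5044
  }
}

filter {
  # The rename discussed above, so Filebeat events can be processed.
  mutate {
    rename => ["host", "hostname"]
    convert => {"hostname" => "string"}
  }
}

output {
  # Forward processed events to Elasticsearch on the same network.
  elasticsearch {
    hosts => ["elasticsearch:9200"]
  }
}
```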
So... how do we collect data from the client?
So far I've covered the log management stack and how to extract and process logs on the server side, but I still need to explain how to collect logs from the client.
We need our own service that works as a middleman between the client and Filebeat + ELK. It's a very simple API that receives logs from the client and writes them to log files. Later on, Filebeat picks those logs up automatically and sends them to Logstash.
Why not send logs to Logstash directly?
Hitting the Logstash layer directly from the client side is probably not the best idea. These are some reasons why we would prefer an extra layer:
- We would be eliminating Filebeat, which means increasing the memory footprint and network traffic on Logstash.
- If clients suddenly start reporting more logs than usual (e.g. an unexpected error causing a peak), we can overload Logstash and stop receiving logs from all our users. Even though Logstash can scale up to prevent this, we'd need to be prepared for that situation, which could mean a higher cost to maintain all the different instances. The extra layer mentioned above is a much more lightweight service that only writes logs to files; Filebeat then ships them to Logstash asynchronously.
- Log format and customization: we may want to customize our log format and content per app or version. We can build that logic into our own service instead of trying to create custom filters and configs for Logstash. We also get the chance to create different services per app and distribute the load accordingly, without touching Logstash at all.
A very simple approach for a Golang service, using Logrus and Gorilla Mux:

```go
var clientLog *logging.Writer

func main() {
	// We use this logger only for logs coming from the client; for everything
	// else we can use the global logger, which prints to stdout by default.
	clientLog = logging.CreateWriter("/var/log/my-service/my-service.log")

	r := mux.NewRouter()
	r.HandleFunc("/process", Process).Methods(http.MethodPost)

	srv := &http.Server{
		Handler: r,
		Addr:    ":8080",
		// Good practice: enforce timeouts for servers you create!
		WriteTimeout: 15 * time.Second,
		ReadTimeout:  15 * time.Second,
	}

	log.Fatal(srv.ListenAndServe())
}

func Process(w http.ResponseWriter, r *http.Request) {
	logs := ... // read data
	for _, el := range logs {
		clientLog.Write(el)
	}
	... // return successful response
}
```
The service is expected to receive multiple logs in a single request, which helps performance and reduces network traffic. Data compression is also recommended: since we can send logs very often, and some of them carry a lot of useful information to analyse, compressing the payload reduces network traffic and lowers the size of the requests coming into the service.
What else?
I've mentioned the problem of a static client implementation, meaning we are not able to modify the flow of data coming from the client.
Logs are expected to be delivered through a client logger implementation with different severity levels available (debug, warning, error, etc.). This allows us to customize and limit the logs we send to the server.
I won't cover a full solution for a more flexible configuration, but it's recommended to have some control on the server side and be able to modify the client logging configuration remotely. That lets us mute unexpected logs, or enable other log severities when we need a deeper understanding of what's going on in certain components. Depending on your client application, this can be done through an existing config push service, or by creating a specific admin tool to customize this config and push it to (or have it pulled by) the client.
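As a minimal sketch of what that remote control could look like on the client side (in Go for consistency with the rest of the post; `ApplyRemoteConfig` and `ShouldReport` are made-up names for illustration):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Severity levels for the client logger, lowest to highest.
const (
	Debug int32 = iota
	Info
	Warning
	Error
)

// minLevel holds the minimum severity the client reports to the server.
// Accessed atomically so a remote config update can change it at runtime
// while logging continues on other goroutines.
var minLevel int32 = Debug

// ApplyRemoteConfig would be called whenever the config service pushes
// (or the client pulls) a new logging configuration.
func ApplyRemoteConfig(level int32) {
	atomic.StoreInt32(&minLevel, level)
}

// ShouldReport tells the client logger whether an entry of the given
// severity should be sent to the server.
func ShouldReport(level int32) bool {
	return level >= atomic.LoadInt32(&minLevel)
}

func main() {
	ApplyRemoteConfig(Warning) // server says: only warnings and above
	fmt.Println(ShouldReport(Error), ShouldReport(Debug)) // true false
}
```

The client logger would check `ShouldReport` before queueing an entry, so muting noisy logs or enabling debug output becomes a config change instead of a production deploy.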
ELK works perfectly fine for me: it's a simple solution that brings huge benefits for analysing and detecting anomalies, helping us build products that are more robust and, at the same time, more reliable for users.