Management interface #2

Open
cholcombe973 opened this issue Aug 18, 2018 · 9 comments

@cholcombe973
Owner

How should the site reliability engineer (SRE) interface with this filesystem? SREs generally like metrics to be exported so that external systems can track the health of the cluster. A CLI is usually also pretty high on the list, as is a REST interface for people who are more DIY oriented.

@garypen
Collaborator

garypen commented Aug 27, 2018

SRE (general use case)

In general, an SRE is going to use whichever management facilities are provided by his/her platform and almost certainly doesn't want to learn new solutions (unless they provide massive benefits). For instance, Prometheus is a popular choice for Kubernetes. Many other "log consuming" solutions exist (Splunk, etc.), which all work in more or less the same way.

I think the best way to interact with these systems is to provide logging facilities which follow the "standards" in this area:

  • configurable level (error, warn, etc.)
  • configurable destination (stdout, logfile, etc.)

and then rely on tools such as Prometheus, Splunk, etc. to provide health monitoring, alerts, and so on, based on the log contents.
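
As a rough illustration of the two bullets above, here is a minimal std-only sketch of a logging configuration with a configurable level and destination. The names `Level`, `Destination`, and `LogConfig` are hypothetical and chosen for this example; nothing here is taken from the project or from any logging crate.

```rust
use std::path::PathBuf;
use std::str::FromStr;

/// Severity levels, declared from most to least severe so that the
/// derived ordering gives Error < Warn < Info < Debug.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Level {
    Error,
    Warn,
    Info,
    Debug,
}

impl FromStr for Level {
    type Err = String;

    /// Parses a level name case-insensitively.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.to_ascii_lowercase().as_str() {
            "error" => Ok(Level::Error),
            "warn" => Ok(Level::Warn),
            "info" => Ok(Level::Info),
            "debug" => Ok(Level::Debug),
            other => Err(format!("unknown log level: {}", other)),
        }
    }
}

/// Where log lines are written (hypothetical set of choices).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Destination {
    Stdout,
    Stderr,
    File(PathBuf),
}

impl FromStr for Destination {
    type Err = String;

    /// "stdout" and "stderr" are special; anything else is a file path.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "" => Err("empty log destination".to_string()),
            "stdout" => Ok(Destination::Stdout),
            "stderr" => Ok(Destination::Stderr),
            path => Ok(Destination::File(PathBuf::from(path))),
        }
    }
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LogConfig {
    pub max_level: Level,
    pub destination: Destination,
}

impl LogConfig {
    /// Builds a config from two strings, e.g. values of command-line flags.
    pub fn parse(level: &str, destination: &str) -> Result<Self, String> {
        Ok(LogConfig {
            max_level: level.parse()?,
            destination: destination.parse()?,
        })
    }

    /// Returns true when a message at `level` should be emitted.
    pub fn enabled(&self, level: Level) -> bool {
        level <= self.max_level
    }
}
```

With `LogConfig::parse("warn", "stdout")`, error and warn messages would be emitted and info or debug messages dropped.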

Logging

We could use the "log" crate (https://github.com/rust-lang-nursery/log) as a facade over our chosen logging facility. My current preference would be to use "log4rs", but with the protection of the facade we could change that decision later if required as the logging space evolves.

An alternative is "slog", which has widespread use.

CLI/REST

I like the approach adopted by many systems nowadays of writing an API (RESTful, ...) that supports management/configuration of a system and then writing a client that exercises that API and exposes most (all?) of the features.

Good examples include: kubectl (kubernetes), openstack (OpenStack), etc...

We should be doing something like this: rusixctl... ?
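
To make the "API first, client second" idea concrete, here is a small std-only sketch of how a client such as the proposed rusixctl might translate subcommands into REST requests. The `volume` subcommands, the `/v1/volumes` paths, and the `Request` type are all assumptions invented for this example; no such API has been defined in this thread.

```rust
/// A REST request the client would send (hypothetical shape).
#[derive(Debug, PartialEq, Eq)]
pub struct Request {
    pub method: &'static str,
    pub path: String,
}

/// Maps command-line arguments (without the program name) to a request.
/// Every command and route below is illustrative only.
pub fn to_request(args: &[&str]) -> Result<Request, String> {
    match args {
        ["volume", "list"] => Ok(Request {
            method: "GET",
            path: "/v1/volumes".to_string(),
        }),
        ["volume", "show", name] => Ok(Request {
            method: "GET",
            path: format!("/v1/volumes/{}", name),
        }),
        ["volume", "delete", name] => Ok(Request {
            method: "DELETE",
            path: format!("/v1/volumes/{}", name),
        }),
        _ => Err(format!("unrecognized command: {}", args.join(" "))),
    }
}
```

Keeping the mapping in one pure function means every CLI feature corresponds to exactly one API call, which is the property that makes kubectl-style clients easy to keep in step with the server.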

Deciding which web framework (if any) to adopt to implement the RESTful interface will be tricky. There are many candidates (warp, tower-web (soon), actix-web, conduit, gotham, etc...) and this is a space that is evolving rapidly. I like the look of warp, but it's very new. I also like gotham and actix-web. Any preferences?

@cholcombe973
Owner Author

Wow, yeah, I agree that setting up a REST API and then letting people do their thing would be best. I'm actually not familiar with warp or tower-web. I've created a few things with rocket though, and that was nice. I generally agree with everything you're saying here and I don't have a preference for a web framework. Rocket works well but it requires nightly, which is kind of a pain sometimes. Anything that stays on stable would be really nice. I've been kinda leaning away from JSON lately just because it's ambiguous, but I don't have a problem if you want to use it.

So far I've only really used the log crate. I gave slog a try a while back and it was alright. I didn't feel like I gained enough from it to prefer it though. I agree that making logging configurable in terms of destination and level is best. I have a few examples of doing that with the clap crate and it's super easy.

@jcgruenhage

I don't necessarily agree on generating metrics from logs. Having a built-in metrics endpoint that is scraped externally is a lot better IMO.

For logging, the log crate is definitely the way to go, with a backend like env_logger or fern.

For choosing a web framework, I'd currently go with actix-web. It has been really nice to use when I tried it, works on stable and is still progressing. Warp also looks nice, but a bit basic, and tower-web is not there yet.

@jcgruenhage

jcgruenhage commented Aug 29, 2018

For metrics, the tic crate looks like it would be a good fit.

@garypen
Collaborator

garypen commented Aug 29, 2018

Looks like we are in agreement about logging: use the log crate and then choose an appropriate back-end. I don't have strong feelings about that, so I'm happy to go with suggestions. For the web framework: actix-web is a good choice and I'm happy to go with that.

For metrics: I still think it seems like an unnecessary thing to be considering, mainly because we are going to do logging anyway and various ETL stacks do a good job of handling log data, e.g. Grafana provides great visualisations from log data. What additional functionality would something like "tic" be providing?

@jcgruenhage

For metrics data from logging to be usable, we'd need to log every event relevant to the metrics we want to have. If we use something separate, dedicated to metrics, we don't have to spam the logs with events that we probably don't care about and that make the logs less human-readable.

Tic presents those metrics on an http endpoint, where external scrapers (something like Prometheus for example) can get them. Those scrapers could then be used as a data source for Grafana and the like.
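
For a sense of what a scraper would fetch from such an endpoint, here is a minimal std-only sketch of in-process counters rendered in the Prometheus text exposition format. It does not use tic (its API is not shown in this thread), and the `Metrics` type and the `rusix_*` counter names are hypothetical. Serving the rendered string over HTTP is left to whichever web framework is chosen.

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicU64, Ordering};

/// A fixed set of named, monotonically increasing counters.
/// Counters are atomic, so `inc` only needs a shared reference.
pub struct Metrics {
    counters: BTreeMap<&'static str, AtomicU64>,
}

impl Metrics {
    /// Registers every counter up front, each starting at zero.
    pub fn new(names: &[&'static str]) -> Self {
        let mut counters = BTreeMap::new();
        for name in names {
            counters.insert(*name, AtomicU64::new(0));
        }
        Metrics { counters }
    }

    /// Adds `n` to a registered counter; unknown names are ignored.
    pub fn inc(&self, name: &str, n: u64) {
        if let Some(counter) = self.counters.get(name) {
            counter.fetch_add(n, Ordering::Relaxed);
        }
    }

    /// Renders all counters as Prometheus text, sorted by name.
    pub fn render(&self) -> String {
        let mut out = String::new();
        for (name, value) in &self.counters {
            out.push_str(&format!("# TYPE {} counter\n", name));
            out.push_str(&format!("{} {}\n", name, value.load(Ordering::Relaxed)));
        }
        out
    }
}
```

Because the counters live apart from the log stream, hot paths can bump them on every operation without adding a line to the logs, which is the separation argued for above.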

@garypen
Collaborator

garypen commented Aug 29, 2018

Lots of applications do log many metrics nowadays because the consumers of the data don't want to deal with multiple log sources. However, I agree with you that "spamming" the logs is a problem.

Ok, I'm happy to decide to keep metric data separate from other logged data. I'm happy with the choice of tic.

Can we close this and note that:

web-framework: actix-web
metrics: REST endpoint/tic
logging: log crate + back-end logger (to be decided later).
?

@jcgruenhage

I'd be fine with those three, yes.

Maybe an admin interface on top of that, one that shows a few basic things without needing a monitoring/metrics stack (Prometheus + AlertManager + Grafana), would still be useful? A list of storage devices with usage/health, replication/erasure coding settings/health, cluster capacity and usage, things like that.
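
As a sketch of the kind of data such an admin view could aggregate, the following std-only example sums capacity and usage over a list of devices and collects the ones that are not healthy. The `Device`, `Health`, and `ClusterSummary` types are hypothetical, and the sketch counts every device toward capacity regardless of health, which a real implementation might want to reconsider.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Health {
    Healthy,
    Degraded,
    Failed,
}

/// One storage device as the admin view might see it (hypothetical fields).
#[derive(Debug, Clone)]
pub struct Device {
    pub name: String,
    pub capacity_bytes: u64,
    pub used_bytes: u64,
    pub health: Health,
}

#[derive(Debug, PartialEq, Eq)]
pub struct ClusterSummary {
    pub capacity_bytes: u64,
    pub used_bytes: u64,
    pub unhealthy_devices: Vec<String>,
}

/// Aggregates per-device figures into cluster-wide totals.
pub fn summarize(devices: &[Device]) -> ClusterSummary {
    ClusterSummary {
        capacity_bytes: devices.iter().map(|d| d.capacity_bytes).sum(),
        used_bytes: devices.iter().map(|d| d.used_bytes).sum(),
        unhealthy_devices: devices
            .iter()
            .filter(|d| d.health != Health::Healthy)
            .map(|d| d.name.clone())
            .collect(),
    }
}

impl ClusterSummary {
    /// Usage as a percentage of capacity; 0.0 for an empty cluster.
    pub fn used_percent(&self) -> f64 {
        if self.capacity_bytes == 0 {
            0.0
        } else {
            self.used_bytes as f64 * 100.0 / self.capacity_bytes as f64
        }
    }
}
```

A summary like this could back both a REST route and a CLI subcommand, so the basic health picture is available without any external monitoring stack.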

@cholcombe973
Owner Author

cholcombe973 commented Aug 29, 2018 via email

@garypen garypen self-assigned this Sep 16, 2018
3 participants