Persist repository build logs for later access #1156
Comments
Using an S3-compatible object store might be an option? It's easier to set up on a public cloud than managing a Kubernetes persistent volume, and it also means we don't need an endpoint to download the logs: if the bucket is public you can give people a direct HTTPS download link. You can optionally specify a TTL for auto-deletion, and it should work out cheaper than a disk volume.
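For the TTL part, a hedged sketch of an S3 lifecycle rule set via boto3; the bucket name and retention period are assumptions:

```python
import boto3

# Expire build logs automatically instead of managing deletion ourselves.
s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="binder-build-logs",  # assumed bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-build-logs",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},    # apply to every object
                "Expiration": {"Days": 30},  # assumed retention period
            }
        ]
    },
)
```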
@manics That sounds like a great solution. We would need to:
I wonder how much can be accomplished with existing software for Kubernetes logging, and what makes sense to develop ourselves. Here are some relevant links to consider this further.
I like the idea of storing the "end product" in an S3 bucket, especially if what we store is "finished HTML", so that serving a "log page" can be done by nginx or even directly from the bucket.

I read the k8s logging docs and I'm not sure which of the scenarios they list would fit us best. An idea from the guide that I like is to add a sidecar to the repo2docker Pod that takes care of processing and streaming the logs to the bucket. I have used filebeat to ship JSONL from a file in a container to an Elasticsearch instance. It worked well once set up, but I found the documentation for filebeat confusing/hard to read, so it involved a lot of trial & error. A quick google suggests filebeat can't ship to S3. On the one hand, having an off-the-shelf tool do this for us would be nice (one less thing to maintain); on the other hand, it might be as much or more work to find and configure one as it is to write a small utility ourselves that does exactly what we want (produce ANSI-coloured HTML output; see the sketch below).

For (3) from Min's suggestion, thinking about the API endpoint: we could have an endpoint that maps a repo spec to the log's bucket URL.

What is a nice way to make the URL of the log available inside the launched Binder, or do we skip that for now? I'm not sure how to do that nicely. Could BinderHub use that API endpoint to work out the bucket URL (or use the code behind that endpoint directly) and set it as an environment variable? Would we frequently be sending people to the previous build's logs because log processing hadn't completed / the new bucket wasn't ready yet when the Pod is launched after being built?
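As a rough illustration of that small-utility idea, here's a hedged sketch using the ansi2html library to turn a raw ANSI-coloured log into a standalone HTML page; the file paths are assumptions:

```python
from ansi2html import Ansi2HTMLConverter  # pip install ansi2html

# Convert a raw ANSI-coloured build log into a self-contained HTML page
# that could be uploaded to the bucket next to (or instead of) the raw log.
conv = Ansi2HTMLConverter(title="Build log")
with open("/tmp/build.log") as f:     # assumed location of the raw log
    html = conv.convert(f.read())     # produces a full HTML document by default
with open("/tmp/build.log.html", "w") as f:
    f.write(html)
```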
I think it'd be best to store the raw logs (plain text) instead of wrapping them in HTML. It's what Travis and GitHub Actions do, and it means that if you've got a large log you can easily download and search it. Initially I think giving users a direct link to the plain text is fine, since this is a new feature and you'll probably need some developer experience to understand it. A second version could add a simple HTML viewer to BinderHub.

I don't think you can easily stream logs to S3, so it'd be a case of writing them to a temporary file and uploading at the end of the build (Edit: see https://stackoverflow.com/a/8881939).

We already calculate a deterministic image name for each build, so if we name the log something like `<image-name>.log`, the location of a build's log becomes predictable too.
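A hedged sketch of that write-then-upload step, assuming boto3 and an S3 bucket; the bucket, file path, and image name here are illustrative, not BinderHub's actual values:

```python
import boto3

s3 = boto3.client("s3")  # endpoint/credentials for any S3-compatible store

# The build writes its log to a temporary file; once the build ends we
# upload it as plain text, keyed on the deterministic image name.
image_name = "example-image-name"  # placeholder for the real computed name
s3.upload_file(
    "/tmp/build.log",              # temporary file written during the build
    "binder-build-logs",           # assumed bucket name
    f"{image_name}.log",
    ExtraArgs={"ContentType": "text/plain"},
)
# If the bucket is public, users can then be given a direct link such as
# https://binder-build-logs.s3.amazonaws.com/<image-name>.log
```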
It seemed like a fun mini-project to investigate on a rainy day: jupyterhub/repo2docker#967. The S3 upload bit should be relatively easy whether it's in repo2docker or BinderHub; as has been pointed out, the difficult bit is getting hold of all the logs we want. I also found moby/buildkit#1472, which, if implemented, would allow us to access logs inside a container that has failed to build.
Thinking ahead to run-time logs: using a centralised logging system could work, but we'd still need some infrastructure on top of it to filter the relevant logs for users. I don't think it's safe to make all logs public to everyone, since there may be private information in there, either related to launching Jupyter or because users have run something private in their container that has emitted some logs. If this was limited to only the launch phase, then if BinderHub knows the pod name it could do something like upload the output of `kubectl logs` for that pod.
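A hedged sketch of that launch-phase idea, using the official Kubernetes Python client instead of shelling out to kubectl; the pod, namespace, and bucket names are assumptions:

```python
import boto3
from kubernetes import client, config

# Fetch the pod's logs after the launch phase and archive them to the bucket.
config.load_incluster_config()  # BinderHub runs inside the cluster
v1 = client.CoreV1Api()
log_text = v1.read_namespaced_pod_log(name="jupyter-example-pod",
                                      namespace="binder")

boto3.client("s3").put_object(
    Bucket="binder-launch-logs",   # assumed, separate from build logs
    Key="jupyter-example-pod.log",
    Body=log_text.encode(),
    ContentType="text/plain",
)
```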
I think we can handle run-time logs with a much simpler approach: use an entrypoint that tees the 'real' entrypoint's output to a file folks can read within the container, rather than relying on deployment-specific external storage. Something like the sketch below.
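A minimal Python sketch of such a tee-ing wrapper; the log path and the convention of passing the real command as arguments are assumptions:

```python
#!/usr/bin/env python3
"""Wrapper entrypoint: run the real command, tee its output to a file."""
import subprocess
import sys

LOG_PATH = "/tmp/entrypoint.log"  # assumed path, readable inside the container

with open(LOG_PATH, "ab") as log:
    # The real entrypoint and its arguments are passed as our arguments.
    proc = subprocess.Popen(sys.argv[1:], stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    for line in proc.stdout:
        sys.stdout.buffer.write(line)  # keep streaming to the pod's stdout
        sys.stdout.buffer.flush()
        log.write(line)                # and keep a copy users can read
    sys.exit(proc.wait())
```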
It doesn't persist beyond the life of the container, but I think that's a good thing.
This issue has been mentioned on the Jupyter Community Forum. There might be relevant details there: https://discourse.jupyter.org/t/accessing-the-jupyter-notebook-logs/6263/2
Proposed change
When a repository is being built, we stream the output of repo2docker to the website the user sees. It would be great if this log were also available after the build has failed. For example, builds can take a very long time (hours or more), during which users close the tab or otherwise leave. They have no way to recover the build log once the build has failed; the only option is to restart the build and this time wait. It is also hard to copy & paste text from the build log to share for debugging purposes. In both cases it would help to have a stable URL that shows the build log for some time.
This is an attempt to make an actionable issue out of #155.
Alternative options
For successful builds we could store the build log in the container image as a special file. This way users could access it from "within" their binder. This would not help users whose builds fail, as they'd never get a launching binder.
Who would use this feature?
People who have long-running builds that need debugging, or who otherwise want to share or look at the output of a build.
(Optional): Suggest a solution
A new endpoint in BinderHub that outputs the build log of the last build for a given "repo spec". The log would be overwritten the next time the same spec is built. This gives a reasonably stable URL that is easy to discover. An alternative would be to assign a "build number" to each build; this would be super stable but creates the challenge of how a user would discover the build number of their build.
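As a hedged sketch (not BinderHub's actual API), such an endpoint could be a small Tornado handler that redirects the repo spec to the log's bucket URL; the URL layout and settings key are assumptions:

```python
from tornado import web

class BuildLogHandler(web.RequestHandler):
    """Serve the last build log for a repo spec by redirecting to the bucket."""

    def get(self, provider, spec):
        # One log object per spec, overwritten on each rebuild, so the
        # URL stays stable and easy to discover.
        bucket_url = self.settings["log_bucket_url"]  # assumed setting
        self.redirect(f"{bucket_url}/{provider}/{spec}.log")

# Registered alongside BinderHub's other handlers, e.g.:
# (r"/build_logs/([^/]+)/(.+)", BuildLogHandler)
```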
Build logs can be large, which means it is probably not feasible to store them in BinderHub's process memory. Keeping them in memory or on disk in the pod also means that, on a cluster running several instances of the BinderHub pod, you'd need a mechanism for routing requests for a log to the right pod. This points towards the need for an additional service in which to store the logs.
The BinderHub process already sees a copy of the build log as it streams it to users, so it could stream it to a log sink at the same time (a sketch of this follows). Alternatively, the log could be archived independently of the BinderHub process, directly from the repo2docker pod.
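A hedged sketch of the first option, accumulating the stream BinderHub already relays and archiving it when the build ends; the function and names here are illustrative:

```python
import boto3

def relay_and_archive(event_stream, send_to_client, image_name):
    """Stream build events to the user while keeping a copy for archival."""
    lines = []
    for event in event_stream:  # the events BinderHub already iterates over
        send_to_client(event)   # existing behaviour: stream to the browser
        lines.append(event)     # new: retain a copy
    boto3.client("s3").put_object(
        Bucket="binder-build-logs",  # assumed bucket name
        Key=f"{image_name}.log",
        Body="".join(lines).encode(),
        ContentType="text/plain",
    )
```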