diff --git a/docs/source/architecture.md b/docs/source/architecture.md
new file mode 100644
index 00000000..be532916
--- /dev/null
+++ b/docs/source/architecture.md
@@ -0,0 +1,106 @@
+# Architecture
+
+This is a living document describing the architecture of repo2docker
+from various perspectives.
+
+## Buildpack
+
+The **buildpack** concept comes from [Heroku](https://devcenter.heroku.com/articles/buildpacks)
+and Ruby on Rails' [Convention over Configuration](http://rubyonrails.org/doctrine/#convention-over-configuration)
+doctrine.
+
+Instead of writing a complete specification of exactly how they want their environment
+to be, users can focus only on how their environment differs from a conventional
+environment. This means that instead of deciding 'should I get Python from Apt or pyenv or ?', the user
+can just specify 'I want python-3.6'. Usually, specifying a **runtime** and a list of **libraries**
+with explicit **versions** is all that is needed.
+
+In repo2docker, a Buildpack does the following things:
+
+1. **Detect** if it can handle a given repository
+2. **Build** a base language environment in the docker image
+3. **Copy** the contents of the repository into the docker image
+4. **Assemble** a specific environment in the docker image based on repository contents
+5. **Push** the built docker image to a specific docker registry (optional)
+6. **Run** the built docker image as a docker container (optional)
+
+### Detect
+
+When given a repository, repo2docker first has to determine which buildpack to use.
+It takes the following steps to determine this:
+
+1. Look at the ordered list of `BuildPack` objects listed in the `Repo2Docker.buildpacks`
+   traitlet. This is populated with a default set of buildpacks in most-specific-to-least-specific
+   order. Other applications using this can extend or change this list using traditional
+   [traitlet](http://traitlets.readthedocs.io/en/stable/) configuration mechanisms.
+2. Call the `detect` method of each `BuildPack` object. This method assumes that the repository
+   is present in the current working directory, and should return `True` if the repository is
+   something that it should be used for. For example, a `BuildPack` that uses `conda` to install
+   libraries can check for the presence of an `environment.yml` file and say 'yes, I can handle this
+   repository' by returning `True`. Usually buildpacks look for the presence of specific files
+   (`requirements.txt`, `environment.yml`, `install.R`, etc.) to determine if they can handle a
+   repository or not.
+3. If no `BuildPack` returns `True`, then repo2docker will use the default `BuildPack` (defined in
+   the `Repo2Docker.default_buildpack` traitlet).
+
+## Build base environment
+
+Once a buildpack is chosen, it builds a **base environment** that is mostly the same for various
+repositories built with the same buildpack.
+
+For example, in `CondaBuildPack`, the base environment consists of installing [miniconda](https://conda.io/miniconda.html)
+and basic notebook packages (from `repo2docker/buildpacks/conda/environment.yml`). This is going
+to be the same for most repositories built with `CondaBuildPack`, so we want to use
+[docker layer caching](https://thenewstack.io/understanding-the-docker-cache-for-faster-builds/) as
+much as possible for performance reasons. Next time a repository is built with `CondaBuildPack`,
+we can skip straight to the **copy** step (since the base environment docker image *layers* have
+already been built and cached).
+
+The `get_build_scripts` and `get_build_script_files` methods are primarily used for this.
+`get_build_scripts` can return arbitrary bash script lines that can be run as different users,
+and `get_build_script_files` is used to copy specific scripts (such as a conda installer) into
+the image to be run as part of `get_build_scripts`. Code in either has the following constraints:
+
+1. You can *not* use the contents of the repository in them, since this happens before the repository
+   is copied into the image. For example, `pip install -r requirements.txt` will not work,
+   since there's no `requirements.txt` inside the image at this point. This is an explicit
+   design decision, to enable better layer caching.
+2. You *may*, however, read the contents of the repository and modify the scripts emitted based
+   on that! For example, in `CondaBuildPack`, if there's Python 2 specified in `environment.yml`,
+   a different kind of environment is set up. The reading of the `environment.yml` is performed
+   in the BuildPack itself, and not in the scripts returned by `get_build_scripts`. This is fine.
+   BuildPack authors should still try to minimize the variants created in this fashion, to
+   optimize the build cache.
+
+## Copy repository contents
+
+The contents of the repository are copied unconditionally into the Docker image, and made
+available for all further commands. This is common to most `BuildPack`s, and the code is in
+the `build` method of the `BuildPack` base class.
+
+## Assemble repository environment
+
+The **assemble** stage builds the specific environment that is requested by the repository.
+This usually means installing required libraries specified in a format native to the language
+(`requirements.txt`, `environment.yml`, `REQUIRE`, `install.R`, etc.).
+
+Most of this work is done in the `get_assemble_scripts` method. It can return arbitrary bash script
+lines that can be run as different users, and has access to the repository contents (unlike
+`get_build_scripts`). The docker image layers produced by this usually cannot be cached,
+so fewer restrictions apply to this than to `get_build_scripts`.
+
+At the end of the assemble step, the docker image is ready to be used in various ways!
+
+## Push
+
+Optionally, repo2docker can **push** a built image to a [docker registry](https://docs.docker.com/registry/).
+This is done as a convenience only (since you can do the same with a `docker push` after using repo2docker
+only to build), and implemented in the `Repo2Docker.push` method. It is only activated when the
+`--push` command-line flag is passed.
+
+## Run
+
+Optionally, repo2docker can **run** the built image and allow the user to access the Jupyter Notebook
+running inside by default. This is also done as a convenience only (since you can do the same with `docker run`
+after using repo2docker only to build), and implemented in `Repo2Docker.run`. It is activated by default
+unless the `--no-run` command-line flag is passed.
\ No newline at end of file
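
For illustration, the detection order described under **Detect** can be sketched in plain Python. This is a toy model under stated assumptions, not repo2docker's actual code: `ToyBuildPack`, `choose_buildpack`, `choose_for_files`, and the ordering of the two example buildpacks are hypothetical and introduced only for this sketch. Only the `detect` method name, the "first match in an ordered list wins" rule, and the fallback to a default buildpack come from the document.

```python
import os
import tempfile


class ToyBuildPack:
    """Hypothetical stand-in for a BuildPack; not repo2docker's real class."""

    def __init__(self, name, marker_files):
        self.name = name
        self.marker_files = marker_files

    def detect(self):
        # As described in the document, detection assumes the repository is
        # the current working directory and returns True if it can be handled.
        return any(os.path.exists(f) for f in self.marker_files)


def choose_buildpack(buildpacks, default):
    """Return the first buildpack whose detect() is True, else the default."""
    for buildpack in buildpacks:
        if buildpack.detect():
            return buildpack
    return default


def choose_for_files(filenames):
    """Create a throwaway repository containing `filenames` and pick a buildpack."""
    # Assumed ordering for illustration only; the real default list may differ.
    ordered = [
        ToyBuildPack("conda", ("environment.yml",)),
        ToyBuildPack("python", ("requirements.txt",)),
    ]
    default = ToyBuildPack("default", ())
    previous = os.getcwd()
    with tempfile.TemporaryDirectory() as repo:
        for filename in filenames:
            open(os.path.join(repo, filename), "w").close()
        os.chdir(repo)
        try:
            return choose_buildpack(ordered, default).name
        finally:
            # Leave the temporary directory before it is removed.
            os.chdir(previous)


if __name__ == "__main__":
    print(choose_for_files(["environment.yml", "requirements.txt"]))
    print(choose_for_files([]))
```

Because the list is walked in order, a repository that contains both marker files is claimed by whichever buildpack appears first, which is why the document stresses the most-specific-to-least-specific ordering.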