# Architecture of repo2docker

This is a living document talking about the architecture of repo2docker from various perspectives.

## Buildpacks

The buildpack concept comes from Heroku and Ruby on Rails' Convention over Configuration doctrine.

Instead of the user writing a complete specification of exactly how they want their environment to be, they can focus only on how their environment differs from a conventional environment. This means that instead of deciding 'should I get Python from apt, pyenv, or something else?', the user can just specify 'I want python-3.6'. Usually, specifying a runtime and a list of libraries with explicit versions is all that is needed.

In repo2docker, a Buildpack does the following things:

  1. Detect if it can handle a given repository
  2. Build a base language environment in the docker image
  3. Copy the contents of the repository into the docker image
  4. Assemble a specific environment in the docker image based on repository contents
  5. Push the built docker image to a specific docker registry (optional)
  6. Run the built docker image as a docker container (optional)
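
As a rough illustration, a buildpack can be thought of as a class with roughly the following shape. The method names come from the sections below, but the class itself, the exact signatures, and the (user, script) tuple format are simplifications for this sketch, not the definitive repo2docker API:

```python
import os


class ExampleBuildPack:
    """Illustrative sketch of a buildpack; not the real repo2docker base class."""

    def detect(self):
        # Step 1: can this buildpack handle the repository in the
        # current working directory?
        return os.path.exists("example-spec.yml")

    def get_build_scripts(self):
        # Step 2: set up the base language environment.  This runs before
        # the repository is copied in, so it must not read repository files.
        return [("root", "echo 'install a language runtime here'")]

    def get_assemble_scripts(self):
        # Step 4: set up the repository-specific environment.  This runs
        # after the copy step, so it may reference repository files.
        return [("${NB_USER}", "echo 'install libraries from example-spec.yml here'")]
```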

### Detect

When given a repository, repo2docker first has to determine which buildpack to use. It takes the following steps to determine this:

  1. Look at the ordered list of BuildPack objects listed in the Repo2Docker.buildpacks traitlet. This is populated with a default set of buildpacks in most-specific-to-least-specific order. Other applications using repo2docker can add to or change this list using traditional traitlet configuration mechanisms.
  2. Call the detect method of each BuildPack object. This method assumes that the repository is present in the current working directory, and should return True if the buildpack should be used for this repository. For example, a BuildPack that uses conda to install libraries can check for the presence of an environment.yml file and say 'yes, I can handle this repository' by returning True. Usually buildpacks look for the presence of specific files (requirements.txt, environment.yml, install.R, manifest.xml, etc.) to determine if they can handle a repository or not. Buildpacks may also look into specific files to determine specifics of the required environment, such as the Stencila integration, which extracts the required language-specific execution contexts from an XML file (see the base BuildPack). More than one buildpack may use such information, as properties can be inherited (e.g. the R buildpack uses the list of required Stencila contexts to see if R must be installed).
  3. If no BuildPack returns True, then repo2docker will use the default BuildPack (defined in the Repo2Docker.default_buildpack traitlet). A sketch of this selection loop follows the list.
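
The selection loop described above looks roughly like this (a sketch, not repo2docker's actual code; the function name is made up for illustration):

```python
def choose_buildpack(buildpacks, default_buildpack):
    """Pick the first buildpack whose detect() claims the repository.

    `buildpacks` is ordered most-specific to least-specific, and the
    repository is assumed to already be in the current working directory.
    """
    for buildpack in buildpacks:
        if buildpack.detect():
            return buildpack
    # Nothing matched: fall back to the default buildpack.
    return default_buildpack
```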

### Build base environment

Once a buildpack is chosen, it builds a base environment that is mostly the same for various repositories built with the same buildpack.

For example, in CondaBuildPack, the base environment consists of installing miniconda and basic notebook packages (from repo2docker/buildpacks/conda/environment.yml). This is going to be the same for most repositories built with CondaBuildPack, so we want to use docker layer caching as much as possible for performance reasons. Next time a repository is built with CondaBuildPack, we can skip straight to the copy step (since the base environment docker image layers have already been built and cached).

The get_build_scripts and get_build_script_files methods are primarily used for this. get_build_scripts can return arbitrary bash script lines that can be run as different users, and get_build_script_files is used to copy specific scripts (such as a conda installer) into the image to be run as part of get_build_scripts. Code in either has the following constraints (a sketch of both methods follows the list):

  1. You can not use the contents of the repository in them, since this happens before the repository is copied into the image. For example, pip install -r requirements.txt will not work, since there's no requirements.txt inside the image at this point. This is an explicit design decision, to enable better layer caching.
  2. You may, however, read the contents of the repository and modify the scripts emitted based on that! For example, in CondaBuildPack, if there's Python 2 specified in environment.yml, a different kind of environment is set up. The reading of the environment.yml is performed in the BuildPack itself, and not in the scripts returned by get_build_scripts. This is fine. BuildPack authors should still try to minimize the variants created in this fashion, to optimize the build cache.
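
For illustration, a conda-style buildpack might implement the two methods roughly as below. The {source: destination-in-image} mapping, the (user, script) tuples, the helper script name, and the python_version attribute are all assumptions made for this sketch, not the exact CondaBuildPack code:

```python
class CondaStyleBuildPack:
    """Illustrative only; not the real CondaBuildPack."""

    # Hypothetical attribute, set by reading environment.yml in the buildpack
    # itself (not in the emitted scripts), per constraint 2 above.
    python_version = "3"

    def get_build_script_files(self):
        # Files shipped with the buildpack itself (not with the repository),
        # copied into the image so that get_build_scripts can run them.
        return {"conda/install-miniconda.bash": "/tmp/install-miniconda.bash"}

    def get_build_scripts(self):
        # (user, bash snippet) pairs.  Nothing here may reference repository
        # contents: the repository has not been copied into the image yet.
        scripts = [("root", "bash /tmp/install-miniconda.bash")]
        # Reading the repository *here*, to decide which scripts to emit, is
        # fine -- e.g. create a Python 2 environment when environment.yml
        # asks for it.
        if self.python_version == "2":
            scripts.append(("root", "conda create --yes --name python2 python=2.7"))
        return scripts
```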

### Copy repository contents

The contents of the repository are copied unconditionally into the Docker image, and made available for all further commands. This is common to most BuildPacks, and the code is in the build method of the BuildPack base class.

### Assemble repository environment

The assemble stage builds the specific environment that is requested by the repository. This usually means installing required libraries specified in a format native to the language (requirements.txt, environment.yml, REQUIRE, install.R, etc).

Most of this work is done in the get_assemble_scripts method. It can return arbitrary bash script lines that can be run as different users, and it has access to the repository contents (unlike get_build_scripts). The docker image layers produced by this step usually cannot be cached, so fewer restrictions apply here than to get_build_scripts.
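
As a sketch (signature and details assumed for illustration, following the pip and requirements.txt example from the constraints above):

```python
import os


class PipStyleBuildPack:
    """Illustrative only; shows the shape of an assemble step."""

    def get_assemble_scripts(self):
        assemble_scripts = []
        # Unlike get_build_scripts, the repository contents are available
        # here (the repository is in the current working directory), so the
        # emitted script may reference files from it.
        if os.path.exists("requirements.txt"):
            assemble_scripts.append(
                ("${NB_USER}", "pip install --no-cache-dir -r requirements.txt")
            )
        return assemble_scripts
```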

At the end of the assemble step, the docker image is ready to be used in various ways!

### Push

Optionally, repo2docker can push a built image to a docker registry. This is done as a convenience only (since you can do the same with a docker push after using repo2docker only to build), and is implemented in the Repo2Docker.push method. It is only activated when the --push command line flag is used.

### Run

Optionally, repo2docker can run the built image and allow the user to access the Jupyter Notebook running inside it by default. This is also done as a convenience only (since you can do the same with docker run after using repo2docker only to build), and is implemented in Repo2Docker.run. It is activated by default unless the --no-run command line flag is passed.
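
For example, to build an image, push it to a registry, and skip running the container (the image name and repository URL below are placeholders):

```bash
repo2docker --no-run --push \
    --image-name my-registry.example.com/my-user/my-image:latest \
    https://github.com/my-user/my-repository
```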

## ContentProviders

ContentProviders provide a way for repo2docker to know how to find and retrieve a repository. They follow a similar pattern to the BuildPacks described above. When repo2docker is called, its main argument will be a path to a repository. This might be a local path or a URL. Upon being called, repo2docker will loop through all ContentProviders and perform the following steps:

  * Run the detect() method on the repository path given to repo2docker. This should return any value other than None if the path matches what the ContentProvider is looking for.

    For example, the Local ContentProvider checks whether the argument is a valid local path. If so, detect() returns a dictionary: {'path': source}, which defines the path to the repository. This path is used by fetch() to check that it matches the output directory.

  * If detect() returns something other than None, run fetch() with the returned value as its argument. This should result in the contents of the repository being placed in a local folder. A sketch of a provider following this pattern is shown below.
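
The sketch below is modeled loosely on the Local ContentProvider described above; the method signatures are simplified for illustration and are not the exact base-class API:

```python
import os


class LocalStyleContentProvider:
    """Illustrative only; follows the detect()/fetch() pattern described above."""

    def detect(self, source):
        # Return something other than None if we can handle `source`;
        # the returned value is later handed to fetch().
        if os.path.isdir(source):
            return {"path": source}
        return None

    def fetch(self, spec, output_dir):
        # Make the repository contents available in `output_dir`.  For a
        # local path there is nothing to copy; a remote provider (e.g. git)
        # would download the contents here instead.
        assert spec["path"] == output_dir
```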

For more information on ContentProviders, take a look at the ContentProvider base class which has more explanation.