Manual

Introduction

This manual describes webgen in detail to provide in-depth information. If you find something missing, don’t hesistate to write a mail to the mailing list!

If you haven’t read the Getting Started Guide yet, read it now. It gives you a bird’s view of the most important concepts.

The CLI

The executable for webgen is called… webgen ;-) It uses a command style syntax (like Rubygem’s gem commands) through the cmdparse library. To get an overview of the possible commands run webgen help.

The output uses ANSI colors by default on Linux and Mac. If ANSICON is installed on Windows, webgen will show colors on Windows, too.

The main command is the generate command which does the actual website generation. This command uses either the environment variable WEBGEN_WEBSITE (if it is set and non-empty) or the current working directory as website directory if none was specified via the gloabl -d option. All other commands that need a valid webgen website directory specified work in the same way.

You can invoke a command by specifying its name after the executable name. Also counting the executable webgen as a command, the options for a command are specified directly after the command name and before the next command or any arguments. For example, all the following command lines are valid:

$ webgen
$ webgen generate
$ webgen -d doc generate
$ webgen -v create website -t 1024px new_site
$ webgen help generate

Use webgen help to get an overview and short descriptions of the available commands. If you need information on a specific command, just append the command name(s) after “help”. For example, to get information on the command that creates a website, you would enter webgen help create website on the command line.

A webgen Website

webgen needs a special directory structure so that it works out of the box. This directory structure is automatically created by the webgen create website command.

The root directory of a webgen website is called the website directory. The following files and directories inside this directory are used:

  • webgen.config: This file can be used to set configuration options for the website. See the Configuration File section for more information.

  • src/: The source directory in which all the source files for the website are. If this directory should not be called src or you want to include additional source directories, you need to change the  sources configuration option.

  • out/: This is the destination directory which is created, if it does not exist, when webgen generates the output. All output is put into this directory. The name and location of this directory can be changed by setting the  destination configuration option.

  • ext/: The extension directory (optional). You can put self-written extensions into this directory so that they are used by webgen.

    All files called init.rb in this directory or any of its subdirectories are loaded when webgen is invoked. So you need to make sure to either place all extensions in init.rb files or load them from on of them.

  • tmp/: The temporary directory is automatically created by webgen when needed and contains temporary files and cache data. For example, content processor sass uses it for storing pre-compiled Sass files.

Configuration File

Many user will want to change some configuration options, for example, the default language of the website since not all people will want to write websites in English. This is primarily done via the configuration file.

This configuration file is called webgen.config and has to be placed directly under the website directory. It can either be a simple YAML file or can contain Ruby code for advanced configuration possibilities.

If the first line starts with a hash (leading whitespace is ignored) and is followed with the word “ruby” somewhere afterwards, the configuration file is assumed to be a valid Ruby source file and loaded. Otherwise it is assumed to be a YAML file and loaded with the YAML parser.

Setting configuration options with YAML is easy:

website.lang: de
website.link_to_current_page: true

Nesting of option names is also allowed and simplifies setting many options. Therefore the following will YAML file do exactly the same as the above one:

website:
  lang: de
  link_to_current_page: true

You can find a list of all available configuration options that can be set and their possible values in the Configuration Options Reference.

Generating The Website

webgen goes through multiple steps when generating a website. An overview of these steps is listed below and the following sub sections explain the details.

The following three steps are always done when webgen is started (even when the webgen command is invoked with another command):

  • Loading of extension bundles. The bundle “built-in” (contains everything that is shipped with webgen itself) is loaded first. After that all auto-loadable bundles are loaded, then all bundles in the website specific extension directory and last the ext/init.rb file.

  • Then the configuration is loaded from the configuration file.

  • And at last the cache data is read from the cache file (see the  website.cache configuration option).

The actual steps for generating the website are:

  • The node tree creation phase:

    • Gather a list and basic meta information of all source paths from the sources.

    • Create zero, one or more nodes from each source path using path handlers and arrange the nodes in a hierarchical structure as defined by their source paths.

    • Use the cached data, if available, to re-create any node that has not already been created.

  • The website generation phase, repeated until no more changes are detected:

The website generation phase may be executed more than once because it is possible that the internal tree structure changes during this phase. For example, after writing a page node fragment nodes for it may have been generated. So one and the same node may be written more than once when running webgen.

Sources And Passive Sources

A source is an extension that provides source paths (or just called paths) which correspond to files or directories. webgen can use paths from many sources. The most commonly used source – and the default one – is the source file_system which provides paths (and information on them) from the file system.

Paths created by a source already have some meta information associated with them, for example, the  title or  last modification time.

The sources that should be used are defined via the  sources configuration option. Paths from earlier sources have priority over the same paths from later sources. And paths that match one of the patterns in the  sources.ignore_paths configuration option are ignored.

webgen also has the concept of passive sources. These are sources that provide passive paths. Passive sources can only be defined programmatically as they are only really useful for extension bundle authors. Paths from passive sources can be overridden by paths with the same name from normal sources.

Passive paths are the same as normal paths except that they get the additional meta information  passive set on them. When this meta information is set, a node created from such path is only written to its destination path if the node has been referenced by another node.

This allows webgen (or extension bundles) to provide standard ressources that are only written to the destination directory if the node is actually referenced.

Source Paths

A source path abstracts the physical location of a file or directory. It may refer to an actual file/directory on a hard drive or it may refer some row in a database – this doesn’t matter.

When a source path is created, the source extension that created it has to ensure that certain meta information is associated with it (for example, the  modified_at meta information). Additionally, some meta information is taken from the path name itself.

For this to work webgen assumes that the paths follow a naming convention so that meta information can correctly be extracted from the path names. There are three different cases depending on the type of path (the individual parts of a path are explained below):

  • The path specifies a directory. It must end with a slash character and must not contain any hash characters (however it may contain dots). A directory path has to follow the scheme:

      /parent/basename/
    
  • The path specifies a file. It must not end with a slash character and must not contain any hash characters. A file path has to follow the scheme:

      /parent/[sort_info.]basename[.lang][.extension]
    

    There are two special cases:

    • If the file path has only one dot and only numbers are before the dot, then the first part is considered to be the basename and the second part the extension (eg. 53.png)!

    • If the file path has two dots, the first part consists only of numbers and the middle part can be identified as a language identifier, then the first part is considered to be the basename, the second part the language identifier and the third part the extension (eg. 53.de.png)!

  • The path specifies a fragment (e.g. part of a file). It must contain exactly one hash character and it has to follow the scheme (where /parent needs to be a file path):

      /parent#basename
    

Following is the explanation of the parts of the path names:

  • /parent

    This part specifies the path of the parent of this path and is used internally to create a hierarchy of the paths. Although the leading slash is explicitly written here, it is part of the parent path, i.e. each parent path begins with a slash. Also note that the parent path will end with a slash if it is a directory!

  • sort_info

    This part is optional and has to consist of the digits 0 to 9. Its value is used for the  sort_info meta information. If not specified, it defaults to the value zero.

  • basename

    This part is used on the one hand to generate the  title meta information (but with _ and - replaced by spaces and capitalized). And on the other hand, the canonical names are derived from it. basename must not contain any dots, spaces or any character from the following list: ; ? * : ` & = + $ ,. If you do use one of them, webgen may not work correctly!

  • lang

    This part is optional and has to be an ISO-639-1/2 language identifier (two or three characters (a-z) long). It is used for the  lang meta information. It can only be specified if an extension is also specified. If the file path should not have an extension, then just add a trailing dot, e.g. /dir/file.de. - the trailing dot is ignored by webgen.

    If the language identifier can’t be matched to a valid language, it is assumed that this part isn’t actually a language identifier but a part of the extension. This also means that in the special case where the first part of an extension is also a valid language identifier, the first part is interpreted as language identifier and not as part of the extension!

  • extension

    The file extension can be anything and can include dots.

Following are some examples of source path names:

Path Type Basename Language title sort_info
/directory/ Directory directory -none- Directory -none-
/image.png File image -none- Image 0
/image.de.png File image de Image 0
/01.name_of-file.eo.page File name_of-file eo Name of file 1
/53.png File 53 -none- 53 0
/archive.tar.bz2 File archive -none- Archive 0
/archive.de.tar.bz2 File archive de Archive 0
/manual.html#sources Fragment manual -none- Manual -none-

Notice: The first two file path and the last two file path examples define the same content for two different languages (or more exactly: the first one in both cases is unlocalized and the second one localized to German) as they have the same canonical name, see below.

Path Handlers

Since one source path can potentially be used to create many destination paths, a source path cannot be used directly to address this issue. Instead so called nodes are created from source paths via path handlers.

So path handlers are extensions that take a path and create zero, one or more nodes from it.

The created nodes are then associated with their path handler. This information is later used for rendering the nodes so that they can be written to their destination path.

This means that for each kind of path a path handler needs to be written. webgen provides many common path handlers out of the box, for example, for handling template and page paths. The path handlers documentation has the full list.

The paths which should be handled by a specific path handler are either predefined by the path handler or can be set via the  handler meta information. The passive /default.metainfo path associates all built-in path handlers with their extensions using path patterns. Which extensions are used by which path handler can be found on their documentation pages.

Note that paths which are not handled by any path handler are ignored because webgen doesn’t know what to do with them! If you find that something isn’t handled at all, turn on the verbose output mode to see if some path doesn’t get handled!

Before a path gets handled by a path handler and after a node has been created, meta information defined in meta information paths is applied. This is necessary because of multiple reasons:

  • Some path types don’t provide the ability to specify meta information. The custom Webgen Page Format can provide meta information but for images and other standard formats this is not possible. The only way to specify meta information for such paths/nodes is via meta information paths.

  • Some meta information needs to be set before node creation (like the  dest_path meta information).

  • Some meta information is better set after node creation (for example, if you only want to set some meta information on nodes in a specific language).

See the path handler meta_info documentation for detailed information!

Nodes And Canonical Names

Nodes are the internal representation of the destination path(s) created from source paths. Each node is associated with the source path from which it was created, the path handler which was used to create the node, its destination path and meta information. Therefore a node holds all information need to work with it.

webgen arranges the nodes in a hierarchy which is based on the hierarchy of the source paths. This is done because the naming scheme for nodes is based on the source path names.

Note that by default this hierarchy is also reflected in the destination paths, i.e. a source path /images/private/test.png will get written to this location in the destination directory. However, webgen allows the destination hierarchy to be different. If you do this, you have to remember that the nodes are arranged in the hierarchy defined by the source paths!

As one source path can be used to create many nodes a special naming scheme is used to identify nodes, the so called canonical names.

These canonical names are automatically derived from the source path name, the  version meta information, the  lang meta information as well as the  cn meta information:

  • The basis for all names is the canonical name or cn. It is created from the basename of the source path, the  version meta information and the extension. This can be changed by using the  cn meta information.

    No language and no parent information is present in the cn. This means that more than one node under a certain parent node may have the same cn.

  • The absolute canonical name or acn is created by prepending the acn of the parent node to the cn.

    It is used, for example, by the default localization strategy – see the localization section.

  • The localized canonical name or lcn is created from the cn by inserting the two letter language code (or the three letter one if the two letter one does not exist) after the first occurence of a dot in the cn (i.e. usually before the extension).

    An lcn is unique under a parent node. Note, however, that the lcn may be equal to the cn if the node is unlocalized, i.e. does not contain language information.

  • The absolute localized canonical name or alcn is created by prepending the alcn of the parentn ode to the lcn.

    An alcn is unique, i.e. there are no two nodes that can have the same alcn. This is the name to uniquely identify a node!

Here are some examples:

Source path CN ACN LCN ALCN
/images/test/ test/ /images/test/ test/ /images/test/
/images/logo.png logo.png /images/logo.png logo.png /images/logo.png
/images/logo.de.png logo.png /images/logo.png logo.de.png /images/logo.de.png
/test.de.html#frag #frag /test.html#frag #frag /test.de.html#frag

Path handlers can change some parts of a source path. For example, the path handler page changes the extension from page to html. This has to be taken into account when specifying canonical names later!

If such changes are done they are documented on the documentation pages for the path handlers!

Internally webgen uses the canonical names for resolving paths and if this returns nothing the destination paths are tried. This means that you should always use the (absolute) (localized) canonical names, and not source paths!

This might be a bit counterintuitive at first but it allows one to specify paths independently from the destination paths and therefore the destination paths may be changed later without the need to change any path names!

Destination Path Construction

Each node represents one part of the website and since the canonical names of a node are only used internally, a destination path needs to be created for each node that specifies the location to which the rendered content of a node should be written.

The  dest_path meta information is used for this purpose, have a look at its documentation before checking out the examples.

Following are some examples of destination paths for some source path names (assuming that “en” is the default language, the version name is “default” and that the parent path is /img/):

  • <parent><basename>(-<version>)(.<lang>)<ext> (the default)

    • index.jpg/img/index.jpg

      Since the source path is unlocalized, no language part is used. The version part isn’t used as well because the version is “default”.

    • index.en.jpg/img/index.jpg

      This happens if the  path_handler.lang_code_in_dest_path configuration option is false or except_default and no unlocalized version of this path exists.

    • index.en.jpg/img/index.en.jpg

      Similar to the last example but this result occurs when there is an unlocalized version of the path (which got /img/index.jpg as its destination path) and the  path_handler.lang_code_in_dest_path configuration option is not false. If it is set to false, an error would occur because of duplicate destination paths.

    • index.de.jpg/img/index.de.jpg

      Since “de” is not the default language, the language part is used (except if the  path_handler.lang_code_in_dest_path configuration option is set to false).

  • <parent><year>/<month>/<basename>(-<version)(.<lang>)<ext>

    • index.jpg/img/2008/09/index.jpg

      This destination path style can be used to create blog style destination names.

Path Patterns

Path patterns are used in many situations in webgen. For example, path patterns define which source paths are handled by which path handler or which nodes to use in a node finder option set.

The path patterns look like normal file system paths or node (a)(l)cns but some characters have a special meaning, like * (match any file), ** (match recursively or expansively) and ? (match a single character). Have a look at the File.fnmatch API documentation for detailed information.

Note that patterns that should match fragment paths have to contain a hash character! This is necessary to prevent catch-all patterns to falsely match fragment paths.

Here are some example path patterns:

Path Pattern Result
**/*.html All files with the extension html in any directory
*/*.html All files with the extension html in sub-directories of the current directory
**/??? All files in any directory whose file name is exactly three characters long
**/*#* All fragments under any file in any directory
test.html#* | All fragments under the test.html` file  

Localization

When multiple nodes share the same acn, webgen assumes that they have the same content but in different languages.

Directory and fragment nodes are never localized, only file nodes are!

It is also possible to specify a language independent node (i.e. one without a language) which is used as a fallback. Therefore when a node is resolved using a unlocalized canonical name and a given language, it is first tried to get the node in the requested language. If this is not possible (ie. no such localization exists), the unlocalized node is returned if it exists.

Note that this is the default localization strategy that works out of the box. However, by setting the  translation_key meta information one can customize this behaviour.

Here is an example. Assume that we have the following files inside the source directory:

/test.jpg
/images/my.jpg
/images/my.de.jpg
/index.page
/index.fr.page
/index.de.page

Since the path /images/my.jpg has no language part, it is assumed to be unlocalized. In contrast, /images/my.de.jpg has the “same” picture but in a German version. So, when specifying /images/my.jpg in index.de.page, the correct localized version will be returned, i.e. /images/my.de.jpg. When specifying /images/my.jpg in index.page or in index.fr.page, the unlocalized version will be returned since no localized version for the requested languages exist.

When using an lcn or alcn to resolve a node, only the correct localized node is returned if it exists, never an unlocalized version.

Continuing the example above, this means that /images/my.de.jpg is always returned when one specifies /images/my.de.jpg in any of the index page files, e.g. even in /index.fr.page.

Extending webgen

If you know the programming language Ruby, you can easily extend webgen and add new features that you need.

All the needed developer documentation is available in the API documentation, along with sample implementations for all major types of extensions (sources, destinations, content processors, tags, path handlers, …) and detailed information about the inner workings of webgen.

It is also possible to override a built-in extension. If the extension was registered using some #register method (e.g. Webgen::ContentProcessor#register), just register your extension with the same name. If the extension was registered directly on the website.ext object by assignment, just assign your extension to the needed name (e.g. website.ext.content_processor = MyCP.new).

You might also want to have a look at https://github.com/gettalong and the repositories which are named like webgen-*-bundle. These are real world extensions that might help you get started.