AEM Dispatcher

Methods for Caching

The Dispatcher has two primary methods for updating the cache content when changes are made to the website.

  • Content Updates remove the pages that have changed, as well as files that are directly associated with them.
  • Auto-Invalidation automatically invalidates those parts of the cache that may be out of date after an update. That is, it flags the relevant pages as out of date without deleting anything.

Content Updates

In a content update, one or more AEM documents change. AEM sends a syndication request to the Dispatcher, which updates the cache accordingly:

  1. It deletes the modified file(s) from the cache.
  2. It deletes all files that start with the same handle from the cache. For example, if the file /en/index.html is updated, all the files that start with /en/index. are deleted. This mechanism allows you to design cache-efficient sites, especially in regard to picture navigations.
  3. It touches the so-called statfile; this updates the timestamp of the statfile to indicate the date of the last change.
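
The three steps above can be sketched in Python as follows. This is a minimal illustration rather than Dispatcher's actual implementation: the function names, the temporary docroot layout, and the .stat file name are assumptions made for the example.

```python
import tempfile
from pathlib import Path


def apply_content_update(docroot: Path, handle: str, statfile: Path) -> list[str]:
    """Simulate a content update for `handle`, e.g. "/en/index.html".

    Returns the names of the cached files that were deleted.
    """
    target = docroot / handle.lstrip("/")
    # "/en/index.html" -> prefix "index."; every cached file in the same
    # folder whose name starts with this prefix shares the handle.
    prefix = target.name.split(".", 1)[0] + "."
    deleted = []
    if target.parent.is_dir():
        for cached in sorted(target.parent.iterdir()):
            if cached.is_file() and cached.name.startswith(prefix):
                cached.unlink()  # steps 1 and 2: remove the page and related files
                deleted.append(cached.name)
    statfile.touch()  # step 3: record the time of the last change
    return deleted


def demo_content_update() -> tuple[list[str], list[str]]:
    """Build a throwaway cache, update /en/index.html, report what happened."""
    with tempfile.TemporaryDirectory() as tmp:
        docroot = Path(tmp)
        (docroot / "en").mkdir()
        for name in ("index.html", "index.title.gif", "about.html"):
            (docroot / "en" / name).write_text("cached")
        deleted = apply_content_update(docroot, "/en/index.html", docroot / ".stat")
        remaining = sorted(p.name for p in (docroot / "en").iterdir())
        return deleted, remaining
```

In this sketch, updating /en/index.html also removes index.title.gif because it starts with the same handle, while about.html stays in the cache.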

The following points should be noted:

  • Content Updates are typically used in conjunction with an authoring system which “knows” what must be replaced.
  • Files that are affected by a content update are removed, but not replaced immediately. The next time such a file is requested, the Dispatcher fetches the new file from the AEM instance and places it in the cache, thereby overwriting the old content.
  • Typically, automatically generated pictures that incorporate text from a page are stored in picture files starting with the same handle – thus ensuring that the association exists for deletion. For example, you may store the title text of the page mypage.html as the picture mypage.titlePicture.gif in the same folder. This way the picture is automatically deleted from the cache each time the page is updated, so you can be sure that the picture always reflects the current version of the page.
  • You may have several statfiles, for example one per language folder. If a page is updated, the Dispatcher looks for the next parent folder containing a statfile, and touches that file.

Auto-invalidation

Auto-invalidation automatically invalidates parts of the cache – without physically deleting any files. At every content update, the so-called statfile is touched, so its timestamp reflects the last content update.

The Dispatcher has a list of files that are subject to auto-invalidation. When a document from that list is requested, the Dispatcher compares the date of the cached document with the timestamp of the statfile:

  • If the cached document is newer, the Dispatcher returns it.
  • If it is older, the Dispatcher retrieves the current version from the AEM instance.
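
The timestamp comparison described above can be illustrated with a short Python sketch. It is not Dispatcher code: the function names are hypothetical, and the handling of a missing statfile is an assumption made for the example.

```python
import os
import tempfile
from pathlib import Path


def is_cache_valid(cached_file: Path, statfile: Path) -> bool:
    """Return True if the cached copy may be served as is.

    The cached document is valid only if it is newer than the statfile,
    that is, it was stored after the last content update.
    """
    if not cached_file.exists():
        return False  # nothing cached yet: the document must be rendered
    if not statfile.exists():
        return True  # assumption for this sketch: no update recorded yet
    return cached_file.stat().st_mtime > statfile.stat().st_mtime


def demo_auto_invalidation() -> tuple[bool, bool]:
    """Check one cached page before and after a simulated content update."""
    with tempfile.TemporaryDirectory() as tmp:
        page = Path(tmp) / "index.html"
        stat = Path(tmp) / ".stat"
        page.write_text("cached")
        stat.touch()
        # Set explicit timestamps so the result does not depend on timing.
        os.utime(page, (2000, 2000))
        os.utime(stat, (1000, 1000))
        before_update = is_cache_valid(page, stat)  # page newer than statfile
        os.utime(stat, (3000, 3000))  # a content update touches the statfile
        after_update = is_cache_valid(page, stat)  # page now older than statfile
        return before_update, after_update
```

Note that the cached file is never deleted in this sketch; touching the statfile alone is enough to make the next request go back to the AEM instance.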

Again, certain points should be noted:

  • Auto-invalidation is typically used when the inter-relations are complex, for example for HTML pages. These pages contain links and navigation entries, so they usually have to be updated after a content update. If you have automatically generated PDF or picture files, you may choose to auto-invalidate those too.
  • Auto-invalidation does not involve any action by the Dispatcher at update time, except for touching the statfile. However, touching the statfile automatically renders the cache content obsolete, without physically removing it from the cache.

Using a Dispatcher with an Author Server

If you are using AEM with Touch UI, you should not cache author instance content. If caching was enabled for the author instance, you need to disable it and delete the contents of the cache directory. To disable caching, edit the author_dispatcher.any file and modify the /rules property of the /cache section as follows:

/rules
{
/0000
{ /type "deny" /glob "*"}
}

A Dispatcher can be used in front of an author instance to improve authoring performance. To configure an authoring Dispatcher, do the following:

  • Install a Dispatcher in a web server (this could be Apache, iPlanet, or IIS web server).

  • You may wish to test the newly installed Dispatcher against a working AEM publish instance, to ensure that a baseline correct install has been achieved.

  • Now make sure that the Dispatcher is able to connect via TCP/IP to your author instance.

  • Replace the Dispatcher’s sample dispatcher.any file with the author_dispatcher.any file provided below.

  • Open the author_dispatcher.any in a text editor and make the following changes:

    1. Change the /hostname and /port of the /renders section to point to your author instance.
    2. Change the /docroot of the /cache section to point to a cache directory. In case you are using AEM with Touch UI, see the warning above.
    3. Save the changes.
  • Delete all existing files in the /cache > /docroot directory which you configured above.

  • Restart the web server.

    Dispatcher Configuration Files

    By default the Dispatcher configuration is stored in the dispatcher.any text file, though you can change the name and location of this file during installation.

    The configuration file contains a series of single-valued or multi-valued properties that control the behavior of Dispatcher:

    • Property names are prefixed with a forward slash (“/”).
    • Multi-valued properties enclose child items using braces (“{ }”).

    An example configuration is structured as follows:

    # name of the dispatcher
    /name "internet-server"
    # each farm configures a set of (load-balanced) renders
    /farms
     {
      # first farm entry (label is not important, just for your convenience)
       /website
         {
         /clientheaders
           {
           # List of headers that are passed on
           }
         /virtualhosts
           {
           # List of URLs for this Web site
           }
         /sessionmanagement
           {
           # settings for user authentication
           }
         /renders
           {
           # List of AEM instances that render the documents
           }
         /filter
           {
           # List of filters
           }
         /vanity_urls
           {
           # List of vanity URLs
           }
         /cache
           {
           # Cache configuration
           /rules
             {
             # List of cachable documents
             }
           /invalidate
             {
             # List of auto-invalidated documents
             }
           }
         /statistics
           {
           /categories
             {
             # The document categories that are used for load balancing estimates
             }
           }
         /stickyConnectionsFor "/myFolder"
         /health_check
           {
           # Page gets contacted when an instance returns a 500
           }
         /retryDelay "1"
         /numberOfRetries "5"
         /unavailablePenalty "1"
         /failover "1"
         }
     }

    You can include other files that contribute to the configuration:

    • If your configuration file is large, you can split it into several smaller files (that are easier to manage) and then include them.
    • You can also include files that are generated automatically.

    For example, to include the file myFarm.any in the /farms configuration use the following code:

    /farms
      {
      $include "myFarm.any"
      }

    Use the asterisk (“*”) as a wildcard to specify a range of files to include.

    Defining Farms – /farms

    The /farms property defines one or more sets of Dispatcher behaviors, where each set is associated with different web sites or URLs. The /farms property can include a single farm or multiple farms:

    • Use a single farm when you want Dispatcher to handle all of your web pages or web sites in the same way.
    • Create multiple farms when different areas of your web site or different web sites require different Dispatcher behavior.

    The /farms property is a top-level property in the configuration structure. To define a farm, add a child property to the /farms property. Use a property name that uniquely identifies the farm within the Dispatcher instance.

    The /farmname property is multi-valued, and contains other properties that define Dispatcher behavior:

    • The URLs of the pages that the farm applies to.
    • One or more service URLs (typically of AEM publish instances) to use for rendering documents.
    • The statistics to use for load-balancing multiple document renderers.
    • Several other behaviors, such as which files to cache and where.

    The farm name can include any alphanumeric (a-z, 0-9) character. The following example shows the skeleton definition for two farms named /daycom and /docdaycom:

    #name of dispatcher
    /name "day sites"
    #farms section defines a list of farms or sites
    /farms
    {
       /daycom
       {
           ...
       }
       /docdaycom
       {
          ...
       }
    }

    If you use more than one render farm, the list is evaluated bottom-up. This is particularly relevant when defining Virtual Hosts for your websites.

     

    Each farm property can contain the following child properties:

    Property name          Description
    /homepage              Default homepage (optional, IIS only).
    /clientheaders         The headers from the client HTTP request to pass through.
    /virtualhosts          The virtual hosts for this farm.
    /sessionmanagement     Support for session management and authentication.
    /renders               The servers that provide rendered pages (typically AEM publish instances).
    /filter                Defines the URLs to which Dispatcher enables access.
    /vanity_urls           Configures access to vanity URLs.
    /propagateSyndPost     Support for the forwarding of syndication requests.
    /cache                 Configures caching behavior.
    /statistics            Defines statistic categories for load-balancing calculations.
    /stickyConnectionsFor  The folder that contains sticky documents.
    /health_check          The URL to use to determine server availability.
    /retryDelay            The delay before retrying a failed connection.
    /unavailablePenalty    Penalties that affect statistics for load-balancing calculations.
    /failover              Resend requests to different renders when the original request fails.

    Specify a Default Page (IIS Only) – /homepage

    Caution:

    This parameter is IIS only and will not have any effect in the other web servers.

    The optional /homepage parameter specifies the page that Dispatcher returns when a client requests an undeterminable page or file.

    Typically this situation occurs when a user specifies a URL for which neither IIS nor AEM provides an automatic redirection target. For example, if the AEM render instance is shut down after the content is cached, the content redirect URL is unavailable.

     

    The /homepage section is located inside the /farms section, for example:

    #name of dispatcher
    /name "day sites"
    #farms section defines a list of farms or sites
    /farms
    {
       /daycom
       {
           /homepage "/index.html"
           ...
       }
       /docdaycom
       {
          ...
       }
    }

    Specifying the HTTP Headers to Pass Through – /clientheaders

    The /clientheaders property defines a list of HTTP headers that Dispatcher passes from the client HTTP request to the renderer (AEM instance).

    By default, Dispatcher forwards the standard HTTP headers to the AEM instance. In some instances, you might want to forward additional headers, or remove specific headers:

    • Add headers, such as custom headers, that your AEM instance expects in the HTTP request.
    • Remove headers, such as authentication headers, that are only relevant to the web server.

     

    If you customize the set of headers to pass through, you must specify an exhaustive list of headers, including those that are normally included by default.

    The following code is an example configuration for /clientheaders:

    /clientheaders
      {
      "CSRF-Token"
      "X-Forwarded-Proto"
      "referer"
      "user-agent"
      "authorization"
      "from"
      "content-type"
      "content-length"
      "accept-charset"
      "accept-encoding"
      "accept-language"
      "accept"
      "host"
      "if-match"
      "if-none-match"
      "if-range"
      "if-unmodified-since"
      "max-forwards"
      "proxy-authorization"
      "proxy-connection"
      "range"
      "cookie"
      "cq-action"
      "cq-handle"
      "handle"
      "action"
      "cqstats"
      "depth"
      "translate"
      "expires"
      "date"
      "dav"
      "ms-author-via"
      "if"
      "lock-token"
      "x-expected-entity-length"
      "destination"
      "PATH"
      }

    Identifying Virtual Hosts – /virtualhosts

    The /virtualhosts property defines a list of all hostname/URI combinations that Dispatcher accepts for this farm. You can use the asterisk (“*”) character as a wildcard. Values for the /virtualhosts property use the following format:

    [scheme]host[uri][*]

    The following example configuration handles requests for the .com and .ch domains of myCompany, and all domains of mySubDivision:

    /virtualhosts
     {
     "www.myCompany.com"
     "www.myCompany.ch"
     "www.mySubDivision.*"
     }

    Resolving the Virtual Host

    When Dispatcher receives an HTTP or HTTPS request, it finds the virtual host value that best matches the host, uri, and scheme of the request. Dispatcher evaluates the values in the virtualhosts properties in the following order:

    • Dispatcher begins at the lowest farm and progresses upward in the dispatcher.any file.
    • For each farm, Dispatcher begins with the topmost value in the virtualhosts property and progresses down the list of values.

    Dispatcher finds the best-matching virtual host value in the following manner:

    • The first-encountered virtual host that matches all three of the host, the scheme, and the uri of the request is used.
    • If no virtualhosts value has scheme and uri parts that both match the scheme and uri of the request, the first-encountered virtual host that matches the host of the request is used.
    • If no virtualhosts values have a host part that matches the host of the request, the topmost virtual host of the topmost farm is used.

    Therefore, you should place your default virtual host at the top of the virtualhosts property in the topmost farm of your dispatcher.any file.

    Enabling Secure Sessions – /sessionmanagement

    /allowAuthorized must be set to “0” in the /cache section in order to enable this feature.

    Create a secure session for access to the render farm so that users need to log in to access any page in the farm. After logging in, users can access all pages in the farm. See Creating a Closed User Group for information about using this feature with CUGs.

    If sections of your website use different access requirements, you need to define multiple farms.

    /sessionmanagement has several sub-parameters, such as /directory, /encode, /header, and /timeout. An example configuration looks as follows:

    /sessionmanagement
      {
      /directory "/usr/local/apache/.sessions"
      /encode "md5"
      /header "HTTP:authorization"
      /timeout "800"
      }

    Defining Page Renderers – /renders

    The /renders property defines the URL to which Dispatcher sends requests to render a document. The following example /renders section identifies a single AEM instance for rendering:

    /renders
      {
        /myRenderer
          {
          # hostname or IP of the renderer
          /hostname "aem.myCompany.com"
          # port of the renderer
          /port "4503"
          # connection timeout in milliseconds, "0" (default) waits indefinitely
          /timeout "0"
          }
      }

    The following example /renders section identifies an AEM instance that runs on the same computer as Dispatcher:

    /renders
      {
        /myRenderer
         {
         /hostname "127.0.0.1"
         /port "4503"
         }
      }

    The following example /renders section distributes render requests equally among two AEM instances:

    /renders
      {
        /myFirstRenderer
          {
          /hostname "aem.myCompany.com"
          /port "4503"
          }
        /mySecondRenderer
          {
          /hostname "127.0.0.1"
          /port "4503"
          }
      }

    Configuring Access to Content – /filter

    Use the /filter section to specify the HTTP requests that Dispatcher accepts. All other requests are sent back to the web server with a 404 error code (page not found). If no /filter section exists, all requests are accepted.

    Note: Requests for the statfile are always rejected.

    The /filter section consists of a series of rules that either deny or allow access to content according to patterns in the request-line part of the HTTP request. You should use a whitelist strategy for your /filter section:

    • First, deny access to everything.
    • Allow access to content as needed.

    Defining a Filter

    Each item in the /filter section includes a type and a pattern that is matched with a specific element of the request line or the entire request line. Each filter can contain the following items:

    • Type: The /type indicates whether to allow or deny access for the requests that match the pattern. The value can be either allow or deny.
    • Element of the Request Line: Include /method, /url, /query, or /protocol and a pattern for filtering requests according to these specific parts of the request-line part of the HTTP request. Filtering on elements of the request line (rather than on the entire request line) is the preferred filter method.
    • glob Property: The /glob property is used to match with the entire request-line of the HTTP request.

    Example Filter: Deny All

    The following example filter section causes Dispatcher to deny requests for all files. You should deny access to all files and then allow access to specific areas.

    /0001  { /glob "*" /type "deny" }

    Requests to an explicitly denied area result in a 404 error code (page not found) being returned.

    Example Filter: Deny Access to Specific Areas

    Filters also allow you to deny access to various elements, for example ASP pages and sensitive areas within a publish instance. The following filter denies access to ASP pages:

    /0002  { /type "deny" /url "*.asp"  }

    Example Filter: Enable POST Requests

    The following example filter allows submitting form data by the POST method:

    /filter {
        /0001  { /glob "*" /type "deny" }
        /0002 { /type "allow" /method "POST" /url "/content/[.]*.form.html" }
    }

    Example Filter: Allow Access to the Workflow Console

    The following example shows a filter used to allow access to the Workflow console:

    /filter {
        /0001  { /glob "*" /type "deny" }
        /0002  {  /type "allow"  /url "/libs/cq/workflow/content/console*"  }
    }

    When multiple filter patterns apply to a request, the last filter pattern that applies is effective.
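
    The "last matching pattern wins" behavior can be illustrated with a small Python sketch. It is only an approximation: Python's fnmatch stands in for Dispatcher's glob matching (the two are not identical), the rule list is a simplified stand-in for a /filter section, and the behavior when no rule matches is an assumption made for the example.

```python
from fnmatch import fnmatchcase

# Simplified stand-in for a /filter section: rules are evaluated in order.
FILTERS = [
    {"type": "deny", "url": "*"},
    {"type": "allow", "url": "/libs/cq/workflow/content/console*"},
    {"type": "deny", "url": "*.asp"},
]


def is_allowed(method: str, url: str, filters=FILTERS) -> bool:
    """Apply every rule in order; the last rule that matches decides."""
    allowed = False  # assumption for this sketch: no matching rule means deny
    for rule in filters:
        if "method" in rule and rule["method"] != method:
            continue  # rule is restricted to another HTTP method
        if not fnmatchcase(url, rule.get("url", "*")):
            continue  # URL pattern does not match
        allowed = rule["type"] == "allow"
    return allowed
```

    With these rules, a request for /libs/cq/workflow/content/console.html is allowed by the second rule, while /libs/cq/workflow/content/console.asp is denied because the later "*.asp" rule also matches and overrides it.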

    Example Filter: Using Regular Expressions

    This filter enables extensions in non-public content directories using a regular expression, defined here between single quotes:

    /005  {  /type "allow" /extension '(css|gif|ico|js|png|swf|jpe?g)' }

    Example Filter: Filter Additional Elements of a Request URL

    One of the enhancements introduced in Dispatcher 4.2.0 is the ability to filter additional elements of the request URL. The new elements introduced are:

    • path
    • selectors
    • extension
    • suffix

    These can be configured by adding the property of the same name to the filtering rule: /path, /selectors, /extension, and /suffix respectively.

    Below is a rule example that blocks content grabbing from the /content path, using filters for path, selectors and extensions:

    /006 {
            /type "deny"
            /path "/content"
            /selectors '(feed|rss|pages|languages|blueprint|infinity|tidy)'
            /extension '(json|xml|html)'
            }

    Enabling Access to Vanity URLs – /vanity_urls

    Configure Dispatcher to enable access to vanity URLs that are configured for your CQ or AEM pages.

    When access to vanity URLs is enabled, Dispatcher periodically calls a service that runs on the render instance to obtain a list of vanity URLs. Dispatcher stores this list in a local file. When a request for a page is denied due to a filter in the /filter section, Dispatcher consults the list of vanity URLs. If the denied URL is on the list, Dispatcher allows access to the vanity URL.

    To enable access to vanity URLs, add a /vanity_urls section to the /farms section, similar to the following example:

    /vanity_urls {
         /url "/libs/granite/dispatcher/content/vanityUrls.html"
         /file "/tmp/vanity_urls"
         /delay 300
    }

    The /vanity_urls section contains the following properties:

    • /url: The path to the vanity URL service that runs on the render instance. The value of this property must be “/libs/granite/dispatcher/content/vanityUrls.html”.
    • /file: The path to the local file where Dispatcher stores the list of vanity URLs. Make sure that Dispatcher has write-access to this file.
    • /delay: (Seconds) The time between calls to the vanity URL service.

    Forwarding Syndication Requests – /propagateSyndPost

    Syndication requests are usually intended for Dispatcher only, so by default they are not sent to the renderer (for example, an AEM instance).

    If necessary, set the /propagateSyndPost property to “1” to forward syndication requests to the renderer. If set, you must make sure that POST requests are not denied in the filter section.

    Configuring the Dispatcher Cache – /cache

    The /cache section controls how Dispatcher caches documents. Configure several sub-properties to implement your caching strategies:

    • /docroot
    • /statfile
    • /serveStaleOnError
    • /allowAuthorized
    • /rules
    • /statfileslevel
    • /invalidate
    • /invalidateHandler
    • /allowedClients
    • /ignoreUrlParams
    • /headers
    • /mode

    An example cache section might look as follows:

    /cache
      {
      /docroot "/opt/dispatcher/cache"
      /statfile  "/tmp/dispatcher-website.stat"         
      /allowAuthorized "0"
          
      /rules
        {
        # List of files that are cached
        }
      /invalidate
        {
        # List of files that are auto-invalidated
        }
      }

    Specifying the Cache Directory

    The /docroot property identifies the directory where cached files are stored.

    Note:

    The value must be exactly the same path as the document root of the web server so that Dispatcher and the web server handle the same files.
    The web server is responsible for delivering the correct status code when a Dispatcher cache file is used, so it must be able to find the file as well.

    If you use multiple farms, each farm must use a different document root.

    Naming the Statfile

    The /statfile property identifies the file to use as the statfile. Dispatcher uses this file to register the time of the most recent content update. The statfile can be any file on the web server.

    The statfile has no content. When content is updated, Dispatcher updates the timestamp. The default statfile is named .stat and is stored in the docroot. Dispatcher blocks access to the statfile.

    Note:

    If /statfileslevel is configured, Dispatcher ignores the /statfile property and uses .stat as the name.

    Caching When Authentication is Used

    The /allowAuthorized property controls whether requests that contain any of the following authentication information are cached:

    • The authorization header.
    • A cookie named authorization.
    • A cookie named login-token.

    By default, requests that include this authentication information are not cached because authentication is not performed when a cached document is returned to the client. This configuration prevents Dispatcher from serving cached documents to users who do not have the necessary rights.

    However, if your requirements permit the caching of authenticated documents, set /allowAuthorized to “1”:

    /allowAuthorized "1"

     

    Note:

    To enable session management (using the /sessionmanagement property), the /allowAuthorized property must be set to “0”.

    Specifying the Documents to Cache

    The /rules property controls which documents are cached according to the document path. Regardless of the /rules property, Dispatcher never caches a document in the following circumstances:

    • The request URI contains a question mark (“?”).
      This usually indicates a dynamic page, such as a search result, that does not need to be cached.
    • The file extension is missing.
      The web server needs the extension to determine the document type (the MIME type).
    • The authentication header is set (this can be configured with /allowAuthorized).
    • The AEM instance responds with any of the following headers:
      • no-cache
      • no-store
      • must-revalidate

    Each item in the /rules property includes a glob pattern and a type:

    • The glob pattern is used to match the path of the document.
    • The type indicates whether to cache the documents that match the glob pattern. The value can be either allow (to cache the document) or deny (to always render the document).

    If you do not have dynamic pages (beyond those already excluded by the above rules), you can configure Dispatcher to cache everything. The rules section for this looks as follows:

    /rules
      {
        /0000  {  /glob "*"   /type "allow" }
      }

    If there are some sections of your page that are dynamic (for example a news application) or within a closed user group, you can define exceptions:

    Note:

    Closed user groups must not be cached as user rights are not checked for cached pages.

    /rules
      {
       /0000  { /glob "*" /type "allow" }
       /0001  { /glob "/en/news/*" /type "deny" }
       /0002  { /glob "*/private/*" /type "deny"  }  
      }

    Invalidating Files by Folder Level

    Use the /statfileslevel property to selectively invalidate cached files according to their path:

    • Dispatcher creates .stat files in each folder from the docroot folder down to the level that you specify. The docroot folder is level 0.
    • When a file is updated, Dispatcher locates the folder on the file path that is at the statfileslevel, and invalidates all files below that folder.

    Instead of invalidating all files, only the files on the same path as an updated file are invalidated.

    For example, a multi-language website uses the structure /content/myWebsite/xx/topics, where xx represents the 2-letter identifier for each language. When /statfileslevel is set to “3”, a .stat file is created in the following folders:

    • /content
    • /content/myWebsite
    • /content/myWebsite/xx (each language folder contains a .stat file)

    When a file in the /content/myWebsite/fr/topics folder is activated, the .stat file below /content/myWebsite/fr is touched. All files in the fr folder are invalidated.

    Note: If you specify a value for the /statfileslevel property, the /statfile property is ignored.
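
    The folder lookup described above can be sketched in Python. This is only an illustration of the rule as stated in this section, not Dispatcher's implementation; the function name invalidation_root is hypothetical, and paths are taken relative to the docroot (level 0).

```python
from pathlib import PurePosixPath


def invalidation_root(file_path: str, level: int) -> str:
    """Return the folder on `file_path` that sits at the given statfiles level.

    The docroot ("/") is level 0, its direct subfolders are level 1, and so on.
    If the file lies above the requested level, its own folder is returned.
    """
    # parts -> ("/", "content", "myWebsite", "fr", "topics", "page.html");
    # drop the leading "/" and the file name to keep only the folders.
    folders = PurePosixPath(file_path).parts[1:-1]
    return "/" + "/".join(folders[:level])
```

    For the example site, invalidation_root("/content/myWebsite/fr/topics/page.html", 3) yields /content/myWebsite/fr, so only the French branch is invalidated and the other language folders keep their cached files.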

    Automatically Invalidating Cached Files

    The /invalidate property defines the documents that are automatically invalidated when content is updated.

    With automatic invalidation, Dispatcher does not delete cached files after a content update, but checks their validity when they are next requested. Documents in the cache that are not auto-invalidated will remain in the cache until a content update explicitly deletes them.

    Automatic invalidation is typically used for HTML pages. HTML pages often contain links to other pages, making it difficult to determine whether a content update affects a page. To make sure that all relevant pages are invalidated when content is updated, automatically invalidate all HTML pages. The following configuration invalidates all HTML pages:

    /invalidate
    {
     /0000  { /glob "*" /type "deny" }
     /0001  { /glob "*.html" /type "allow" }
    }

    This configuration causes the following activity when the /content/geometrixx/en.html file is activated:

    • All the files with pattern en.* are removed from the /content/geometrixx/ folder.
    • The /content/geometrixx/en/jcr_content folder is removed.
    • All the other files that match the /invalidate configuration are not immediately deleted. These files are deleted when the next request occurs. In our example, /content/geometrixx.html is not deleted immediately; it is deleted when /content/geometrixx.html is next requested.

    If you offer automatically generated PDF and ZIP files for download, you might have to automatically invalidate these as well. An example configuration looks as follows:

    /invalidate
      {
       /0000 { /glob "*" /type "deny" }
       /0001 { /glob "*.html" /type "allow" }
       /0002 { /glob "*.zip" /type "allow" }
       /0003 { /glob "*.pdf" /type "allow" }
      }

    The AEM integration with Adobe Analytics delivers configuration data in an analytics.sitecatalyst.js file in your website. The example dispatcher.any file that is provided with Dispatcher includes the following invalidation rule for this file:

    {
       /glob "*/analytics.sitecatalyst.js"  /type "allow"
    }

    Using Custom Invalidation Scripts

    The /invalidateHandler property allows you to define a script which is called for each invalidation request received by Dispatcher.

    It is called with the following arguments:

    • Handle
      The content path that is invalidated
    • Action
      The replication Action (e.g. Activate, Deactivate)
    • Action Scope
      The replication Action’s Scope (empty, unless a header of CQ-Action-Scope: ResourceOnly is sent, see Invalidating Cached Pages from AEM for details)

    This can be used to cover a number of different use cases, such as invalidating other application-specific caches, or handling cases where the externalized URL of a page and its place in the docroot do not match the content path.

    The following example logs each invalidation request to a file.

    /invalidateHandler "/opt/dispatcher/scripts/invalidate.sh"

    Sample invalidation handler script:

    #!/bin/bash
    # $1 = handle, $2 = action, $3 = action scope (may be empty)
    printf "%-15s: %s %s\n" "$1" "$2" "$3" >> /opt/dispatcher/logs/invalidate.log

    Limiting the Clients That Can Flush the Cache

    The /allowedClients property defines specific clients that are allowed to flush the cache. The globbing patterns are matched against the IP.

    The following example:

    1. Denies access to any client.
    2. Explicitly allows access to the localhost.

    /allowedClients
      {
       /0001 { /glob "*.*.*.*"  /type "deny" }
       /0002 { /glob "127.0.0.1" /type "allow" }
      }

     

    It is recommended that you define the /allowedClients property.

    If this is not done, any client can issue a call to clear the cache; if this is done repeatedly it can severely impact the site performance.

    Ignoring URL Parameters

    The /ignoreUrlParams section defines which URL parameters are ignored when determining whether a page is cached or delivered from cache:

    • When a request URL contains parameters that are all ignored, the page is cached.
    • When a request URL contains one or more parameters that are not ignored, the page is not cached.

    When a parameter is ignored for a page, the page is cached the first time that the page is requested. Subsequent requests for the page are served the cached page, regardless of the value of the parameter in the request.

    To specify which parameters are ignored, add glob rules to the ignoreUrlParams property:

    • To ignore a parameter, create a glob property that allows the parameter.
    • To prevent the page to be cached, create a glob property that denies the parameter.

    The following example causes Dispatcher to ignore the “q” parameter, so that request URLs that include the q parameter are cached:

    /ignoreUrlParams
    {
        /0001 { /glob "*" /type "deny" }
        /0002 { /glob "q" /type "allow" }
    }

    Using the example ignoreUrlParams value, the following HTTP request causes the page to be cached because the q parameter is ignored:

    GET /mypage.html?q=5

    Using the example ignoreUrlParams value, the following HTTP request causes the page to not be cached because the p parameter is not ignored:

    GET /mypage.html?q=5&p=4

     For information about glob properties, see Designing Patterns for glob Properties.
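The caching decision described above can be sketched in plain Java. This is only an illustration under stated assumptions, not Dispatcher code: the class and method names are hypothetical, the glob support is limited to the `*` wildcard, and it assumes that rules are applied in order with the last matching rule deciding.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical sketch of the ignoreUrlParams decision; not Dispatcher source code.
final class IgnoreUrlParamsSketch {

    /** One rule of the form { /glob "<pattern>" /type "allow" | "deny" }. */
    static final class Rule {
        final String glob;
        final boolean allow; // "allow" means the parameter is ignored

        Rule(String glob, boolean allow) {
            this.glob = glob;
            this.allow = allow;
        }
    }

    /** Translates a glob that only uses '*' into a regular expression. */
    static Pattern toPattern(String glob) {
        String[] parts = glob.split("\\*", -1);
        StringBuilder regex = new StringBuilder();
        for (int i = 0; i < parts.length; i++) {
            if (i > 0) {
                regex.append(".*");
            }
            regex.append(Pattern.quote(parts[i]));
        }
        return Pattern.compile(regex.toString());
    }

    /** Assumption: rules are applied in order and the last matching rule wins. */
    static boolean isIgnored(List<Rule> rules, String paramName) {
        boolean ignored = false;
        for (Rule rule : rules) {
            if (toPattern(rule.glob).matcher(paramName).matches()) {
                ignored = rule.allow;
            }
        }
        return ignored;
    }

    /** A URL stays cacheable only if every query parameter name is ignored. */
    static boolean isCacheable(List<Rule> rules, String url) {
        int q = url.indexOf('?');
        if (q < 0 || q == url.length() - 1) {
            return true; // no parameters at all
        }
        for (String pair : url.substring(q + 1).split("&")) {
            String name = pair.split("=", 2)[0];
            if (!isIgnored(rules, name)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Same rules as the example: deny everything, then allow "q".
        List<Rule> rules = Arrays.asList(new Rule("*", false), new Rule("q", true));
        System.out.println(isCacheable(rules, "/mypage.html?q=5"));     // true
        System.out.println(isCacheable(rules, "/mypage.html?q=5&p=4")); // false
    }
}
```

The two calls in `main` mirror the two GET requests shown above: the first URL carries only the ignored q parameter, the second also carries p, which the deny rule leaves un-ignored.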

    Caching HTTP Response Headers

    Note:

    This feature is available with version 4.1.11 of the Dispatcher.

    The /headers property allows you to define the HTTP header types that are going to be cached by the Dispatcher. On the first request to an uncached resource, all headers matching one of the configured values (see the configuration sample below) are stored in a separate file, next to the cache file. On subsequent requests to the cached resource, the stored headers are added to the response.

    Presented below is a sample from the default configuration:

    /cache {
      ...
      /headers {
        "Cache-Control"
        "Content-Disposition"
        "Content-Type"
        "Expires"
        "Last-Modified"
        "X-Content-Type-Options"
      }
    }

    Note:

    File globbing characters are not allowed in the /headers values. For more details, see Designing Patterns for glob Properties.

    Dispatcher Cache File Permissions

    The mode property specifies what file permissions are applied to new directories and files in the cache. This setting is restricted by the umask of the calling process. It is an octal number constructed from the sum of one or more of the following values:

    • 0400 Allow read by owner.
    • 0200 Allow write by owner.
    • 0100 Allow the owner to search in directories.
    • 0040 Allow read by group members.
    • 0020 Allow write by group members.
    • 0010 Allow group members to search in the directory.
    • 0004 Allow read by others.
    • 0002 Allow write by others.
    • 0001 Allow others to search in the directory.

    The default value is 0755 which allows the owner to read, write or search and the group and others to read or search.
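As a worked example of the arithmetic, the following Java sketch adds up the permission bits listed above and shows how a umask restricts the result. The class and method names are hypothetical, and the umask value 0027 is only an illustrative assumption.

```java
// Hypothetical sketch: summing permission bits and applying a umask.
final class CacheModeSketch {

    /** Permissions that remain after the process umask clears its bits. */
    static int effectiveMode(int mode, int umask) {
        return mode & ~umask & 0777;
    }

    public static void main(String[] args) {
        // Owner: read + write + search; group and others: read + search.
        int mode = 0400 + 0200 + 0100 + 0040 + 0010 + 0004 + 0001;
        System.out.println(Integer.toOctalString(mode));                      // 755

        // With an assumed umask of 0027, group write and all "others" bits are cleared.
        System.out.println(Integer.toOctalString(effectiveMode(mode, 0027))); // 750
    }
}
```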

    Configuring Time Based Cache Invalidation – /enableTTL

    If set, the enableTTL property evaluates the response headers from the backend. If they contain a Cache-Control max-age or an Expires date, an auxiliary, empty file is created next to the cache file, with a modification time equal to the expiry date. When the cached file is requested past the modification time, it is automatically re-requested from the backend.

    You can enable the feature by adding this line to the dispatcher.any file:

    /enableTTL "1"

    Note:

    This feature is available with version 4.1.11 of the Dispatcher.
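To make the expiry calculation concrete, here is a minimal Java sketch of how an expiry time could be derived from the two headers named above. It is not Dispatcher code: the names are hypothetical, and giving max-age precedence over Expires is an assumption borrowed from general HTTP caching rules.

```java
import java.time.Instant;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of TTL-based invalidation; not Dispatcher source code.
final class TtlSketch {

    private static final Pattern MAX_AGE = Pattern.compile("max-age\\s*=\\s*(\\d+)");

    /**
     * Derives the expiry time from the response headers.
     * Assumption: Cache-Control max-age takes precedence over Expires.
     */
    static Optional<Instant> expiryFor(String cacheControl, String expires, Instant fetchedAt) {
        if (cacheControl != null) {
            Matcher m = MAX_AGE.matcher(cacheControl);
            if (m.find()) {
                return Optional.of(fetchedAt.plusSeconds(Long.parseLong(m.group(1))));
            }
        }
        if (expires != null) {
            try {
                // Expires uses the HTTP date format, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
                return Optional.of(
                        ZonedDateTime.parse(expires, DateTimeFormatter.RFC_1123_DATE_TIME).toInstant());
            } catch (DateTimeParseException e) {
                return Optional.empty(); // unparseable date: no TTL information
            }
        }
        return Optional.empty();
    }

    /** The cached file is re-requested once the current time is past the expiry. */
    static boolean mustRefetch(Instant expiry, Instant now) {
        return now.isAfter(expiry);
    }

    public static void main(String[] args) {
        Instant fetchedAt = Instant.parse("2024-01-01T00:00:00Z");
        Instant expiry = expiryFor("public, max-age=300", null, fetchedAt).get();
        System.out.println(expiry);                                          // 2024-01-01T00:05:00Z
        System.out.println(mustRefetch(expiry, fetchedAt.plusSeconds(301))); // true
    }
}
```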

    Configuring Load Balancing – /statistics

    The /statistics section defines categories of files for which Dispatcher scores the responsiveness of each render. Dispatcher uses the scores to determine which render to send a request to.

    Each category that you create defines a glob pattern. Dispatcher compares the URI of the requested content to these patterns to determine the category of the requested content:

    • The order of the categories determines the order in which they are compared to the URI.
    • The first category pattern that matches the URI is the category of the file. No more category patterns are evaluated.

    Dispatcher supports a maximum of 8 statistics categories. If you define more than 8 categories, only the first 8 are used.

    Render Selection

    Each time Dispatcher requires a rendered page, it uses the following algorithm to select the render:

    1. If the request contains the render name in a renderid cookie, Dispatcher uses that render.
    2. If the request includes no renderid cookie, Dispatcher compares the render statistics:
      1. Dispatcher determines the category of the request URI.
      2. Dispatcher determines which render has the lowest response score for that category, and selects that render.
    3. If no render is selected yet, use the first render in the list.

    The score for a render’s category is based on previous response times, as well as previous failed and successful connections that Dispatcher attempts. For each attempt, the score for the category of the requested URI is updated.

    Note:

    If you do not use load balancing, you can omit this section.
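The selection steps listed above can be sketched as follows. This is a simplified illustration with hypothetical names, not the Dispatcher implementation: scores are passed in already resolved for the request's category, a `null` score stands for "no statistics yet", and an unknown renderid cookie value is assumed to fall through to the statistics comparison.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of render selection; not Dispatcher source code.
final class RenderSelectionSketch {

    /**
     * @param renderIdCookie value of the renderid cookie, or null if the request has none
     * @param scores         score per render for the request's category, in farm order;
     *                       a null score means no statistics were collected yet
     */
    static String selectRender(String renderIdCookie, Map<String, Long> scores) {
        if (scores.isEmpty()) {
            throw new IllegalArgumentException("the farm defines no renders");
        }
        // Step 1: a renderid cookie pins the request to that render.
        if (renderIdCookie != null && scores.containsKey(renderIdCookie)) {
            return renderIdCookie;
        }
        // Step 2: otherwise pick the render with the lowest score for the category.
        String best = null;
        long bestScore = Long.MAX_VALUE;
        for (Map.Entry<String, Long> entry : scores.entrySet()) {
            Long score = entry.getValue();
            if (score != null && score < bestScore) {
                best = entry.getKey();
                bestScore = score;
            }
        }
        if (best != null) {
            return best;
        }
        // Step 3: nothing selected yet, so use the first render in the list.
        return scores.keySet().iterator().next();
    }

    public static void main(String[] args) {
        Map<String, Long> scores = new LinkedHashMap<>();
        scores.put("rend01", 120L);
        scores.put("rend02", 80L);
        System.out.println(selectRender(null, scores));     // rend02
        System.out.println(selectRender("rend01", scores)); // rend01
    }
}
```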

    Defining Statistics Categories

    Define a category for each type of document for which you want to keep statistics for render selection. The /statistics section contains a /categories section. To define a category, add a line below the /categories section that has the following format:

    /name { /glob "pattern" }

    The category name must be unique to the farm. The pattern is described in the Designing Patterns for glob Properties section.

    To determine the category of a URI, Dispatcher compares the URI with each category pattern until a match is found. Dispatcher begins with the first category in the list and continues in order. Therefore, place categories with more specific patterns first.

    For example, the default dispatcher.any file defines an HTML category and an others category. The HTML category is more specific and so it appears first:

    /statistics
      {
      /categories
        {
          /html { /glob "*.html" }
          /others  { /glob "*" }
        }
      }

    The following example also includes a category for search pages:

    /statistics
      {
      /categories
        {
          /search { /glob "*search.html" }
          /html { /glob "*.html" }
          /others  { /glob "*" }
        }
      }

    Reflecting Server Unavailability in Dispatcher Statistics

    The /unavailablePenalty property sets the time (in tenths of a second) that is applied to the render statistics when a connection to the render fails. Dispatcher adds the time to the statistics category that matches the requested URI.

    For example, the penalty is applied when the TCP/IP connection to the designated hostname/port cannot be established, either because AEM is not running (and not listening) or because of a network-related problem.

    The /unavailablePenalty property is a direct child of the /farm section (a sibling of the /statistics section).

    If no /unavailablePenalty property exists, a value of “1” is used.

    /unavailablePenalty "1"

     

    Identifying a Sticky Connection Folder – /stickyConnectionsFor

    The /stickyConnectionsFor property defines one folder that contains sticky documents; this will be accessed using the URL. Dispatcher sends all requests, from a single user, that are in this folder to the same render instance. Sticky connections ensure that session data is present and consistent for all documents. This mechanism uses the renderid cookie.

    The following example defines a sticky connection to the /products folder:

    /stickyConnectionsFor "/products"

    When a page is composed of content from several content nodes, include the /paths property that lists the paths to the content. For example, a page contains content from /content/image, /content/video, and /var/files/pdfs. The following configuration enables sticky connections for all content on the page:

    /stickyConnections {
      /paths {
        "/content/image"
        "/content/video"
        "/var/files/pdfs"
      }
    }

    Handling Render Connection Errors

    Configure Dispatcher behavior when the render server returns a 500 error, or is unavailable.

    Specifying a Health Check Page

    Use the /health_check property to specify a URL that is checked when a 500 status code occurs. If this page also returns a 500 status code the instance is considered to be unavailable and a configurable time penalty (/unavailablePenalty) is applied to the render before retrying.

    /health_check
      {
      # Page gets contacted when an instance returns a 500
      /url "/health_check.html"
      }

    Specifying the Page Retry Delay

    The /retryDelay property sets the time (in seconds) that Dispatcher waits between rounds of connection attempts with the farm renders. For each round, the maximum number of times Dispatcher attempts a connection to a render is the number of renders in the farm.

    Dispatcher uses a value of “1” if /retryDelay is not explicitly defined. The default value is appropriate in most cases.

    /retryDelay "1"

    Configuring the Number of Retries

    The /numberOfRetries property sets the maximum number of rounds of connection attempts that Dispatcher performs with the renders. If Dispatcher cannot successfully connect to a render after this number of retries, Dispatcher returns a failed response.

    For each round, the maximum number of times Dispatcher attempts a connection to a render is the number of renders in the farm. Therefore, the maximum number of times that Dispatcher attempts a connection is (/numberOfRetries) x (the number of renders).

    “5” is the default value used if not explicitly defined.

    /numberOfRetries "5"
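The interaction of /numberOfRetries, /retryDelay and the number of renders can be sketched as a pair of nested loops. This is an illustration only: the names are hypothetical, the connection attempt is passed in as a callback, and the delay is given in milliseconds so the example runs instantly.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of the retry rounds; not Dispatcher source code.
final class RetryRoundsSketch {

    /** Upper bound on connection attempts: (number of rounds) x (number of renders). */
    static int maxAttempts(int numberOfRetries, int renderCount) {
        return numberOfRetries * renderCount;
    }

    /** Returns the first render that accepts a connection, or null if every round fails. */
    static String connect(List<String> renders, int numberOfRetries,
                          long retryDelayMillis, Predicate<String> tryConnect) {
        for (int round = 0; round < numberOfRetries; round++) {
            // Wait between rounds, not before the first one.
            if (round > 0 && !pause(retryDelayMillis)) {
                return null; // interrupted while waiting
            }
            // Within a round, each render is tried at most once.
            for (String render : renders) {
                if (tryConnect.test(render)) {
                    return render;
                }
            }
        }
        return null;
    }

    private static boolean pause(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        int[] attempts = {0};
        // Two renders that always refuse the connection, five rounds, no delay.
        String render = connect(Arrays.asList("rend01", "rend02"), 5, 0L,
                name -> { attempts[0]++; return false; });
        System.out.println(render + " after " + attempts[0] + " attempts"); // null after 10 attempts
    }
}
```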

    Using the Failover Mechanism

    Enable the failover mechanism on your Dispatcher farm to resend requests to different renders when the original request fails. When failover is enabled, Dispatcher has the following behavior:

    • When a request to a render returns HTTP status 503 (UNAVAILABLE), Dispatcher sends the request to a different render.
    • When a request to a render returns HTTP status 50x (other than 503), Dispatcher sends a request for the page that is configured for the health_check property.
      • If the health check returns 500 (INTERNAL_SERVER_ERROR), Dispatcher sends the original request to a different render.
      • If the health check returns HTTP status 200, Dispatcher returns the initial HTTP 500 error to the client.

    To enable failover, add the following line to the farm (or website):

    /failover "1"

    Note:

    To retry HTTP requests that contain a body, Dispatcher sends an Expect: 100-continue request header to the render before spooling the actual contents. CQ 5.5 with CQSE then immediately answers with either 100 (CONTINUE) or an error code. Other servlet containers should support this as well.
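The failover behavior listed in this section can be summarized as a small decision function. The sketch below uses hypothetical names and is not Dispatcher code; the text only specifies health check results of 500 and 200, so treating every other health check status like 200 is an assumption.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch of the failover decision; not Dispatcher source code.
final class FailoverSketch {

    enum Action { RETURN_RESPONSE, TRY_OTHER_RENDER, RETURN_ORIGINAL_ERROR }

    /**
     * @param status      HTTP status the render returned for the original request
     * @param healthCheck requests the configured health check page and returns its
     *                    status; only called for 50x responses other than 503
     */
    static Action decide(int status, IntSupplier healthCheck) {
        if (status == 503) {
            return Action.TRY_OTHER_RENDER;
        }
        if (status / 10 == 50) { // 50x other than 503
            // Assumption: any health check result other than 500 counts as healthy.
            return healthCheck.getAsInt() == 500
                    ? Action.TRY_OTHER_RENDER
                    : Action.RETURN_ORIGINAL_ERROR;
        }
        return Action.RETURN_RESPONSE;
    }

    public static void main(String[] args) {
        System.out.println(decide(503, () -> 200)); // TRY_OTHER_RENDER
        System.out.println(decide(500, () -> 500)); // TRY_OTHER_RENDER
        System.out.println(decide(500, () -> 200)); // RETURN_ORIGINAL_ERROR
    }
}
```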

    Ignoring Interruption Errors – /ignoreEINTR

    Caution:

    This option is not usually needed. You only need to use this when you see the following log messages:

        Error while reading response: Interrupted system call

    Any file-system-oriented system call can be interrupted with EINTR if the object of the system call is located on a remote system accessed via NFS. Whether these system calls can time out or be interrupted depends on how the underlying file system was mounted on the local machine.

    Use the /ignoreEINTR parameter if your instance has such a configuration and the log contains the following message:

    Error while reading response: Interrupted system call

    Internally, Dispatcher reads the response from the remote server (i.e. AEM) using a loop that can be represented as:

    while (response not finished) {
    read more data
    }

    Such messages can be generated when the EINTR occurs in the “read more data” section and are caused by the reception of a signal before any data was received.

    To ignore such interrupts you can add the following parameter to dispatcher.any (before /farms):

    /ignoreEINTR "1"

    Setting /ignoreEINTR to “1” causes Dispatcher to continue to attempt to read data until the complete response is read. The default value is 0 and deactivates the option.


AEM Workflows

Workflows enable you to automate Experience Manager activities. Workflows consist of a series of steps that are executed in a specific order. Each step performs a distinct activity such as activating a page or sending an email message. Workflows can interact with assets in the repository, user accounts, and Experience Manager services. Therefore, workflows can coordinate complicated activities that involve any aspect of Experience Manager.

Basics of AEM Workflow Model

Creating a workflow lets a user define and execute a series of steps. In AEM, a workflow definition is called a workflow model. The basic terms used in an AEM workflow model are described below.

Model : It is made up of WorkflowNodes and WorkflowTransitions. Workflow models are versioned. Running Workflow Instances keep the initial workflow model version that is set when the workflow is started.

Step : There are different types of workflow steps:

  • Participant (User/Group): This type of step generates a work item and assigns it to a user or group. A user must complete the work item to advance the workflow.
  • Process (Script, Java method call): This type of step is executed automatically by the system. An ECMA script or Java class implements the step.
  • Container (Sub Workflow): This type of step starts another workflow model.
  • OR Split/Join: Uses logic to decide which step to execute next in the workflow.
  • AND Split/Join: Executes multiple steps simultaneously.

All the steps share the following common properties: Autoadvance and Timeout alerts (scriptable).

 

Transition : Defines the link between two consecutive steps.

WorkItem : A workflow instance can have one or many WorkItems at the same time (depending on the workflow model).

Payload : References the resource that has to be advanced through a workflow. The payload implementation references a resource in the repository (by either a path or a UUID), a resource identified by a URL, or a serialized Java object. Referencing a resource in the repository is very flexible and, in conjunction with Sling, very productive: for example, the referenced node could be rendered as a form.

Lifecycle : A workflow instance is created when a new workflow is started and ends when the end node is processed. The following actions are possible on a workflow instance:

  • Terminate
  • Suspend
  • Resume
  • Restart

Completed and terminated instances are archived.

Inbox : Each logged-in user has their own workflow inbox in which the assigned WorkItems are accessible.

Launcher : Allows you to define a workflow to be launched if a specific node has been updated.

A newly created workflow model consists of three default steps: Flow Start, Flow End, and a dummy participant step named Step 1.

  • Flow Start and Flow End represent the start and end of the workflow.
  • The Step 1 participant step is assigned to the admin user so that a work step can be configured. You can edit or delete this step and add new steps as required.

Custom WorkFlow

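import org.apache.commons.mail.Email;
import org.apache.commons.mail.SimpleEmail;
import org.apache.felix.scr.annotations.Component;
import org.apache.felix.scr.annotations.Properties;
import org.apache.felix.scr.annotations.Property;
import org.apache.felix.scr.annotations.Reference;
import org.apache.felix.scr.annotations.Service;
import org.osgi.framework.Constants;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Granite workflow API; the older com.day.cq.workflow packages offer the same interfaces
import com.adobe.granite.workflow.WorkflowException;
import com.adobe.granite.workflow.WorkflowSession;
import com.adobe.granite.workflow.exec.WorkItem;
import com.adobe.granite.workflow.exec.WorkflowProcess;
import com.adobe.granite.workflow.metadata.MetaDataMap;
import com.day.cq.mailer.MessageGateway;
import com.day.cq.mailer.MessageGatewayService;

// This is a component so it can provide or consume services
@Component
@Service
@Properties({
    @Property(name = Constants.SERVICE_DESCRIPTION, value = "Test Email workflow process implementation."),
    @Property(name = Constants.SERVICE_VENDOR, value = "Adobe"),
    @Property(name = "process.label", value = "Test Email Workflow Process") })
public class CustomStep implements WorkflowProcess {

    /** Default log. */
    protected final Logger log = LoggerFactory.getLogger(this.getClass());

    // Inject a MessageGatewayService
    @Reference
    private MessageGatewayService messageGatewayService;

    @Override
    public void execute(WorkItem item, WorkflowSession wfsession, MetaDataMap args) throws WorkflowException {
        try {
            log.info("Here in execute method"); // confirms that the execute method is invoked

            // Set up the email message and its values
            Email email = new SimpleEmail();
            email.addTo("tblue@nomailserver.com");
            email.addCc("wblue@nomailserver.com");
            email.setSubject("AEM Custom Step");
            email.setFrom("scottm@adobe.com");
            email.setMsg("This message is to inform you that the CQ content has been deleted");

            // Look up the gateway that handles Email messages
            MessageGateway<Email> messageGateway = messageGatewayService.getGateway(Email.class);

            // The gateway can be null, for example when the mail service is not configured
            if (messageGateway == null) {
                log.error("No MessageGateway available for Email; check the mail service configuration");
                return;
            }
            messageGateway.send(email);
        } catch (Exception e) {
            // Log through SLF4J instead of printing the stack trace to stderr
            log.error("Failed to send the workflow notification email", e);
        }
    }
}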

AEM Architecture

Fig – AEM technology stack

The AEM technology stack can be divided into the following layers:

  1. Java Runtime Environment (commonly known as JRE)
  2. Granite Platform
  3. AEM modules (Adobe Experience Manager)
  4. Custom Application Module (your website)

Java Runtime Environment [JRE]:

AEM is essentially a collection of JARs, JSPs (JavaServer Pages), servlets, and Java classes, along with static resources such as HTML files, pictures, and other assets. Running it requires a JRE.

This makes AEM compatible with any OS that supports the required JRE.

Granite Platform:

The Granite platform plays a key role in the AEM stack. Granite is Adobe’s open web stack.

As the figure above shows, the Granite platform consists of the following modules:

  1. CQSE Servlet Engine
  2. CRX Content Repository
  3. Sling Content Delivery
  4. OSGi Framework

CQSE Servlet Engine

AEM requires an application server that supports Java Servlet API 2.4 or later. The AEM software package is available in two forms:

  1. cq-quickstart.jar: A standalone executable JAR that includes everything needed to get up and running, including a built-in servlet engine. As the name “standalone” suggests, you can simply double-click the JAR file to install an AEM instance; no dedicated external application server is required for servlet handling. In this case you only need a JRE and the standalone quickstart JAR file.
  2. cq-quickstart.war: For deployment in a third-party application server. The WAR file does not contain a built-in servlet engine. In this case you need a JRE, the WAR file, and a third-party application server for servlet handling.

CRX Content Repository

Everything in AEM is stored as nodes and properties in the built-in CRX content repository. CRX is a content repository of type JCR. The JCR specification combines features of relational databases and file systems, allowing fine-grained access to the content repository in both a file-system fashion and a database fashion.

Some terms used here:

  • CRX (Content Repository Xtreme): CRX is the repository built into AEM.
  • JSR (Java Specification Request): JSRs are the formal documents that describe proposed specifications and technologies to be added to the Java platform. There are hundreds of JSRs. CRX is Adobe’s implementation of JSR-283.
  • JCR (Content Repository API for Java): The specification for accessing a content repository through a Java API in a uniform manner. JCR is therefore a “type of repository”, and CRX is an implementation of JCR. Apache Jackrabbit is another JCR implementation.

To summarize, CRX is a content repository of type JCR, that is, an implementation of JSR-283.

The Java Content Repository (JCR) standard, JSR 283, specifies a vendor-independent and implementation-independent way to access content bi-directionally on a granular level within a content repository.

Sling Content Delivery

Sling is a web application framework based on REST principles. It allows easy development of content-oriented applications. AEM is based on Sling. Sling uses a repository of type JCR, such as CRX or Apache Jackrabbit.

“Using Sling, the type of content to be rendered is not the first processing consideration. Instead the main consideration is whether the URL resolves to a content object for which a script can then be found to perform the rendering. This provides excellent support for web content authors to build pages which are easily customized to their requirements.”  

The advantages of this content-first approach are significant when you have a wide range of different content elements, or when you need easily customizable pages.

OSGi Framework (Open Service Gateway Initiative)

OSGi (Open Service Gateway Initiative) is a Java framework for developing and deploying modular software programs and libraries. OSGi is a modular system that implements dynamic components and applications in the form of bundles. AEM is based on OSGi and can be thought of as a conglomeration of bundles (components). All the bundles in AEM can be managed from the web console.

A bundle is a JAR file holding Java classes and a special metadata file (in the META-INF subfolder). Applications or components that come in the form of bundles can be installed, started, stopped, updated, and uninstalled without requiring a reboot. Each bundle (component or application) is a tightly coupled, dynamically loadable collection of classes, JARs, and configuration files that explicitly declares its external dependencies.

The OSGi framework, the elements of AEM, and any additional custom applications on top of the AEM platform are implemented as OSGi bundles.

AEM Modules:

AEM runs on the Granite platform. AEM is a complete package of the modules listed below, built on the Granite platform within the OSGi framework.

  • Websites
  • Mobile Applications
  • Digital Publications
  • Forms
  • Digital Assets
  • Communities
  • Online Commerce

Customers can leverage these application-level building blocks to create customized solutions by building applications of their own.

Custom Application Module (your website):

Customers can build their own customized application modules on top of AEM. The underlying technology stack empowers customers to take advantage of AEM features such as flexibility, simplicity in the management and delivery of websites, content, and assets, and reduced complexity of delivering online experiences to the right customers.

Sling Servlet Registration AEM 6.3

To register a servlet the following properties play a vital role.

  1. sling.servlet.paths: A list of absolute paths under which the servlet is accessible as a Resource. The property value must either be a single String, an array of Strings or a Vector of Strings.

A servlet using this property might be ignored unless its path is included in the Execution Paths (servletresolver.paths) setting of the Apache Sling Servlet/Script Resolver and Error Handler configuration (the SlingServletResolver service).

 

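import java.io.IOException;

import javax.servlet.Servlet;

import org.apache.sling.api.SlingHttpServletRequest;
import org.apache.sling.api.SlingHttpServletResponse;
import org.apache.sling.api.servlets.HttpConstants;
import org.apache.sling.api.servlets.SlingSafeMethodsServlet;
import org.osgi.framework.Constants;
import org.osgi.service.component.annotations.Component;

@Component(service = Servlet.class,
    property = {
        Constants.SERVICE_DESCRIPTION + "=Simple Demo Servlet",
        "sling.servlet.methods=" + HttpConstants.METHOD_GET,
        "sling.servlet.paths=" + "/bin/servlet",
        "sling.servlet.extensions=" + "sample"
    })
public class ResolveServletUsingPath extends SlingSafeMethodsServlet {

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response) throws IOException {
        // Request handling goes here
    }
}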

 

A servlet service registered with these properties is available under the path /bin/servlet.

Note: In the above example, the servlet is registered by path only, so the registration properties sling.servlet.methods and sling.servlet.extensions are ignored.

  2. sling.servlet.resourceTypes: The resource type(s) supported by the servlet. The property value must either be a single String, an array of Strings or a Vector of Strings.

Note: Either this property (sling.servlet.resourceTypes) or the sling.servlet.paths property must be set, or the servlet is ignored. If both are set, the servlet is registered in both ways.

Fig – Register the servlet using resource type
@Component(service = Servlet.class,
    property = {
        Constants.SERVICE_DESCRIPTION + "=Simple Demo Servlet",
        "sling.servlet.methods=" + HttpConstants.METHOD_GET,
        "sling.servlet.resourceTypes=" + "community-components/components/componentpage",
        "sling.servlet.extensions=" + "sample"
    })
public class MyServlet extends SlingSafeMethodsServlet {

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response) throws IOException {
    }
}

  3. sling.servlet.selectors: The request URL selectors supported by the servlet. The selectors must be configured as they would be specified in the URL, that is, as a list of dot-separated strings such as print.a4. The property value must either be a single String, an array of Strings or a Vector of Strings. This property is only considered for the registration with sling.servlet.resourceTypes.

@Component(service = Servlet.class,
    property = {
        Constants.SERVICE_DESCRIPTION + "=Simple Demo Servlet",
        "sling.servlet.methods=" + HttpConstants.METHOD_GET,
        "sling.servlet.resourceTypes=" + "community-components/components/componentpage",
        "sling.servlet.selectors=" + "img",
        "sling.servlet.selectors=" + "tab"
    })
public class MyServlet extends SlingSafeMethodsServlet {

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response) throws IOException {
    }
}

The request can be:

http://localhost:4502/content/community-components/en/tagcloud/jcr:content.img.json

http://localhost:4502/content/community-components/en/tagcloud/jcr:content.tab.json

  4. sling.servlet.extensions: The request URL extensions supported by the servlet for requests. The property value must either be a single String, an array of Strings or a Vector of Strings. This property is only considered for the registration with sling.servlet.resourceTypes.
  5. sling.servlet.methods: The request methods supported by the servlet. This property is only considered for the registration with sling.servlet.resourceTypes. If this property is missing, the value defaults to GET and HEAD, regardless of which methods are actually implemented/handled by the servlet.
  6. sling.servlet.prefix: The prefix or numeric index to make relative paths absolute. If the value of this property is a number (int), it defines the index of the search path entries from the resource resolver to be used as the prefix.

The defined search path is used as a prefix to mount this servlet. The number can be -1, which always points to the last search path entry. If the specified value is higher than the highest index of the search paths, the last entry is used. The index starts at 0. If the value of this property is a string that can be parsed as a number, it is treated as a number. If this property is not specified, it defaults to the default configuration of the Sling servlet resolver.

So if:

  • prefix=0 or prefix=/apps/, the servlet accepts requests for the relative resource type resolved under “/apps”, because /apps is the first search path entry (index 0).
  • prefix=1 or prefix=/libs/, the servlet accepts requests for the relative resource type resolved under “/libs”.
  • prefix=-1 (the default scenario), the servlet accepts requests for the relative resource type resolved under “/libs”, because /libs is the last search path entry.
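The prefix rules above can be expressed as a short, self-contained Java sketch. It is not Sling code: the class and method names are hypothetical, the search path list `/apps/`, `/libs/` is the one used in the bullets, and mapping every negative index (not only -1) to the last entry is an assumption.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of sling.servlet.prefix resolution; not Sling source code.
final class ServletPrefixSketch {

    /**
     * Resolves the prefix value to an absolute path prefix.
     * Returns null when the value is unusable, in which case the resolver's
     * own default configuration would apply.
     */
    static String resolvePrefix(String prefix, List<String> searchPaths) {
        if (prefix == null) {
            return null; // property not specified
        }
        String value = prefix.trim();
        try {
            int index = Integer.parseInt(value); // numeric strings are treated as numbers
            if (searchPaths.isEmpty()) {
                return null;
            }
            int last = searchPaths.size() - 1;
            // -1 (assumed: any negative value) and too-high indexes select the last entry.
            return searchPaths.get(index < 0 || index > last ? last : index);
        } catch (NumberFormatException notNumeric) {
            // A non-numeric value is only usable as a literal absolute path prefix.
            return value.startsWith("/") ? value : null;
        }
    }

    /** Makes a relative resource type absolute by prepending the resolved prefix. */
    static String mount(String resolvedPrefix, String resourceType) {
        if (resourceType.startsWith("/")) {
            return resourceType; // already absolute
        }
        return resolvedPrefix.endsWith("/")
                ? resolvedPrefix + resourceType
                : resolvedPrefix + "/" + resourceType;
    }

    public static void main(String[] args) {
        List<String> searchPaths = Arrays.asList("/apps/", "/libs/");
        System.out.println(resolvePrefix("0", searchPaths));  // /apps/
        System.out.println(resolvePrefix("-1", searchPaths)); // /libs/
        System.out.println(mount(resolvePrefix("7", searchPaths),
                "community-components/components/componentpage"));
        // /libs/community-components/components/componentpage
    }
}
```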


Note: Binding servlets by paths is discouraged.
Always try to register a servlet using resourceTypes instead of paths. You can use selectors and extensions to uniquely identify a servlet.

@Component(service = Servlet.class,
    property = {
        Constants.SERVICE_DESCRIPTION + "=Simple Demo Servlet",
        "sling.servlet.methods=" + HttpConstants.METHOD_GET,
        "sling.servlet.resourceTypes=" + "sling/servlet/default",
        "sling.servlet.selectors=" + "data",
        "sling.servlet.extensions=" + "sample"
    })
public class MyServlet extends SlingSafeMethodsServlet {

    @Override
    protected void doGet(SlingHttpServletRequest request, SlingHttpServletResponse response) throws IOException {
    }
}

The request can be:

http://{server host:server port}/{any path}.data.sample

The value of “sling.servlet.resourceTypes” is set to “sling/servlet/default” so that the servlet handles requests for any path. It can be regarded as a default resource type.

There is an OSGi configuration named Apache Sling Servlet/Script Resolver and Error Handler in the Felix console.

Fig – Servlet/Script resolver configuration

There are four options in this configuration:

  • Servlet Registration Root Path: If the servlet does not define a prefix, the prefix value is taken from this configuration.
  • Cache Size: This property configures the size of the cache used for script resolution. To see the scripts that are being cached:
    1. Go to the Felix console.
    2. Go to Sling -> Script Cache Status.

Fig – Check Sling Cache Status in felix console

The cache of script resolution can be seen here:

Fig – Check Cached Script in felix console

  • Execution Paths: The paths to search for executable scripts. Every script whose path starts with one of the configured paths is allowed. If no path is specified, this is treated like / (root), which allows all scripts. If a path is added without a trailing /, only that exact path is allowed.
  • Default Extension: The list of extensions for which the default behavior will be used.
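The Execution Paths rule described in the list above can be sketched as a simple check. The class and method names are hypothetical and the example paths are only illustrative; the logic follows the three cases stated in the text (no paths configured, a path ending with /, and a path without a trailing /).

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of the Execution Paths check; not Sling source code.
final class ExecutionPathsSketch {

    static boolean isAllowed(String scriptPath, List<String> executionPaths) {
        if (executionPaths.isEmpty()) {
            return true; // no configuration is treated like "/", which allows everything
        }
        for (String configured : executionPaths) {
            boolean matches = configured.endsWith("/")
                    ? scriptPath.startsWith(configured) // subtree match
                    : scriptPath.equals(configured);    // exact match only
            if (matches) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> paths = Arrays.asList("/bin/", "/apps/demo/servlet");
        System.out.println(isAllowed("/bin/servlet", paths));             // true
        System.out.println(isAllowed("/apps/demo/servlet", paths));       // true
        System.out.println(isAllowed("/apps/demo/servlet/extra", paths)); // false
        System.out.println(isAllowed("/anything", Collections.<String>emptyList())); // true
    }
}
```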

The Felix console also provides a Sling servlet resolver test, which shows the servlet that a particular request resolves to.

Fig – Servlet Resolver option in felix 
Fig – Check servlet is resolving or not

Link Checker AEM 6.3

AEM External link checker:-

The AEM link checker is based on an event handler and is triggered on creates and updates for /content and its child nodes. All content under the selected root path is parsed and its links are validated. All link validation is done asynchronously in the background, and the HTML is updated based on the verification results.

Note:- If you have a huge repository (/content) in which links are updated frequently, using the link checker is not advised because of performance issues: it is triggered periodically and traverses the whole repository to validate links, which may slow down your author instance.

Now let's see how the AEM link checker works:-

As soon as an author saves any link on a page, either using the RTE or any custom component, the link checker event handler is triggered.

The link checker event handler traverses the /content node and checks for new or updated links. Once found, it stores that mapping under the /var/linkchecker cache folder.

Control then passes to the Day CQ Link Checker Service, which checks the scheduler.period configuration. Once the scheduler period has elapsed, it triggers the scheduler to validate the syntax and structure of the links against the given configuration, such as the special prefixes to ignore during validation and the pattern that the link checker should use to verify the syntax of the URL.

Once the syntax is validated, the results are pushed to /etc/linkchecker.html. The links remain in the pending state until the Day CQ Link Checker Task scheduler validates them by making an Ajax GET call. The Day CQ Link Checker Task scheduler runs periodically to check the validity of the valid and invalid links that are stored under /etc/linkchecker.html.

An administrator can configure how often this scheduler runs by updating the Scheduler Period property; its default value is 3600 seconds. Once triggered, it removes all the invalid or unreachable links from /etc/linkchecker.html (http://localhost:4502/etc/linkchecker.html).

The AEM link checker is configured using the following four services:-

  • Day CQ Link Checker Info Storage Service – configures the link cache size. The default is 500.
  • Day CQ Link Checker Service – configures the frequency of the background check. The default interval is 5 seconds.
  • Day CQ Link Checker Task – configures the frequency of the background check for validating links.
  • Day CQ Link Checker Transformer – configures all the elements that need to be transformed and rewritten by the link checker.

AEM internal link checker: Internal links are validated as soon as a content author adds an internal link (a repository link, for example /content/we-retail/ca) to a page, using either the RTE or a custom component. After validation, links that are no longer valid are removed on the publisher or shown as broken links on the author.

Fixing broken links that the link checker is not able to validate:

Sometimes you might run into a situation where a link is not available on publish even though it is valid. This can happen because the AEM link checker checks links automatically and does not publish a link that it considers broken. This self-monitoring is usually helpful, because it prevents you from publishing broken links, but it becomes a problem when you know that the link is correct and AEM still treats it as broken.

There are two types of links for which the link checker requires configuration before it can validate them:

  • Links that have a special prefix (for example, href="tel:123-123-1234" or href="*|something|*").
  • Links that, after post-processing, carry a query parameter and that you want to mark as always valid or exclude from validation.
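
The prefix case can be illustrated with a minimal sketch. This is not the link checker's own code: the class name, the method name, and the prefix list are assumptions made up for the example, and it only shows the idea of skipping validation for links that start with a configured prefix.

```java
import java.util.List;

// Minimal sketch: decide whether a link should bypass validation.
// LinkPrefixFilter and skipsValidation are hypothetical names, not AEM APIs.
final class LinkPrefixFilter {

    private final List<String> specialPrefixes;

    LinkPrefixFilter(List<String> specialPrefixes) {
        this.specialPrefixes = List.copyOf(specialPrefixes);
    }

    /** Returns true when the href starts with one of the configured prefixes. */
    boolean skipsValidation(String href) {
        if (href == null) {
            return false;
        }
        for (String prefix : specialPrefixes) {
            if (href.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // The prefixes below are sample values taken from the two href examples.
        LinkPrefixFilter filter = new LinkPrefixFilter(List.of("tel:", "*|"));
        System.out.println(filter.skipsValidation("tel:123-123-1234"));      // true
        System.out.println(filter.skipsValidation("*|something|*"));         // true
        System.out.println(filter.skipsValidation("/content/we-retail/ca")); // false
    }
}
```

A repository link such as /content/we-retail/ca matches no prefix, so in this sketch it would still go through normal validation.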

 

Replication in AEM 6.3

Replication agents are central to Adobe Experience Manager (AEM) as the mechanism used to:

  • Publish (activate) content from an author to a publish environment.
  • Explicitly flush content from the Dispatcher cache.
  • Return user input (for example, form input) from the publish environment to the author environment (under control of the author environment).

Requests are queued to the appropriate agent for processing.

User data (users, user groups, and user profiles) are not replicated between author and publish instances.

For multiple publish instances, user data is Sling distributed when User Synchronization is enabled.

Replicating from Author to Publish

Replication, to a publish instance or dispatcher, takes place in several steps:

  • the author requests that certain content be published (activated); this can be initiated by a manual request, or by automatic triggers which have been preconfigured.
  • the request is passed to the appropriate default replication agent; an environment can have several default agents which will always be selected for such actions.
  • the replication agent “packages” the content and places it in the replication queue.
  • in the Websites tab the colored status indicator is set for the individual pages.
  • the content is lifted from the queue and transported to the publish environment using the configured protocol; usually this is HTTP.
  • a servlet in the publish environment receives the request and publishes the received content; the default servlet is http://localhost:4503/bin/receive.
  • multiple author and publish environments can be configured.

Replicating from Publish to Author

Some features allow users to enter data on a publish instance.

In some cases, a type of replication known as reverse replication is needed to return this data to the author environment, from where it is redistributed to other publish environments. Due to security considerations, any traffic from the publish to the author environment must be strictly controlled.

Reverse replication uses an agent in the publish environment which references the author environment. This agent places the data into an outbox. This outbox is matched with replication listeners in the author environment. The listeners poll the outboxes to collect any data entered and then distribute it as necessary. This ensures that the author environment controls all traffic.
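
The pull model described above can be pictured with a toy in-memory sketch. It does not use any AEM API; the class and method names are hypothetical and exist only to show that the publish side parks data in an outbox while the author side initiates every transfer by polling.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Toy in-memory model of reverse replication. All names are hypothetical.
final class ReverseReplicationModel {

    /** Publish side: user input is parked in an outbox, never pushed to author. */
    static final class Outbox {
        private final Queue<String> items = new ArrayDeque<>();

        void add(String contentPath) {
            items.add(contentPath);
        }

        /** Hands over everything collected so far and empties the outbox. */
        List<String> drain() {
            List<String> drained = new ArrayList<>(items);
            items.clear();
            return drained;
        }
    }

    /** Author side: a listener polls each outbox, so author initiates all traffic. */
    static List<String> poll(List<Outbox> outboxes) {
        List<String> collected = new ArrayList<>();
        for (Outbox outbox : outboxes) {
            collected.addAll(outbox.drain());
        }
        return collected;
    }

    public static void main(String[] args) {
        Outbox publish1 = new Outbox();
        Outbox publish2 = new Outbox();
        publish1.add("/content/usergenerated/comments/form1/post-a");
        publish2.add("/content/usergenerated/comments/form1/post-b");
        // Author collects the entries from both publish instances in one poll.
        System.out.println(poll(List.of(publish1, publish2)));
    }
}
```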

In other cases, such as for Communities features (for example, forums, blogs, comments, and reviews), the amount of user generated content (UGC) being entered in the publish environment is difficult to synchronize efficiently across AEM instances using replication.

AEM Communities never uses replication for UGC.  Instead, the deployment for Communities requires a common store for UGC.

Replication – Out of the Box

 

To follow this example and use the default replication agents, you need to install AEM with:

  • the author environment on port 4502
  • the publish environment on port 4503

Enabled by default :

  • Agents on author : Default Agent (publish)

Effectively disabled by default (as of AEM 6.1) :

  • Agents on author  : Reverse Replication Agent (publish_reverse)
  • Agents on publish : Reverse Replication (outbox)

Replication Agents – Out of the Box

The following agents are available in a standard AEM installation:

  • Default Agent
    Used for replicating from author to publish.
  • Dispatcher Flush
    This is used for managing the Dispatcher cache.
  • Reverse Replication
    Used for replicating from publish to author.  Reverse replication is not used for Communities features, such as forums, blogs, and comments.  It is effectively disabled as the outbox is not enabled.  Use of reverse replication would require custom configuration.
  • Static Agent
    This is an “Agent that stores a static representation of a node into the filesystem”.
    For example with the default settings, content pages and dam assets are stored under /tmp, either as HTML or the appropriate asset format.
    This was requested so that when the page is requested directly from the application server the content can be seen. This is a specialized agent and (probably) will not be required for most instances.

Replication Agents – Configuration Parameters

When configuring a replication agent from the Tools console, several tabs are available within the dialog:

Settings

  • Name : A unique name for the replication agent.

  • Description : A description of the purpose this replication agent will serve.

  • Enabled : Indicates whether the replication agent is currently enabled.

    When the agent is enabled the queue will be shown as:

    • Active when items are being processed.
    • Idle when the queue is empty.
    • Blocked when items are in the queue, but cannot be processed; for example, when the receiving queue is disabled.
  • Serialization Type : The type of serialization:

    • Default: Set if the agent is to be automatically selected.
    • Dispatcher Flush: Select this if the agent is to be used for flushing the dispatcher cache.
  • Retry Delay : The delay (waiting time in milliseconds) between two retries, should a problem be encountered.

    Default: 60000

  • Agent User Id : Depending on the environment, the agent will use this user account to:

    • collect and package the content from the author environment
    • create and write the content on the publish environment

    Leave this field empty to use the system user account (the account defined in sling as the administrator user; by default this is admin).

     

    Caution:

    For an agent on the author environment this account must have read access to all paths that you want to have replicated.

    For an agent on the publish environment this account must have the create/write access required to replicate the content.

     

    Note:

    This can be used as a mechanism for selecting specific content for replication.

  • Log Level : Specifies the level of detail to be used for log messages.

    • Error: only errors will be logged
    • Info: errors, warnings and other informational messages will be logged
    • Debug: a high level of detail will be used in the messages, primarily for debug purposes

    Default: Info

  • Use for reverse replication : Indicates whether this agent will be used for reverse replication; returns user input from the publish to author environment.

  • Alias update : Selecting this option enables alias or vanity path invalidation requests to Dispatcher.
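
The three queue states listed under Enabled can be summarized in a small sketch. The enum, the method, and the two inputs are assumptions chosen for illustration; AEM derives the state internally and does not expose this function.

```java
// Sketch of the queue states shown for an enabled agent.
// QueueStatus, State and of(...) are hypothetical names.
final class QueueStatus {

    enum State { ACTIVE, IDLE, BLOCKED }

    /**
     * @param queuedItems number of items currently waiting in the queue
     * @param canProcess  false when items cannot be delivered, for example
     *                    because the receiving queue is disabled
     */
    static State of(int queuedItems, boolean canProcess) {
        if (queuedItems < 0) {
            throw new IllegalArgumentException("queuedItems must not be negative");
        }
        if (queuedItems == 0) {
            return State.IDLE;
        }
        return canProcess ? State.ACTIVE : State.BLOCKED;
    }

    public static void main(String[] args) {
        System.out.println(of(0, true));  // IDLE
        System.out.println(of(3, true));  // ACTIVE
        System.out.println(of(3, false)); // BLOCKED
    }
}
```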

Transport

  • URI

    This specifies the receiving servlet at the target location. In particular, you can specify the hostname (or alias) and context path to the target instance here.

    For example: http://localhost:4503/bin/receive (the default receiving servlet on a local publish instance)

    The protocol specified here (HTTP or HTTPS) will determine the transport method.

    For Dispatcher Flush agents, the URI property is used only if you use path-based virtualhost entries to differentiate between farms. In that case, use this field to target the farm to invalidate. For example, farm #1 has a virtual host of http://www.mysite.com/path1/* and farm #2 has a virtual host of http://www.mysite.com/path2/*. You can use a URL of /path1/invalidate.cache to target the first farm and /path2/invalidate.cache to target the second farm.

  • User

    User name of the account to be used for accessing the target.

  • Password

    Password for the account to be used for accessing the target.

  • NTLM Domain

    Domain for NTLM authentication.

  • NTLM Host

    Host for NTLM authentication.

  • Enable relaxed SSL

    Enable if you want self-certified SSL certificates to be accepted.

  • Allow expired certs

    Enable if you want expired SSL certificates to be accepted.
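
The farm-targeting URI described under URI above can be assembled as in the following sketch. It is a plain string-handling helper, not part of AEM; the class and method names are hypothetical, and the host and farm paths are the sample values from the text.

```java
import java.net.URI;

// Sketch: build the flush URI for a farm selected by a path-based virtual host.
// FlushUriSketch and flushUri are hypothetical names.
final class FlushUriSketch {

    /** Removes leading and trailing slashes so segments can be joined safely. */
    private static String trimSlashes(String value) {
        int start = 0;
        int end = value.length();
        while (start < end && value.charAt(start) == '/') {
            start++;
        }
        while (end > start && value.charAt(end - 1) == '/') {
            end--;
        }
        return value.substring(start, end);
    }

    static URI flushUri(String dispatcherBase, String farmPath) {
        String farm = trimSlashes(farmPath);
        if (farm.isEmpty()) {
            throw new IllegalArgumentException("farmPath must name a farm, e.g. /path1");
        }
        String base = dispatcherBase.replaceAll("/+$", "");
        return URI.create(base + "/" + farm + "/invalidate.cache");
    }

    public static void main(String[] args) {
        URI farm1 = flushUri("http://www.mysite.com", "/path1");
        URI farm2 = flushUri("http://www.mysite.com", "/path2");
        System.out.println(farm1); // http://www.mysite.com/path1/invalidate.cache
        System.out.println(farm2); // http://www.mysite.com/path2/invalidate.cache
        // The scheme of the URI (http or https) determines the transport method.
        System.out.println(farm1.getScheme()); // http
    }
}
```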

Proxy

The following settings are only needed if a proxy is needed:

  • Proxy Host

    Hostname of the proxy used for transport.

  • Proxy Port

    Port of the proxy.

  • Proxy User

    User name of the account to be used.

  • Proxy Password

    Password of the account to be used.

  • Proxy NTLM Domain

    The proxy NTLM domain.

  • Proxy NTLM Host

    The proxy NTLM host.

Extended

  • Interface

    Here you can define the socket interface to bind to.

    This sets the local address to be used when creating connections. If this is not set, the default address will be used. This is useful for specifying the interface to use on multi-homed or clustered systems.

  • HTTP Method

    The HTTP method to be used.

    For a Dispatcher Flush agent this is nearly always GET and should not be changed (POST would be another possible value).

  • HTTP Headers

    These are used for Dispatcher Flush agents and specify elements that must be flushed.

    For a Dispatcher Flush agent the three standard entries should not need changing:

    • CQ-Action:{action}
    • CQ-Handle:{path}
    • CQ-Path:{path}

    These are used, as appropriate, to indicate the action to be used when flushing the handle or path. The sub-parameters are dynamic:

    • {action} indicates a replication action
    • {path} indicates a path

    They are substituted by the path/action relevant to the request and therefore do not need to be “hardcoded”.

    Note:

    If you have installed AEM in a context other than the recommended default context, then you will need to register the context in the HTTP Headers. For example:
    CQ-Handle:/<yourContext>{path}

  • Close Connection

    Enable to close the connection after each request.

  • Connect Timeout

    Timeout (in milliseconds) to be applied when trying to establish a connection.

  • Socket Timeout

    Timeout (in milliseconds) to be applied when waiting for traffic after a connection has been established.

  • Protocol Version

    Version of the protocol; for example 1.0 for HTTP/1.0.
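
The placeholder substitution described under HTTP Headers can be sketched as follows. This illustrates the idea only and is not the replication agent's implementation: the class and method names are hypothetical, and the action value and paths are sample inputs.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch: resolve {action} and {path} in "Name:value" header templates.
// FlushHeaderSketch and resolve are hypothetical names.
final class FlushHeaderSketch {

    static Map<String, String> resolve(List<String> templates, String action, String path) {
        Map<String, String> headers = new LinkedHashMap<>();
        for (String template : templates) {
            int colon = template.indexOf(':');
            if (colon < 0) {
                throw new IllegalArgumentException("Missing ':' in header template: " + template);
            }
            String name = template.substring(0, colon).trim();
            String value = template.substring(colon + 1).trim()
                    .replace("{action}", action)
                    .replace("{path}", path);
            headers.put(name, value);
        }
        return headers;
    }

    public static void main(String[] args) {
        List<String> standard = List.of("CQ-Action:{action}", "CQ-Handle:{path}", "CQ-Path:{path}");
        // "Activate" is an assumed sample action value.
        System.out.println(resolve(standard, "Activate", "/content/we-retail/ca"));
        // prints {CQ-Action=Activate, CQ-Handle=/content/we-retail/ca, CQ-Path=/content/we-retail/ca}

        // With a non-default context, the context is part of the template itself.
        System.out.println(resolve(List.of("CQ-Handle:/myctx{path}"), "Activate", "/content/we-retail/ca"));
        // prints {CQ-Handle=/myctx/content/we-retail/ca}
    }
}
```

Because the placeholders are resolved per request, the same three templates serve every flush regardless of which page or action triggered it.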

Triggers

These settings are used to define triggers for automated replication:

  • Ignore default

    If checked, the agent is excluded from default replication; this means it will not be used if a content author issues a replication action.

  • On Modification

    Here a replication by this agent will be automatically triggered when a page is modified. This is mainly used for Dispatcher Flush agents, but also for reverse replication.

  • On Distribute

    If checked, the agent will automatically replicate any content that is marked for distribution when it is modified.

  • On-/Offtime reached

    This will trigger automatic replication (to activate or deactivate a page as appropriate) when the ontimes or offtimes defined for a page occur. This is primarily used for Dispatcher Flush agents.

  • On Receive

    If checked, the agent will chain replicate whenever receiving replication events.

  • No Status Update

    When checked the agent will not force a replication status update.

  • No Versioning

    When checked the agent will not force versioning of activated pages.
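
As a rough mental model, the trigger settings decide which events an agent reacts to. The sketch below is a simplification under that assumption: the event names, classes, and methods are hypothetical, and the No Status Update and No Versioning options are left out because they change what a replication does rather than when it runs.

```java
import java.util.EnumSet;
import java.util.Set;

// Simplified model of the Triggers tab; every name here is hypothetical.
final class TriggerSketch {

    enum Event { AUTHOR_REQUEST, MODIFICATION, DISTRIBUTE, ON_OFF_TIME, RECEIVE }

    static final class AgentTriggers {
        private final boolean ignoreDefault;
        private final Set<Event> automatic = EnumSet.noneOf(Event.class);

        AgentTriggers(boolean ignoreDefault, Set<Event> automaticTriggers) {
            this.ignoreDefault = ignoreDefault;
            this.automatic.addAll(automaticTriggers);
        }

        /** Returns true when this agent would be used for the given event. */
        boolean reactsTo(Event event) {
            if (event == Event.AUTHOR_REQUEST) {
                // "Ignore default" excludes the agent from author-issued replication.
                return !ignoreDefault;
            }
            return automatic.contains(event);
        }
    }

    public static void main(String[] args) {
        // An agent with "Ignore default" and "On Modification" checked.
        AgentTriggers agent = new AgentTriggers(true, EnumSet.of(Event.MODIFICATION));
        System.out.println(agent.reactsTo(Event.AUTHOR_REQUEST)); // false
        System.out.println(agent.reactsTo(Event.MODIFICATION));   // true
        System.out.println(agent.reactsTo(Event.RECEIVE));        // false
    }
}
```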

Configuring your Replication Agents

Controlling Access to Replication Agents

Access to the pages used to configure the replication agents can be controlled by using user and/or group page permissions on the /etc/replication node.

Note:

Setting such permissions will not affect users replicating content (e.g. from the Websites console or sidekick option). The replication framework does not use the “user session” of the current user to access replication agents when replicating pages.

Caution:

    Do not use the “Test Connection” link for the Reverse Replication Outbox on a publish instance.

    If a replication test is performed for an Outbox queue, any items that are older than the test replication will be re-processed with every reverse replication.

    If such items already exist in a queue, they can be found with the following XPath JCR query and should be removed.

    /jcr:root/var/replication/outbox//*[@cq:repActionType='TEST']

How do I use reverse replication and what’s necessary to make sure that it works?

Out of the box, only cq:Page nodes are reverse replicated. For any other node type, you must use one of the last two methods listed below, as a project-specific implementation.

There are three possibilities:

  • Use the SlingPostServlet (that is, do not create any custom post servlets or POST.jsp to handle the incoming requests) so that it implicitly triggers a related PageEvent. Then set a property named “cq:distribute” to “true” on the nodes you want to reverse replicate.
    • To implement this solution, it’s unnecessary to write any code. You can use the Form component to set all the necessary hidden fields.
  • Use your own code that accesses the repository and modifies the properties “cq:lastModified”, “cq:lastModifiedBy” and “cq:distribute”.
    • Posted data can be controlled, internal code writes the data.
    • To implement this solution, it’s necessary to write the code for your project.
  • Use your own code that calls the replicate method from Replicator service with options to use distribution mode.
    • Replication is controlled from your code.
    • To implement this solution, write the code specific for your project.

Use your own code to implement a reverse replication solution

  • Add the following code to fire the event related to the page you want to reverse replicate (the example below was extracted from the sample PostDataServlet.java):
...
  // mark the page content node as modified and flag it for reverse replication
  Node pageContainer = newCommentPage.getContentResource().adaptTo(Node.class);
  pageContainer.setProperty("cq:lastModified", Calendar.getInstance());
  pageContainer.setProperty("cq:lastModifiedBy", session.getUserID());
  pageContainer.setProperty("cq:distribute", true);
...
  // persist the changes so that the modification event is raised
  session.save();
...

 

Attached is an example using a component to render the form and display the previous post. For each post, it creates a subpage that contains a paragraph with text in it. By doing so, it ensures that each post can be managed separately (and avoids collision with posts that could be generated from other publish instances). The storage location is defined as a parameter in the component dialog (that is, /content/usergenerated/comments/form1, which you can create using a folder in the siteadmin).

On the author instance, you can define a workflow model that would be launched when a page is created below your comments page.

Make sure that you clear the cq:distribute value in your workflow if you reactivate the content from author to publish; otherwise, it ends up in an endless loop.

On the publish instance, make sure that the user has sufficient rights to create content. If you test with the anonymous user, change the rights accordingly for the given JCR path using CRX Explorer.

Note on replication

For replication to work properly, store data according to the following rules:
(1) the replicated (root) node’s nodetype must extend nt:hierarchyNode
(2) all direct child nodes that are not nt:hierarchyNodes must be aggregated
(3) the subtrees of all nodes from (2), apart from nodetypes, must be aggregated

Adobe recommends using the cq:Page (/jcr:content) as the container for your data, as you can then easily manage it and use it with the user interface (siteadmin, and so on). You can use the PageManager API to create the page.

 

Note:

Certain terms related to publishing can be confused:

  • Publish / Unpublish
    These are the primary terms for the actions that make your content publicly available on your publish environment (or not).
  • Activate / Deactivate
    These terms are synonymous with publish/unpublish.
  • Replicate / Replication
    These are the technical terms describing the movement of data (e.g. page content, files, code, user comments) from one environment to another such as when publishing or reverse-replicating user comments.

Note:

If you do not have the required privileges for publishing a specific page:

  • A workflow will be triggered to notify the appropriate person of your request to publish.
  • This workflow may have been customized by your development team.
  • A message will be displayed briefly to notify you that the workflow was triggered.

Depending on your location, you can publish:

  • From the page editor
  • From the sites console

From Page Editor

Depending on whether the page has references that need publishing:

  • The page will be published directly if there are no references to be published.
  • If the page has references that need publishing, these will be listed in the Publish wizard, where you can either:
    • Specify which of the assets/tags/etc. you want to publish together with the page, then use Publish to complete the process.
    • Use Cancel to abort the action.

Note:

Publishing from the editor is a shallow publish; that is, only the selected pages are published, and their child pages are not.

From Sites Console

In the sites console there are two options for publishing:

  • Quick Publish
  • Manage Publication

Quick Publish

Quick Publish is for simple cases and publishes the selected page(s) immediately without any further interaction. Because of this, any non-published references will also be published automatically.

Note:

Quick Publish is a shallow publish; that is, only the selected pages are published, and their child pages are not.

Manage Publication

Manage Publication offers more options than Quick Publish, allowing for the inclusion of child pages, customization of the references, and starting any applicable workflows as well as offering the option to publish at a later date.