Generate sitemaps with sitemapgen4j using Spring Batch

This post is about automatically generating sitemaps. I chose this topic because it is fresh in my mind: I have recently started using sitemaps for pickat.sg. After some research I came to the conclusion this would be a good thing – at the time of posting, Google had 3,171 URLs indexed for the website (it has been live for 3 months now), whereas after generating sitemaps there were 87,818 URLs submitted. I am curious how many will get indexed after that…

So because I didn't want to introduce over 80k URLs manually, I had to come up with an automated solution. Because the Pickat mobile app was developed with Java and Spring, it was easy for me to select sitemapgen4j.

As I refer to articles from others, you may see different methods; please focus on the logic.

Maven dependency

Check out the latest version here:

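At the time of writing, the dependency looked roughly like the snippet below (the version shown is only an example; check Maven Central for the current one):

<dependency>
    <groupId>com.github.dfabulich</groupId>
    <artifactId>sitemapgen4j</artifactId>
    <version>1.0.6</version>
</dependency>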

The podcasts from pickat.sg have an update frequency (DAILY, WEEKLY, MONTHLY, TERMINATED, UNKNOWN) associated with them, so it made sense to organize sub-sitemaps to make use of the lastMod and changeFreq properties accordingly. This way you can modify the lastMod of the daily sitemap in the sitemap index without modifying the lastMod of the monthly sitemap, and the Google bot doesn't need to check the monthly sitemap every day.

Generation of sitemap

Method: createSitemapForPodcastsWithFrequency – generates one sitemap file

/**
 * Creates sitemap for podcasts/episodes with update frequency
 *
 * @param  updateFrequency  update frequency of the podcasts
 * @param  sitemapsDirectoryPath the location where the sitemap will be generated
 */
public void createSitemapForPodcastsWithFrequency(
        UpdateFrequencyType updateFrequency, String sitemapsDirectoryPath)  throws MalformedURLException {
    //number of URLs counted
    int nrOfURLs = 0;
    File targetDirectory = new File(sitemapsDirectoryPath);
    WebSitemapGenerator wsg = WebSitemapGenerator.builder("http://www.podcastpedia.org", targetDirectory)
                                .fileNamePrefix("sitemap_" + updateFrequency.toString()) // name of the generated sitemap
                                .gzip(true) //recommended - as it decreases the file's size significantly
                                .build();
    //read the reachable podcasts, with their episodes, that have the given update frequency from the database
    List<Podcast> podcasts = readDao.getPodcastsAndEpisodeWithUpdateFrequency(updateFrequency);
    for(Podcast podcast : podcasts) {
        String url = "http://www.podcastpedia.org" + "/podcasts/" + podcast.getPodcastId() + "/" + podcast.getTitleInUrl();
        WebSitemapUrl wsmUrl = new WebSitemapUrl.Options(url)
                                    .lastMod(podcast.getPublicationDate()) // date of the last published episode
                                    .priority(0.9) //high priority, just below the start page which has the default priority of 1
                                    .changeFreq(changeFrequencyFromUpdateFrequency(updateFrequency))
                                    .build();
        wsg.addUrl(wsmUrl);
        nrOfURLs++;
        for(Episode episode : podcast.getEpisodes() ){
            url = "http://www.podcastpedia.org" + "/podcasts/" + podcast.getPodcastId() + "/" + podcast.getTitleInUrl()
                    + "/episodes/" + episode.getEpisodeId() + "/" + episode.getTitleInUrl();
            //build websitemap url
            wsmUrl = new WebSitemapUrl.Options(url)
                            .lastMod(episode.getPublicationDate()) //publication date of the episode
                            .priority(0.8) //high priority but smaller than podcast priority
                            .changeFreq(changeFrequencyFromUpdateFrequency(UpdateFrequencyType.TERMINATED)) //a published episode does not change anymore
                            .build();
            wsg.addUrl(wsmUrl);
            nrOfURLs++;
        }
    }
    // One sitemap can contain a maximum of 50,000 URLs.
    if(nrOfURLs <= 50000){
        wsg.write();
    } else {
        // in this case multiple sitemap files are created, plus a sitemap_index.xml file describing them
        // (that index is ignored - the global sitemap index is generated separately below)
        wsg.write();
        wsg.writeSitemapsWithIndex();
    }
}
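The helper changeFrequencyFromUpdateFrequency used above is not listed in the post. A minimal sketch of it, assuming a straightforward mapping from the application's UpdateFrequencyType to sitemapgen4j's ChangeFreq enum (ChangeFreq lives in the same com.redfin.sitemapgenerator package as WebSitemapGenerator), could look like this:

//minimal sketch (assumption) - maps the application's update frequency to sitemapgen4j's ChangeFreq
private ChangeFreq changeFrequencyFromUpdateFrequency(UpdateFrequencyType updateFrequency) {
    switch (updateFrequency) {
        case DAILY:      return ChangeFreq.DAILY;
        case WEEKLY:     return ChangeFreq.WEEKLY;
        case MONTHLY:    return ChangeFreq.MONTHLY;
        case TERMINATED: return ChangeFreq.NEVER;  //terminated podcasts/episodes no longer change
        default:         return ChangeFreq.YEARLY; //UNKNOWN - assume it changes rarely
    }
}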

The generated file contains URLs to podcasts and episodes, with changeFreq and lastMod set accordingly.
Snippet from the generated sitemap_MONTHLY.xml:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <lastmod>2013-07-05T17:01+02:00</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.9</priority>
  </url>
  <url>
    <lastmod>2013-07-05T17:01+02:00</lastmod>
    <changefreq>never</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <lastmod>2013-03-11T15:40+01:00</lastmod>
    <changefreq>never</changefreq>
    <priority>0.8</priority>
  </url>
  .....
</urlset>

Generation of sitemap index

After sitemaps are generated for all update frequencies, a sitemap index is generated to list all the sitemaps. This file will be submitted to Google Webmaster Tools.
Method: createSitemapIndexFile

/**
 * Creates a sitemap index from all the files from the specified directory excluding the test files and sitemap_index.xml files
 *
 * @param  sitemapsDirectoryPath the location where the sitemap index will be generated
 */
public void createSitemapIndexFile(String sitemapsDirectoryPath) throws MalformedURLException {
    File targetDirectory = new File(sitemapsDirectoryPath);
    // the sitemap index will be written to sitemap_index.xml in the target directory
    File outFile = new File(sitemapsDirectoryPath + "/sitemap_index.xml");
    SitemapIndexGenerator sig = new SitemapIndexGenerator("http://www.podcastpedia.org", outFile);
    //get all the files from the specified directory
    File[] files = targetDirectory.listFiles();
    for(int i=0; i < files.length; i++){
        boolean isNotSitemapIndexFile = !files[i].getName().startsWith("sitemap_index") && !files[i].getName().startsWith("test");
        if(isNotSitemapIndexFile){
            SitemapIndexUrl sitemapIndexUrl = new SitemapIndexUrl("http://www.podcastpedia.org/" + files[i].getName(), new Date(files[i].lastModified()));
            sig.addUrl(sitemapIndexUrl);
        }
    }
    sig.write();
}

The process is quite simple – the method looks in the folder where the sitemap files were created and generates a sitemap index from these files, setting each entry's lastmod value to the time the corresponding file was last modified (see the SitemapIndexUrl created in the loop).
Et voilà sitemap_index.xml:

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <lastmod>2013-08-01T07:24:38.450+02:00</lastmod>
  </sitemap>
  <sitemap>
    <lastmod>2013-08-01T07:25:01.347+02:00</lastmod>
  </sitemap>
  <sitemap>
    <lastmod>2013-08-01T07:25:10.392+02:00</lastmod>
  </sitemap>
  <sitemap>
    <lastmod>2013-08-01T07:26:33.067+02:00</lastmod>
  </sitemap>
  <sitemap>
    <lastmod>2013-08-01T07:24:53.957+02:00</lastmod>
  </sitemap>
</sitemapindex>

If you liked this, please show your support by helping us with Podcastpedia.org.
We promise to only share high quality podcasts and episodes.

Source code

  • SitemapService.zip – the archive contains the interface and class implementation for the methods described in the post

Batch Job Approach

Eventually, I ended up doing something similar to the first suggestion. However, instead of generating the sitemap every time the URL is accessed, I ended up generating the sitemap from a batch job.

With this approach, I get to schedule how often the sitemap is generated. And because generation happens outside of an HTTP request, I can afford a longer time for it to complete.

Having previous experience with the framework, Spring Batch was my obvious choice. It provides a framework for building batch jobs in Java. Spring Batch works with the idea of “chunk processing” wherein huge sets of data are divided and processed as chunks.

I then searched for a Java library for writing sitemaps and came up with SitemapGen4j. It provides an easy-to-use API and is released under the Apache License 2.0.

Requirements

My requirements are simple: I have a couple of static web pages which can be hard-coded to the sitemap. I also have pages for each place submitted to the web site; each place is stored as a single row in the database and is identified by a unique ID. There are also pages for each registered user; similar to the places, each user is stored as a single row and is identified by a unique ID.

A job in Spring Batch is composed of 1 or more “steps”. A step encapsulates the processing needed to be executed against a set of data.

I identified 4 steps for my job:

  • Add static pages to the sitemap
  • Add place pages to the sitemap
  • Add profile pages to the sitemap
  • Write the sitemap XML to a file

Step 1

Because it does not involve processing a set of data, my first step can be implemented directly as a simple Tasklet:

public class StaticPagesInitializerTasklet implements Tasklet {
  private static final Logger logger = LoggerFactory.getLogger(StaticPagesInitializerTasklet.class);
  private final String rootUrl;
  @Inject
  private WebSitemapGenerator sitemapGenerator;
  public StaticPagesInitializerTasklet(String rootUrl) {
    this.rootUrl = rootUrl;
  }
  @Override
  public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
    logger.info("Adding URL for static pages...");
    sitemapGenerator.addUrl(rootUrl);
    sitemapGenerator.addUrl(rootUrl + "/terms");
    sitemapGenerator.addUrl(rootUrl + "/privacy");
    sitemapGenerator.addUrl(rootUrl + "/attribution");
    logger.info("Done.");
    return RepeatStatus.FINISHED;
  }
  public void setSitemapGenerator(WebSitemapGenerator sitemapGenerator) {
    this.sitemapGenerator = sitemapGenerator;
  }
}

The starting point of a Tasklet is the execute() method. Here, I add the URLs of the known static pages of CheckTheCrowd.com.
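The tasklet is referenced by name later in the job configuration. A minimal sketch of how it could be declared as a bean, assuming the same Environment and PROP_NAME_ROOT_URL property used for the WebSitemapGenerator bean further down, might be:

//sketch (assumption) - expose the tasklet as a bean so the job definition can reference it by name
@Bean
public StaticPagesInitializerTasklet staticPagesInitializerTasklet() {
  return new StaticPagesInitializerTasklet(environment.getRequiredProperty(PROP_NAME_ROOT_URL));
}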

Step 2

The second step requires places data to be read from the database then subsequently written to the sitemap.

This is a common requirement, and Spring Batch provides built-in Interfaces to help perform these types of processing:

  • ItemReader – Reads a chunk of data from a source; each data is considered an item. In my case, an item represents a place.
  • ItemProcessor – Transforms the data before writing. This is optional and is not used in this example.
  • ItemWriter – Writes a chunk of data to a destination. In my case, I add each place to the sitemap.

The Spring Batch API includes a class called JdbcCursorItemReader, an implementation of ItemReader which continuously reads rows from a JDBC ResultSet. It requires a RowMapper, which is responsible for mapping database rows to batch items.

For this step, I declare a JdbcCursorItemReader in my Spring configuration and set my implementation of RowMapper:

@Bean
public JdbcCursorItemReader<PlaceItem> placeItemReader() {
  JdbcCursorItemReader<PlaceItem> itemReader = new JdbcCursorItemReader<>();
  itemReader.setSql(environment.getRequiredProperty(PROP_NAME_SQL_PLACES));
  itemReader.setDataSource(dataSource);
  itemReader.setRowMapper(new PlaceItemRowMapper());
  return itemReader;
}

Line 4 sets the SQL statement used to produce the ResultSet. In my case, the SQL statement is fetched from a properties file.

Line 5 sets the JDBC DataSource.

Line 6 sets my implementation of RowMapper.
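The RowMapper implementation itself is not shown in the post. A minimal sketch (RowMapper here is org.springframework.jdbc.core.RowMapper), assuming PlaceItem has setters matching the getters used in the writer below and using hypothetical column names that would have to match the configured SQL, could look like this:

//sketch (assumption) - maps one row of the places query to a PlaceItem; column names are hypothetical
public class PlaceItemRowMapper implements RowMapper<PlaceItem> {
  @Override
  public PlaceItem mapRow(ResultSet rs, int rowNum) throws SQLException {
    PlaceItem place = new PlaceItem();
    place.setApiId(rs.getString("api_id"));
    place.setSearchId(rs.getString("search_id"));
    return place;
  }
}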

Next, I write my implementation of ItemWriter:

public class PlaceItemWriter implements ItemWriter<PlaceItem> {
  private static final Logger logger = LoggerFactory.getLogger(PlaceItemWriter.class);
  private final String rootUrl;
  @Inject
  private WebSitemapGenerator sitemapGenerator;
  public PlaceItemWriter(String rootUrl) {
    this.rootUrl = rootUrl;
  }
  @Override
  public void write(List<? extends PlaceItem> items) throws Exception {
    String url;
    for (PlaceItem place : items) {
      url = rootUrl + "/place/" + place.getApiId() + "?searchId=" + place.getSearchId();
      logger.info("Adding URL: " + url);
      sitemapGenerator.addUrl(url);
    }
  }
  public void setSitemapGenerator(WebSitemapGenerator sitemapGenerator) {
    this.sitemapGenerator = sitemapGenerator;
  }
}

Places in CheckTheCrowd.com are accessible from URLs having this pattern: checkthecrowd.com/place/{placeId}?searchId={searchId}. My ItemWriter simply iterates through the chunk of PlaceItems, builds the URL, then adds the URL to the sitemap.

Step 3

The third step is exactly the same as the previous, but this time processing is done on user profiles.

Below is my ItemReader declaration:

@Bean
public JdbcCursorItemReader<ProfileItem> profileItemReader() {
  JdbcCursorItemReader<ProfileItem> itemReader = new JdbcCursorItemReader<>();
  itemReader.setSql(environment.getRequiredProperty(PROP_NAME_SQL_PROFILES));
  itemReader.setDataSource(dataSource);
  itemReader.setRowMapper(new ProfileItemRowMapper());
  return itemReader;
}

Below is my ItemWriter implementation:

public class ProfileItemWriter implements ItemWriter<ProfileItem> {
  private static final Logger logger = LoggerFactory.getLogger(ProfileItemWriter.class);
  private final String rootUrl;
  @Inject
  private WebSitemapGenerator sitemapGenerator;
  public ProfileItemWriter(String rootUrl) {
    this.rootUrl = rootUrl;
  }
  @Override
  public void write(List<? extends ProfileItem> items) throws Exception {
    String url;
    for (ProfileItem profile : items) {
      url = rootUrl + "/profile/" + profile.getUsername();
      logger.info("Adding URL: " + url);
      sitemapGenerator.addUrl(url);
    }
  }
  public void setSitemapGenerator(WebSitemapGenerator sitemapGenerator) {
    this.sitemapGenerator = sitemapGenerator;
  }
}

Profiles in CheckTheCrowd.com are accessed from URLs having this pattern: checkthecrowd.com/profile/{username}.

Step 4

The last step is fairly straightforward and is also implemented as a simple Tasklet:

public class XmlWriterTasklet implements Tasklet {
  private static final Logger logger = LoggerFactory.getLogger(XmlWriterTasklet.class);
  @Inject
  private WebSitemapGenerator sitemapGenerator;
  @Override
  public RepeatStatus execute(StepContribution contribution, ChunkContext chunkContext) throws Exception {
    logger.info("Writing sitemap.xml...");
    sitemapGenerator.write();
    logger.info("Done.");
    return RepeatStatus.FINISHED;
  }
}

Notice that I am using the same instance of WebSitemapGenerator across all the steps. It is declared in my Spring configuration as:

@Bean
public WebSitemapGenerator sitemapGenerator() throws Exception {
  String rootUrl = environment.getRequiredProperty(PROP_NAME_ROOT_URL);
  String deployDirectory = environment.getRequiredProperty(PROP_NAME_DEPLOY_PATH);
  return WebSitemapGenerator.builder(rootUrl, new File(deployDirectory))
    .allowMultipleSitemaps(true).maxUrls(1000).build();
}

Because they change between environments (dev vs prod), rootUrl and deployDirectory are both configured from a properties file.

Wiring them all together…

<beans>
    <context:component-scan base-package="com.checkthecrowd.batch.sitemapgen.config" />
    <bean class="...config.SitemapGenConfig" />
    <bean class="...config.java.process.ConfigurationPostProcessor" />
    <batch:job id="generateSitemap" job-repository="jobRepository">
        <batch:step id="insertStaticPages" next="insertPlacePages">
            <batch:tasklet ref="staticPagesInitializerTasklet" />
        </batch:step>
        <batch:step id="insertPlacePages" parent="abstractParentStep" next="insertProfilePages">
            <batch:tasklet>
                <batch:chunk reader="placeItemReader" writer="placeItemWriter" />
            </batch:tasklet>
        </batch:step>
        <batch:step id="insertProfilePages" parent="abstractParentStep" next="writeXml">
            <batch:tasklet>
                <batch:chunk reader="profileItemReader" writer="profileItemWriter" />
            </batch:tasklet>
        </batch:step>
        <batch:step id="writeXml">
            <batch:tasklet ref="xmlWriterTasklet" />
        </batch:step>
    </batch:job>
    <batch:step id="abstractParentStep" abstract="true">
        <batch:tasklet>
            <batch:chunk commit-interval="100" />
        </batch:tasklet>
    </batch:step>
</beans>

The abstract step declared at the end of the configuration (abstractParentStep) serves as the common parent for steps 2 and 3. It sets a property called commit-interval, which defines how many items comprise a chunk; in this case, a chunk comprises 100 items.
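The post mentions scheduling how often the sitemap is generated, but the trigger itself is not shown. A minimal sketch, assuming Spring's scheduling support is enabled (@EnableScheduling or <task:annotation-driven/>) and that the generateSitemap job and a JobLauncher are available in the application context, could look like this:

//sketch (assumption) - launches the generateSitemap job once a day at 03:00
@Component
public class SitemapJobScheduler {
  @Inject
  private JobLauncher jobLauncher;
  @Inject
  private Job generateSitemap; //the job declared in the XML above

  @Scheduled(cron = "0 0 3 * * *")
  public void runSitemapJob() throws Exception {
    //a unique parameter so that every run creates a new job instance
    JobParameters jobParameters = new JobParametersBuilder()
        .addLong("time", System.currentTimeMillis())
        .toJobParameters();
    jobLauncher.run(generateSitemap, jobParameters);
  }
}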

There is a lot more to Spring Batch; kindly refer to the official reference guide.

Java Interfaces vs. Abstract Classes


A question I get a lot is what the difference is between Java interfaces and abstract classes, and when to use each. Having answered this question by email multiple times, I decided to write this tutorial about Java interfaces vs abstract classes.

Java interfaces are used to decouple the interface of some component from the implementation. In other words, to make the classes using the interface independent of the classes implementing the interface. Thus, you can exchange the implementation of the interface, without having to change the class using the interface.

Abstract classes are typically used as base classes for extension by subclasses. Some programming languages use abstract classes to achieve polymorphism, and to separate interface from implementation, but in Java you use interfaces for that. Remember, a Java class can only have 1 superclass, but it can implement multiple interfaces. Thus, if a class already has a different superclass, it can implement an interface, but it cannot extend another abstract class. Therefore interfaces are a more flexible mechanism for exposing a common interface.

If you need to separate an interface from its implementation, use an interface. If you also need to provide a base class or default implementation of the interface, add an abstract class (or normal class) that implements the interface.

Here is an example showing a class referencing an interface, an abstract class implementing that interface, and a subclass extending the abstract class.

The blue class knows only the interface. The abstract class implements the interface, and the subclass inherits from the abstract class.

Below are the code examples from the text on Java Abstract Classes, but with an interface added which is implemented by the abstract base class. That way it resembles the diagram above.

First the interface:

public interface URLProcessor {

    public void process(URL url) throws IOException;
}

Second, the abstract base class:

public abstract class URLProcessorBase implements URLProcessor {

    public void process(URL url) throws IOException {
        URLConnection urlConnection = url.openConnection();
        InputStream input = urlConnection.getInputStream();

        try{
            processURLData(input);
        } finally {
            input.close();
        }
    }

    protected abstract void processURLData(InputStream input)
        throws IOException;

}

Third, the subclass of the abstract base class:

public class URLProcessorImpl extends URLProcessorBase {

    @Override
    protected void processURLData(InputStream input) throws IOException {
        int data = input.read();
        while(data != -1){
            System.out.println((char) data);
            data = input.read();
        }
    }
}

Fourth, how to use the interface URLProcessor as the variable type, even though it is the subclass URLProcessorImpl that is instantiated:

URLProcessor urlProcessor = new URLProcessorImpl();

urlProcessor.process(new URL("http://jenkov.com"));

Using both an interface and an abstract base class makes your code more flexible. It is possible to implement simple URL processors simply by subclassing the abstract base class. If you need something more advanced, your URL processor can implement the URLProcessor interface directly, and not inherit from URLProcessorBase.

Src:

http://tutorials.jenkov.com/java/interfaces-vs-abstract-classes.html

CORS on Nginx

from: http://enable-cors.org/server_nginx.html

The following Nginx configuration enables CORS, with support for preflight requests, using a regular expression to define a whitelist of allowed origins, and various default values that may be needed to work around incorrect browser implementations.

#
# A CORS (Cross-Origin Resource Sharing) config for nginx
#
# == Purpose
#
# This nginx configuration enables CORS requests in the following way:
# - enables CORS just for origins on a whitelist specified by a regular expression
# - CORS preflight request (OPTIONS) are responded immediately
# - Access-Control-Allow-Credentials=true for GET and POST requests
# - Access-Control-Max-Age=20days, to minimize repetitive OPTIONS requests
# - various superfluous settings to accommodate nonconformant browsers
#
# == Comment on echoing Access-Control-Allow-Origin
# 
# How do you allow CORS requests only from certain domains? The last
# published W3C candidate recommendation states that the
# Access-Control-Allow-Origin header can include a list of origins.
# (See: http://www.w3.org/TR/2013/CR-cors-20130129/#access-control-allow-origin-response-header )
# However, browsers do not support this well and it likely will be
# dropped from the spec (see, http://www.rfc-editor.org/errata_search.php?rfc=6454&eid=3249 ).
# 
# The usual workaround is for the server to keep a whitelist of
# acceptable origins (as a regular expression), match the request's
# Origin header against the list, and echo back the matched value.
#
# (Yes you can use '*' to accept all origins but this is too open and
# prevents using 'Access-Control-Allow-Credentials: true', which is
# needed for HTTP Basic Access authentication.)
#
# == Comment on  spec
#
# Comments below are all based on my reading of the CORS spec as of
# 2013-Jan-29 ( http://www.w3.org/TR/2013/CR-cors-20130129/ ), the
# XMLHttpRequest spec (
# http://www.w3.org/TR/2012/WD-XMLHttpRequest-20121206/ ), and
# experimentation with latest versions of Firefox, Chrome, Safari at
# that point in time.
#
# == Changelog
#
# shared at: https://gist.github.com/algal/5480916
# based on: https://gist.github.com/alexjs/4165271
#

location / {

    # if the request included an Origin: header with an origin on the whitelist,
    # then it is some kind of CORS request.

    # specifically, this example allow CORS requests from
    #  scheme    : http or https
    #  authority : any authority ending in ".mckinsey.com"
    #  port      : nothing, or :
    if ($http_origin ~* (https?://[^/]*\.mckinsey\.com(:[0-9]+)?)$) {
        set $cors "true";
    }

    # Nginx doesn't support nested If statements, so we use string
    # concatenation to create a flag for compound conditions

    # OPTIONS indicates a CORS pre-flight request
    if ($request_method = 'OPTIONS') {
        set $cors "${cors}options";  
    }

    # non-OPTIONS indicates a normal CORS request
    if ($request_method = 'GET') {
        set $cors "${cors}get";  
    }
    if ($request_method = 'POST') {
        set $cors "${cors}post";
    }

    # if it's a GET or POST, set the standard CORS responses header
    if ($cors = "trueget") {
        # Tells the browser this origin may make cross-origin requests
        # (Here, we echo the requesting origin, which matched the whitelist.)
        add_header 'Access-Control-Allow-Origin' "$http_origin";
        # Tells the browser it may show the response, when XmlHttpRequest.withCredentials=true.
        add_header 'Access-Control-Allow-Credentials' 'true';
        # # Tell the browser which response headers the JS can see, besides the "simple response headers"
        # add_header 'Access-Control-Expose-Headers' 'myresponseheader';
    }

    if ($cors = "truepost") {
        # Tells the browser this origin may make cross-origin requests
        # (Here, we echo the requesting origin, which matched the whitelist.)
        add_header 'Access-Control-Allow-Origin' "$http_origin";
        # Tells the browser it may show the response, when XmlHttpRequest.withCredentials=true.
        add_header 'Access-Control-Allow-Credentials' 'true';
        # # Tell the browser which response headers the JS can see, besides the "simple response headers"
        # add_header 'Access-Control-Expose-Headers' 'myresponseheader';
    }

    # if it's OPTIONS, then it's a CORS preflight request so respond immediately with no response body
    if ($cors = "trueoptions") {
        # Tells the browser this origin may make cross-origin requests
        # (Here, we echo the requesting origin, which matched the whitelist.)
        add_header 'Access-Control-Allow-Origin' "$http_origin";
        # in a preflight response, tells browser the subsequent actual request can include user credentials (e.g., cookies)
        add_header 'Access-Control-Allow-Credentials' 'true';

        #
        # Return special preflight info
        #
        
        # Tell browser to cache this pre-flight info for 20 days
        add_header 'Access-Control-Max-Age' 1728000;

        # Tell browser we respond to GET,POST,OPTIONS in normal CORS requests.
        #
        # Not officially needed but still included to help non-conforming browsers.
        #
        # OPTIONS should not be needed here, since the field is used
        # to indicate methods allowed for "actual request" not the
        # preflight request.
        #
        # GET,POST also should not be needed, since the "simple
        # methods" GET,POST,HEAD are included by default.
        #
        # We should only need this header for non-simple requests
        # methods (e.g., DELETE), or custom request methods (e.g., XMODIFY)
        add_header 'Access-Control-Allow-Methods' 'GET, POST, OPTIONS';
        
        # Tell browser we accept these headers in the actual request
        #
        # A dynamic, wide-open config would just echo back all the headers
        # listed in the preflight request's
        # Access-Control-Request-Headers.
        #
        # A dynamic, restrictive config, would just echo back the
        # subset of Access-Control-Request-Headers headers which are
        # allowed for this resource.
        #
        # This static, fairly open config just returns a hardcoded set of
        # headers that covers many cases, including some headers that
        # are officially unnecessary but actually needed to support
        # non-conforming browsers
        # 
        # Comment on some particular headers below:
        #
        # Authorization -- practically and officially needed to support
        # requests using HTTP Basic Access authentication. Browser JS
        # can use HTTP BA authentication with an XmlHttpRequest object
        # req by calling
        # 
        #   req.withCredentials=true,  and
        #   req.setRequestHeader('Authorization','Basic ' + window.btoa(theusername + ':' + thepassword))
        #
        # Counterintuitively, the username and password fields on
        # XmlHttpRequest#open cannot be used to set the authorization
        # field automatically for CORS requests.
        #
        # Content-Type -- this is a "simple header" only when it's
        # value is either application/x-www-form-urlencoded,
        # multipart/form-data, or text/plain; and in that case it does
        # not officially need to be included. But, if your browser
        # code sets the content type as application/json, for example,
        # then that makes the header non-simple, and then your server
        # must declare that it allows the Content-Type header.
        # 
        # Accept,Accept-Language,Content-Language -- these are the
        # "simple headers" and they are officially never
        # required. Practically, possibly required.
        #
        # Origin -- logically, should not need to be explicitly
        # required, since it's implicitly required by all of
        # CORS. officially, it is unclear if it is required or
        # forbidden! practically, probably required by existing
        # browsers (Gecko does not request it but WebKit does, so
        # WebKit might choke if it's not returned back).
        #
        # User-Agent,DNT -- officially, should not be required, as
        # they cannot be set as "author request headers". practically,
        # may be required.
        # 
        # My Comment:
        #
        # The specs are contradictory, or else just confusing to me,
        # in how they describe certain headers as required by CORS but
        # forbidden by XmlHttpRequest. The CORS spec says the browser
        # is supposed to set Access-Control-Request-Headers to include
        # only "author request headers" (section 7.1.5). And then the
        # server is supposed to use Access-Control-Allow-Headers to
        # echo back the subset of those which is allowed, telling the
        # browser that it should not continue and perform the actual
        # request if it includes additional headers (section 7.1.5,
        # step 8). So this implies the browser client code must take
        # care to include all necessary headers as author request
        # headers.
        # 
        # However, the spec for XmlHttpRequest#setRequestHeader
        # (section 4.6.2) provides a long list of headers which the
        # the browser client code is forbidden to set, including for
        # instance Origin, DNT (do not track), User-Agent, etc.. This
        # is understandable: these are all headers that we want the
        # browser itself to control, so that malicious browser client
        # code cannot spoof them and for instance pretend to be from a
        # different origin, etc..
        #
        # But if XmlHttpRequest forbids the browser client code from
        # setting these (as per the XmlHttpRequest spec), then they
        # are not author request headers. And if they are not author
        # request headers, then the browser should not include them in
        # the preflight request's Access-Control-Request-Headers. And
        # if they are not included in Access-Control-Request-Headers,
        # then they should not be echoed by
        # Access-Control-Allow-Headers. And if they are not echoed by
        # Access-Control-Allow-Headers, then the browser should not
        # continue and execute actual request. So this seems to imply
        # that the CORS and XmlHttpRequest specs forbid certain
        # widely-used fields in CORS requests, including the Origin
        # field, which they also require for CORS requests.
        #
        # The bottom line: it seems there are headers needed for the
        # web and CORS to work, which at the moment you should
        # hard-code into Access-Control-Allow-Headers, although
        # official specs imply this should not be necessary.
        # 
        add_header 'Access-Control-Allow-Headers' 'Authorization,Content-Type,Accept,Origin,User-Agent,DNT,Cache-Control,X-Mx-ReqToken,Keep-Alive,X-Requested-With,If-Modified-Since';

        # build entire response to the preflight request
        # no body in this response
        add_header 'Content-Length' 0;
        # (should not be necessary, but included for non-conforming browsers)
        add_header 'Content-Type' 'text/plain charset=UTF-8';
        # indicate successful return with no content
        return 204;
    }
    # --PUT YOUR REGULAR NGINX CODE HERE--
}

How to update GitHub Forked Repository

Question:

Recently forked a project and applied several fixes. I then created a pull request which was then accepted.

A few days later another change was made by another contributor. So my fork doesn’t contain that change… How can I get that change into my fork?

Do I need to delete and re-create my fork when I have further changes to contribute? Or is there an update button?

Answer:

In your local clone of your forked repository, you can add the original GitHub repository as a “remote”. (“Remotes” are like nicknames for the URLs of repositories – origin is one, for example.) Then you can fetch all the branches from that upstream repository, and rebase your work to continue working on the upstream version. In terms of commands that might look like:

# Add the remote, call it "upstream":

git remote add upstream https://github.com/whoever/whatever.git

# Fetch all the branches of that remote into remote-tracking branches,
# such as upstream/master:

git fetch upstream

# Make sure that you're on your master branch:

git checkout master

# Rewrite your master branch so that any commits of yours that
# aren't already in upstream/master are replayed on top of that
# other branch:

git rebase upstream/master

If you don’t want to rewrite the history of your master branch, (for example because other people may have cloned it) then you should replace the last command with git merge upstream/master. However, for making further pull requests that are as clean as possible, it’s probably better to rebase.


Update: If you’ve rebased your branch onto upstream/master you may need to force the push in order to push it to your own forked repository on GitHub. You’d do that with:

git push -f origin master

You only need to use the -f the first time after you’ve rebased.

Reference:

http://stackoverflow.com/questions/7244321/how-to-update-github-forked-repository

Migrating Broadleaf Commerce to PostgreSQL and Tomcat, and deploying to AWS Beanstalk

Recently I have been working with Broadleaf Commerce, an open-source template for e-commerce websites.

The demo runs in a Jetty container with an HSQL database. The official website describes how to migrate the database to PostgreSQL and how to configure the project for deployment on Tomcat, but the process is not very detailed and there are not many online resources on the topic, so I decided to write this post as a summary.

Part 1: Migrating the database (HSQL to PostgreSQL)

(a) Open the pom.xml file in the root directory of the DemoSite project and add the following to the <dependencyManagement> section:

<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>9.3-1102-jdbc41</version>
    <type>jar</type>
    <scope>compile</scope>
</dependency>

(b) Open the pom.xml files in the admin and site folders and add the following to their <dependencies> sections:

<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
</dependency>

(c) Create a database named broadleaf in PostgreSQL.

(d) Open context.xml under admin/src/main/webapp/META-INF and site/src/main/webapp/META-INF and replace its content with the following (adjust the database configuration, such as the username and password, to your own environment):

<?xml version="1.0" encoding="UTF-8"?>
<Context>
    <Resource name="jdbc/web"
              auth="Container"
              type="javax.sql.DataSource"
              factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
              testWhileIdle="true"
              testOnBorrow="true"
              testOnReturn="false"
              validationQuery="SELECT 1"
              timeBetweenEvictionRunsMillis="30000"
              maxActive="15"
              maxIdle="10"
              minIdle="5"
              removeAbandonedTimeout="60"
              removeAbandoned="false"
              logAbandoned="true"
              minEvictableIdleTimeMillis="30000"
              jdbcInterceptors="org.apache.tomcat.jdbc.pool.interceptor.ConnectionState;org.apache.tomcat.jdbc.pool.interceptor.StatementFinalizer"
              username="root"
              password="123"
              driverClassName="com.postgresql.Driver"
              url="jdbc:mysql://localhost:3306/broadleaf"/>

    <Resource name="jdbc/storage"
              auth="Container"
              type="javax.sql.DataSource"
              factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
              testWhileIdle="true"
              testOnBorrow="true"
              testOnReturn="false"
              validationQuery="SELECT 1"
              timeBetweenEvictionRunsMillis="30000"
              maxActive="15"
              maxIdle="10"
              minIdle="5"
              removeAbandonedTimeout="60"
              removeAbandoned="false"
              logAbandoned="true"
              minEvictableIdleTimeMillis="30000"
              jdbcInterceptors="org.apache.tomcat.jdbc.pool.interceptor.ConnectionState;org.apache.tomcat.jdbc.pool.interceptor.StatementFinalizer"
              username="root"
              password="123"
              driverClassName="com.postgresql.Driver"
              url="jdbc:mysql://localhost:3306/broadleaf"/>

    <Resource name="jdbc/secure"
              auth="Container"
              type="javax.sql.DataSource"
              factory="org.apache.tomcat.jdbc.pool.DataSourceFactory"
              testWhileIdle="true"
              testOnBorrow="true"
              testOnReturn="false"
              validationQuery="SELECT 1"
              timeBetweenEvictionRunsMillis="30000"
              maxActive="15"
              maxIdle="10"
              minIdle="5"
              removeAbandonedTimeout="60"
              removeAbandoned="false"
              logAbandoned="true"
              minEvictableIdleTimeMillis="30000"
              jdbcInterceptors="org.apache.tomcat.jdbc.pool.interceptor.ConnectionState;org.apache.tomcat.jdbc.pool.interceptor.StatementFinalizer"
              username="root"
              password="123"
              driverClassName="com.postgresql.Driver"
              url="jdbc:mysql://localhost:3306/broadleaf"/>
</Context>

(e) Open the core/src/main/resources/runtime-properties/common-shared.properties file and replace the following three lines:

blPU.hibernate.dialect=org.hibernate.dialect.HSQLDialect
blCMSStorage.hibernate.dialect=org.hibernate.dialect.HSQLDialect
blSecurePU.hibernate.dialect=org.hibernate.dialect.HSQLDialect

with:

blPU.hibernate.dialect=org.hibernate.dialect.PostgreSQLDialect
blSecurePU.hibernate.dialect=org.hibernate.dialect.PostgreSQLDialect
blCMSStorage.hibernate.dialect=org.hibernate.dialect.PostgreSQLDialect

(f) Open build.properties in the DemoSite root directory. The following entries:

ant.hibernate.sql.ddl.dialect=org.hibernate.dialect.HSQLDialect

ant.blPU.url=jdbc:hsqldb:hsql://localhost/broadleaf
ant.blPU.userName=sa
ant.blPU.password=null
ant.blPU.driverClassName=org.hsqldb.jdbcDriver

ant.blSecurePU.url=jdbc:hsqldb:hsql://localhost/broadleaf
ant.blSecurePU.userName=sa
ant.blSecurePU.password=null
ant.blSecurePU.driverClassName=org.hsqldb.jdbcDriver

ant.blCMSStorage.url=jdbc:hsqldb:hsql://localhost/broadleaf
ant.blCMSStorage.userName=sa
ant.blCMSStorage.password=null
ant.blCMSStorage.driverClassName=org.hsqldb.jdbcDriver

should be changed to match your database configuration, for example:

ant.hibernate.sql.ddl.dialect=org.hibernate.dialect.PostgreSQLDialect

ant.blPU.url=jdbc:postgresql://localhost:5432/broadleaf
ant.blPU.userName=root
ant.blPU.password=123
ant.blPU.driverClassName=org.postgresql.Driver

ant.blSecurePU.url=jdbc:postgresql://localhost:5432/broadleaf
ant.blSecurePU.userName=root
ant.blSecurePU.password=123
ant.blSecurePU.driverClassName=org.postgresql.Driver

ant.blCMSStorage.url=jdbc:postgresql://localhost:5432/broadleaf
ant.blCMSStorage.userName=root
ant.blCMSStorage.password=123
ant.blCMSStorage.driverClassName=org.postgresql.Driver

With this, the database migration is complete.

Part 2: Migrating the server (from Jetty to Tomcat 7)

(a) In the pom.xml files of the site and admin directories, add the following to the <plugins> section:

<plugin>
    <groupId>org.apache.tomcat.maven</groupId>
    <artifactId>tomcat7-maven-plugin</artifactId>
    <version>2.0</version>
    <configuration>
        <warSourceDirectory>${webappDirectory}</warSourceDirectory>
        <path>/</path>
        <port>${httpPort}</port>
        <httpsPort>${httpsPort}</httpsPort>
        <keystoreFile>${webappDirectory}/WEB-INF/blc-example.keystore</keystoreFile>
        <keystorePass>broadleaf</keystorePass>
        <password>broadleaf</password>
    </configuration>
</plugin>

(b) Right-click the DemoSite project in Eclipse and run Run As > Maven clean followed by Run As > Maven install. After a successful build, the corresponding WAR packages are generated in the target folders of admin and site; in this example the two WAR packages are named admin.war and zk.war.

(c) On Ubuntu, the Tomcat webapps path is /var/lib/tomcat7/webapps. Copy admin.war and zk.war to that folder and then restart the Tomcat server:

sudo /etc/init.d/tomcat7 restart

The /var/log/tomcat7/catalina.out log shows the following error:

Caused by: java.lang.OutOfMemoryError: Java heap space
        at org.apache.tomcat.util.bcel.classfile.ClassParser.readMethods(ClassParser.java:268)
        at org.apache.tomcat.util.bcel.classfile.ClassParser.parse(ClassParser.java:128)
        at org.apache.catalina.startup.ContextConfig.processAnnotationsStream(ContextConfig.java:2105)
        at org.apache.catalina.startup.ContextConfig.processAnnotationsJar(ContextConfig.java:1981)
        at org.apache.catalina.startup.ContextConfig.processAnnotationsUrl(ContextConfig.java:1947)
        at org.apache.catalina.startup.ContextConfig.processAnnotations(ContextConfig.java:1932)
        at org.apache.catalina.startup.ContextConfig.webConfig(ContextConfig.java:1326)
        at org.apache.catalina.startup.ContextConfig.configureStart(ContextConfig.java:878)
        at org.apache.catalina.startup.ContextConfig.lifecycleEvent(ContextConfig.java:369)
        at org.apache.catalina.util.LifecycleSupport.fireLifecycleEvent(LifecycleSupport.java:119)
        at org.apache.catalina.util.LifecycleBase.fireLifecycleEvent(LifecycleBase.java:90)
        at org.apache.catalina.core.StandardContext.startInternal(StandardContext.java:5179)
        at org.apache.catalina.util.LifecycleBase.start(LifecycleBase.java:150)
        at org.apache.catalina.core.ContainerBase.addChildInternal(ContainerBase.java:901)
        at org.apache.catalina.core.ContainerBase.addChild(ContainerBase.java:877)
        at org.apache.catalina.core.StandardHost.addChild(StandardHost.java:633)
        at org.apache.catalina.startup.HostConfig.deployDirectory(HostConfig.java:1114)
        at org.apache.catalina.startup.HostConfig$DeployDirectory.run(HostConfig.java:1673)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        ... 4 more

After some searching on Baidu I learned that this is an out-of-memory problem; the specific fix is as follows:

In catalina.sh on Ubuntu (the file is located at /usr/share/tomcat7/bin/catalina.sh), add the following as the first line:

JAVA_OPTS='-server -Xms256m -Xmx512m -XX:PermSize=128M -XX:MaxPermSize=256M' #Note: single quotation marks can not be omitted

In catalina.bat on Windows, add the following as the first line:

set JAVA_OPTS=-server -Xms256m -Xmx512m -XX:PermSize=128M -XX:MaxPermSize=256M #Note: no single quotation marks

(d) After making the changes described in (c), restart the Tomcat server:

sudo /etc/init.d/tomcat7 restart

You can now open the shop page in the browser at localhost:8080/zk and the admin page at localhost:8080/admin, so the migration to the Tomcat server is also complete.

Cross-Origin Resources Sharing on JAX-RS web services

What is the problem?

Accessing resources offered on a different domain than the JavaScript client which wants to access the data is restricted by the Same Origin Policy. This is a good idea in principle, as it protects you from bad sites hacking your bank account or other relevant data. Accessing data from public APIs can be done either with JSONP requests (padded JSON, embedded JSON used by a callback method) or by making CORS (http://en.wikipedia.org/wiki/Cross-origin_resource_sharing) requests. Another possibility is to make requests via a proxy, but this requires that the client have a server backend available, which is not the case in Android apps, for example. It also requires additional configuration of the backend.

CORS requests

CORS is a W3C specification and is broadly implemented by all newer browsers. Creating CORS requests on the client side mostly means adding an Origin URL as an HTTP header. If the remote server is configured to allow the client's Origin to access the server's resources, it returns an OK; otherwise the same-origin policy applies.

HowTo Client

We won't repeat HowTos which have already been written. The HTML5Rocks tutorial is a good one explaining how to make CORS requests from JavaScript.

HowTo Server

Apache Tomcat, since version 7, provides a CORS filter which can easily be configured in a web application's web.xml. See Tomcat's filter documentation for a detailed description. There are plenty of other servers which can be configured to handle CORS.
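For a JAX-RS 2.0 service itself, independent of the container, one common approach is a ContainerResponseFilter (from javax.ws.rs.container) registered as a @Provider that appends the CORS headers to every response. A minimal sketch, where the allowed origin, methods and headers are only examples:

//sketch - a JAX-RS 2.0 response filter that adds CORS headers to every response
@Provider
public class CorsResponseFilter implements ContainerResponseFilter {

    @Override
    public void filter(ContainerRequestContext requestContext,
                       ContainerResponseContext responseContext) throws IOException {
        //in production, match the request's Origin header against a whitelist instead of using "*"
        responseContext.getHeaders().add("Access-Control-Allow-Origin", "*");
        responseContext.getHeaders().add("Access-Control-Allow-Methods", "GET, POST, OPTIONS");
        responseContext.getHeaders().add("Access-Control-Allow-Headers", "Origin, Content-Type, Accept, Authorization");
    }
}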

Reference:

1. Cross-Origin Resources Sharing, https://wiki.52north.org/bin/view/Documentation/CrossOriginResourceSharing

2.Using CORS

http://www.html5rocks.com/en/tutorials/cors/

3. How to enable Cross domain requests on JAX-RS web services?

http://stackoverflow.com/questions/23450494/how-to-enable-cross-domain-requests-on-jax-rs-web-services

Deploying WordPress to Amazon Web Services AWS EC2 and RDS via ElasticBeanstalk


Src: https://www.otreva.com/blog/deploying-wordpress-amazon-web-services-aws-ec2-rds-via-elasticbeanstalk/

A common question we get asked is: how do I ensure my WordPress application can scale with an influx of demand? What can I do to ensure high uptime and great performance?

As a member of the AWS Partner Network, our typical response is to build an auto-scaling, self healing cloud application using AWS ElasticBeanstalk. AWS Elastic Beanstalk is an even easier way for you to quickly deploy and manage applications in the AWS cloud. You simply upload your application via GIT, and Elastic Beanstalk automatically handles the deployment details of capacity provisioning, load balancing, auto-scaling, and application health monitoring. But at the same time, you retain full control over the AWS resources powering your application and can access the underlying resources at any time via ElasticBeanstalk.

Sounds great right? We think so too.

The one consideration is that users have to understand the need for their application to be stateless. This means you can no longer upload photos to the server itself, you can't install plugins on the server, and you don't want anything to change on the server without it first being changed locally via your GIT repository. Luckily, there are a few plugins to help ease this pain; they allow you to still use the default WordPress media uploader and browser, but instead of storing media on the server, you can configure an AWS S3 (Simple Storage Service) bucket to keep all the images off the server.

So why is stateless so important?

Auto-scaling, self-healing applications rely on this single understanding of stateless design. The reason an application must be stateless is because if it needs to scale up (add more servers behind it), it cannot be inconsistent between server instances. The application code on server instance A must be exactly the same as server instance B. This is why your repository should only contain your base PHP files, theme files, plugins and basic WordPress application files. Media like images, videos and PDFs should NEVER be kept in your GIT repository unless they are needed for your theme. So for example if you have a background image that is used throughout the site, that may be an exception that can live in your theme’s image folder inside the GIT repository. However images that go in blog posts, pages, etc should be hosted on a storage service like AWS S3. Once we upload the images to S3, we get a URL back that can live in the database which would be shared by all instances that are deployed. There are many more considerations to stateless application design which are outside the scope of this post.

Prerequisites:

  1. An AWS Account with Billing Setup
  2. GIT – Version Control System
  3. MySQL WorkBench (if migrating database – Not needed for fresh install)
  4. WordPress Install on Local Web Server
  5. WordPress wp-config.php file
  6. AWS ElasticBeanstalk command line tool

Step 1: Setup Local Web Server

This tutorial assumes you know how to install WordPress locally on your machine via a local web server like XAMPP, MAMP, LAMP or other apache based server. So once you have a folder setup and your files ready to be accessed by your local web server (localhost) you will be ready to proceed from here.

Copy all of the WordPress files into your local folder so they'll be ready to be accessed by your local web server. Don't set up a local database. At the time of this writing, we're installing WordPress 3.6.1. If you've done everything correctly so far, you should be able to access your files and see a screen like this:

Wordpress create a configuration file

Step 2: Configure AWS ElasticBeanstalk and WordPress for the Cloud

If you made it here, great. The next few steps will be slightly different from the typical WordPress install so try not to deviate.

Config AWS ElasticBeanstalk

  1. Login to your AWS Account and go to ElasticBeanstalk and click Create Application which should bring you to a screen similar to below. Name your application here:
    aws-beanstalk-step1
  2. Next, select PHP as your platform and leave the environment type as Load Balancing, Autoscaling.
    aws-beanstalk-step2
  3. Leave sample application selected for now:
    aws-beanstalk-step3
  4. Give your environment a different name if you want and change the subdomain if you want and check the availability. We will later be creating a CNAME in your DNS for the Environment URL so it isn’t a huge deal for what you name it. It won’t be a public URL.
    aws-beanstalk-step4
  5. Ensure you check create an RDS DB instance.
    aws-beanstalk-step5
  6. The default settings on this page are probably fine for most installs and they can all be updated later so we won’t go into detail about them for now.
    aws-beanstalk-step6jpg
  7. Next we’ll need to configure our MySQL database here. Again most of the default values are OK to start and you can always update them later. Most WordPress sites we see need a database well below the 5GB minimum but it is a minimum for AWS RDS so use 5GB unless you have a larger database. Also pick a very secure username and password here and save them for the later steps. You can also pick multiple availability zones if this is an enterprise application that needs to ensure 100% uptime but be aware costs will tip $50 a month for DB usage only.
    aws-beanstalk-step7
  8. Lastly, review everything and create the app. You should see a screen like below. Wait until the environment health is green to proceed to the next steps.
    aws-beanstalk-step8
  9. Next, in the AWS console, go to Services > RDS. Since this is likely your first AWS RDS DB, you should be able to click Security Groups on the left and see only one security group that looks something like "awseb-e-xxxxxxxxxxxxxxxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxx". Edit that security group. You should see a best estimate of your IP address that Amazon gives you in the small text in the yellow box. Enter that, or your correct IP address with subnet mask. This allows our local WordPress app to access our AWS RDS cloud database. We will be sharing one database between our local and AWS EC2 cloud environments for this tutorial. (Note: VPC users set up security differently.) aws-beanstalk-step9
  10. Now while still in the RDS console, switch back to instances and copy the Endpoint field WITHOUT :3306. We’ll want this endpoint URL as well as the username and password we created above for the WordPress install. Also note the Database Name which is normally ebdb.
    aws-beanstalk-step10

Step 3: Install WordPress

  1. Switch back to your browser and go to your configuration. We are now ready to create a configuration file.
    Wordpress create a configuration file
  2. Fill in the database name as found in the RDS console, which is normally ebdb, the endpoint from above as Database Host (remember without :3306) as well as the username and password you created. You can leave the prefix as wp_.
    create-a-config2
  3. Now click install and you should be up and running! Finish the install just as any other WordPress Install
    create-a-config3
  4. But wait, we’re only halfway done!

Step 4: Local WP Config and GIT Repository setup

  1. Create a new file, local-config.php, and put it in the root folder where wp-config.php is. Make sure to change the URLs below to whatever local hostname you use to access your install.
    <?php define('WP_HOME','http://localhost/elasticbeanstalk'); define('WP_SITEURL','http://localhost/elasticbeanstalk'); ?>
  2. Open wp-config.php to edit. Add the following code just before define(‘DB_NAME’, ‘ebdb’); This allows us to browse our local code under our local hostname while still using a remote database. This step is very important.
    if ( file_exists( dirname( __FILE__ ) . '/local-config.php' ) ) {
        include( dirname( __FILE__ ) . '/local-config.php' );
    }
  3. Create another file called .gitignore in the same location (root). This file will allow us to ignore files from our GIT repository that we need locally but that we don’t want in the repo.
    #################
    ## WordPress
    #################
    .git-rewrite/
    local-config.php
    .elasticbeanstalk/
    *.local.*
    *.remote.*
    *.base.*
    *.backup.*
    *.orig*
    #ElasticBeanstalk Configs we won't need in the repo
    AWSDevTools/
    AWSDevTools-OneTimeSetup.bat
    AWSDevTools-RepositorySetup.bat
    scripts/
    AWSDevTools-RepositorySetup.sh
    #These are to ignore media files which we again DON'T want in our GIT Repo.
    wp-content/uploads/
    wp-content/cache/

Step 5: Initialize your GIT repo

This assumes you already have GIT setup on your local machine.

  1. From the command line, CD to your local folder where all the website files are and make sure you are in the root where the wp-admin, wp-content, wp-includes folders are.
    git init
  2. Add all the local WP Files to a GIT Repository. Note git will automatically ignore the files we put in our .gitignore file above.
    git add *.*
  3. Finally commit them to the GIT Repository.
    git commit -m "Committing my initial WordPress site into this repo"

Step 6: Push this site up to AWS Beanstalk

Now we finally have our local site in our local GIT Repository and we’re ready to push it up to ElasticBeanstalk to make the site live. You should probably read this first to get a better understanding of what we’re about to do. We’re going to use a slightly slimmed down version of the above to help you get going.

  1. Create a folder in the root of our site called .elasticbeanstalk. Inside that folder we’ll make two files.
    1. config
      [global]
      ApplicationName=YourApplicationNameFromAWSConsole
      AwsCredentialFile=.elasticbeanstalk/aws_credentials
      DevToolsEndpoint=git.elasticbeanstalk.us-east-1.amazonaws.com
      EnvironmentName=EnvironmentNameFromAWSConsole
      Region=us-east-1

      aws-beanstalk-step11

    2. aws_credentials
      These come from your AWS Console Account. Create a new user under IAM Console to get these:

      [global]
      AWSAccessKeyId=AKIAxxxxxxxxxxxxxxxxxxxxx
      AWSSecretKey=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
  2. Go and download the AWS command line tools. Then copy the contents of the Windows folder of that download to your root folder where WP lives. If on Linux or OS X, use that platform's folder instead.
  3. Add and commit everything we just did to the GIT Repository.
    git add *.*
    git commit -m "Adding AWS Configs"
  4. Finally let’s push the Repo up to Beanstalk.
    git aws.push
  5. If all went well, you should be able to see the environment updating inside the AWS Console.

Done! Well, with the deploy process, at least.

Now your basic WordPress application is up on AWS, but keeping it stateless means you must always install plugins locally, make changes to theme files locally, and otherwise change the source code in the repo locally, then add, commit and push to AWS. However, it allows us to use the power of AWS cloud infrastructure to create self-healing, auto-scaling apps. Although outside the scope of this article, you can easily use plugins like Amazon S3 and CloudFront together with the Amazon Web Services plugin to send your media files to S3 and keep your codebase free of media. Install the latter plugin first.

Good luck and feel free to leave comments below with questions. If you’ve never installed WordPress on a basic shared server or local server, you may not want to attempt this until you have done that.

How to do Cohort Analysis in Google Analytics


Src: http://jonathonbalogh.com/2012/04/01/how-to-do-cohort-analysis-in-google-analytics/2/

Cohort analysis example: engagement

Never use analytics to track information that uniquely identifies a particular person, including their real name, email address or IP. It’s not only against Google Analytics’ terms of service, it’s also a lousy and unnecessary violation of privacy.

Most cohort analysis is based on users grouped by a common date range. We do this to see if their behavior from one period to the next has changed. It’s also possible to group users based on other attributes that they share, such as membership level or achieved goals. The objective is to learn whether users with this attribute tend to achieve our product goals at a significantly different rate than a baseline cohort over time.

What types of data should we track? This depends on the type of product you have and the level of detail you need. Ask yourself: what are the long term attributes of your users that Google Analytics doesn’t provide? Which properties best differentiate your users and are most relevant to your product? What questions are you trying to answer?

USER ATTRIBUTES
Good examples: total downloads, donated, sign up date, Klout score, gender, membership type, games played, referred friend, test group
Bad examples: number of visits, location, browser, referer, number of pageviews, IP address, last name

Yes, there are exceptions to virtually every one of those examples. Use your judgement. If it’s important for you to know the number of people who started with Internet Explorer last year but are using Chrome this year then go ahead and record the user’s “Initial Browser”, for example.

> AN ASIDE: AREN’T THERE BETTER WAYS TO DO THIS?

In Google Analytics the majority of metrics are associated with a visit or session – this includes goals and events. When selecting trackable cohort attributes you’re making a decision about which user data to track across visits. Want to know how many downloads you had last week? Just use events or virtual pageviews. Cohort tracking doesn’t help with that. Need to track the number of visits in which users opened your pricing page, clicked a Learn More link and then signed up for your premium plan? Use a funnel, that’s what they’re for. Curious if last year’s paying members are as likely to pay this year as new members? Use a cohort analysis and track both sign up date and transactions.
There are, in fact, other ways to get this type of information. The best way is to just query your database directly. If users need to sign in to your product to use it then they likely have an account stored in your database. Want the number of users who’ve signed up in the last month and donated at least once? Just login to your live database and execute the appropriate SQL query. Want to graph that for the last 6 months and compare it against the referring medium? No problem. Just parse your site log file to correlate visits to logins so you can update a new DB table on visitor attributes then run another query, likely involving a join, on a replicated DB (to ensure stability), export the results, import the data into a spreadsheet or something else and then create the graphs. Heck, you can even manage funnel reports if you’re willing to work at it.
A homegrown analytics solution gives you lots of power and flexibility without having to rely on a third party service. And honestly, as involved as it may be, if you know what you’re doing you can automate your solution to the point where it’s just as fast and easy to use as a dedicated service. Maybe better. So why wouldn’t you? If you’re comfortable with this stuff, don’t mind investing the time and believe it’s critical for your product’s success then you probably should. For the rest of us, the investment in learning, building and maintaining this type of solution just isn’t worth it. (Though there are analytics services around that can help you with this.)

Blog example: Guido’s Mosquitos

I find things much easier to understand when looking at a real world situation. Let’s try a quick tutorial showing how you might use cohort analysis in Google Analytics to track engagement. Imagine your product is a blog advocating respect for your friend, the misunderstood mosquito. Your goal for “Guido’s Mosquitos” is to understand how well you retain your readers as well as record a few goals that they might reach on your site. In this case, you need to decide which cohort retention intervals you care about and which goals matter most. Let’s start with something like this:

Data layout:

SLOT   | PURPOSE         | EXAMPLE DATA | DESCRIPTION
Slot 1 | Signup date     | 20111019     | Date of user's first visit
Slot 2 | Weekly cohort   | 42           | Week of user's first visit
Slot 3 | Ebook downloads | 3            | Number of ebooks downloaded
Slot 4 | Goal tracking   | RefSent      | User referred a friend

It’s a new year and you’re considering adding more ebooks for readers to download from your blog. However, you only want to do so if it’s likely to increase donations. How do you proceed? In this case, the cohort, the group of people you’re most interested in, is made up of users who have downloaded at least x of your ebooks. You don’t care when they started coming to your site, or even how long they stayed, just that they engaged in an activity of interest to you.

ADVANCED SEGMENT MATCH CONDITIONS
“Cohort: 0 downloads”  | Custom var: 3 | Matching RegExp: ^0$
“Cohort: 1 download”   | Custom var: 3 | Matching RegExp: ^1$
“Cohort: 2+ downloads” | Custom var: 3 | Matching RegExp: ^[2-9]$

With this segmentation you can jump over to an appropriately configured custom report and attempt to answer your initial question. For example, you might try to plot the number of goals achieved (donations) by each of the 3 user segments during the last couple months of the year.

Aak! The abundance of ebooks is killing your business! Ok, not really. This is a rather limited analysis and it’s important that we understand exactly what it says. Looking at the “Cohort: 1 download” segment, for example, the results might be read something like this: 14.49% of users who downloaded exactly 1 ebook made a donation in the last 2 months. These users may have downloaded their one ebook during the analysis period or any time before that.

Correlation between users who download ebooks and make donations

What we are trying to do is establish a correlation between our test segments (users who download ebooks) and our target goals (in this case, donations). The graph suggests that those who download ebooks are significantly more likely to donate but that those who download 1 ebook are just as likely to donate (if not more) as those who download 2 or more. The graph says nothing about why this is the case. Perhaps each of the downloaded ebooks repeat the same message and you’re boring your audience to tears. I don’t know. A more detailed attribution analysis would be required. But the investigation here should at least make you stop and think: maybe I should investigate this further before adding more ebooks, or perhaps there’s a better way to increase donations (preferably one with more promising data).

Guide to AJAX crawling for webmasters and developers


Src: https://support.google.com/webmasters/answer/174992?hl=en

Overview

If you’re running an AJAX application with content that you’d like to appear in search results, we have a new process that, when implemented, can help Google (and potentially other search engines) crawl and index your content. Historically, AJAX applications have been difficult for search engines to process because AJAX content is produced dynamically by the browser and thus not visible to crawlers. While there are existing methods for dealing with this problem, they involve regular manual maintenance to keep the content up-to-date.

In contrast, the scheme below helps search engines to scalably crawl and index your content, and it helps webmasters keep the indexed content current without ongoing manual effort. If your AJAX application adopts this scheme, its content can show up in search results. The scheme works as follows:

  1. The site adopts the AJAX crawling scheme.
  2. Your server provides an HTML snapshot for each AJAX URL, which is the content a user (with a browser) sees. An AJAX URL is a URL containing a hash fragment, e.g., www.example.com/index.html#mystate, where #mystate is the hash fragment. An HTML snapshot is all the content that appears on the page after the JavaScript has been executed.
  3. The search engine indexes the HTML snapshot and serves your original AJAX URLs in its search results.

In order to make this work, the application must use a specific syntax in the AJAX URLs (let’s call them “pretty URLs;” you’ll see why in the following sections). The search engine crawler will temporarily modify these “pretty URLs” into “ugly URLs” and request those from your server. This request for an “ugly URL” indicates to the server that it should not return the regular web page it would give to a browser, but instead an HTML snapshot. When the crawler has obtained the content for the modified ugly URL, it indexes its content, then displays the original pretty URL in the search results. In other words, end users will always see the pretty URL containing a hash fragment. The following diagram summarizes the agreement:

[diagram: the process necessary for AJAX content to be crawled by Google]

For more information, see the AJAX crawling FAQ and the developer documentation.

Step-by-step guide

The first step to getting your AJAX site indexed is to indicate to the crawler that your site supports the AJAX crawling scheme. The way to do this is to use a special token in your hash fragments (that is, everything after the # sign in a URL). Hash fragments that represent unique page states must begin with an exclamation mark. For example, if your AJAX app contains a URL like this:

www.example.com/ajax.html#mystate

it should now become this:

www.example.com/ajax.html#!mystate

When your site adopts the scheme, your site will be considered “AJAX crawlable.” This means that the crawler will see the content of your app if your site supplies HTML snapshots.

Suppose you would like to get www.example.com/index.html#!mystate indexed. Your part of the agreement is to provide the crawler with an HTML snapshot of this URL, so that the crawler sees the content. How will your server know when to return an HTML snapshot instead of a regular page? The answer lies in the URL that is requested by the crawler: the crawler will modify each AJAX URL such as www.example.com/ajax.html#!mystate to temporarily become www.example.com/ajax.html?_escaped_fragment_=mystate. We refer to the former as a “pretty URL” and the latter as an “ugly URL”.

This is important for two reasons:

  • Hash fragments are never (by specification) sent to the server as part of an HTTP request. In other words, the crawler needs some way to let your server know that it wants the content for the URL www.example.com/ajax.html#!mystate.
  • Your server, on the other hand, needs to know that it has to return an HTML snapshot, rather than the normal page sent to the browser. Remember: an HTML snapshot is all the content that appears on the page after the JavaScript has been executed. Your server’s end of the agreement is to return the HTML snapshot for http://www.example.com/index.html#!mystate (that is, the original URL) to the crawler.

Note: The crawler escapes certain characters in the fragment during the transformation. To retrieve the original fragment, make sure to unescape all %XX characters in the fragment (for example, %26 should become ‘&’, %20 should become a space, %23 should become #, and %25 should become %).
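
As a rough sketch of the server side of this agreement (my own illustration, not code from the guide), a PHP front controller could branch on the _escaped_fragment_ parameter; render_html_snapshot() is a hypothetical helper that returns the fully rendered, JavaScript-free markup for a given state.

    <?php
    // Crawler request: the "ugly URL" carries the original hash fragment in the
    // _escaped_fragment_ query parameter. PHP has already decoded the %XX escapes,
    // so $_GET hands us the fragment as plain text.
    if (isset($_GET['_escaped_fragment_'])) {
        $state = $_GET['_escaped_fragment_'];   // e.g. "mystate", or "" for the homepage
        echo render_html_snapshot($state);      // hypothetical: pre-rendered HTML snapshot
        exit;
    }
    // Normal browser request: serve the regular AJAX application page.
    readfile('ajax.html');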

Now that you have your original URL back and you know what content the crawler is requesting, you need to produce an HTML snapshot. Here are some ways to do this:

  • If a lot of your content is produced with JavaScript, you may want to use a headless browser such as HtmlUnit to obtain the HTML snapshot. Alternatively, you can use a different tool such as crawljax or watij.com.
  • If much of your content is produced with a server-side technology such as PHP or ASP.NET, you can use your existing code and replace only the JavaScript portions of your web page with static or server-side created HTML.
  • You can create a static version of your pages offline. For example, many applications draw content from a database that is then rendered by the browser. Instead, you may create a separate HTML page for each AJAX URL. This is similar to Google’s previous Hijax recommendation.

Some of your pages may not have hash fragments. For example, you probably want your home page to be www.example.com, rather than www.example.com#!home. For this reason, we have a special provision for pages without hash fragments.

In order to get pages without hash fragments indexed, you include a special meta tag in the head of the HTML of your page. Important: Make sure you use this solution only for pages that include Ajax content. Adding this to non-Ajax pages creates no benefit and puts extra load on your servers and Google’s. The meta tag takes the following form:

<meta name="fragment" content="!">

This tag indicates to the crawler that it should crawl the ugly version of this URL. As per the above agreement, the crawler will temporarily map the pretty URL to the corresponding ugly URL. In other words, if you place <meta name="fragment" content="!"> into the page http://www.example.com, the crawler will temporarily map this URL to www.example.com?_escaped_fragment_= and will request this from your server. Your server should then return the HTML snapshot corresponding to www.example.com.

Please note that one important restriction applies to this meta tag: the only valid content is "!". In other words, the meta tag will always take the exact form: <meta name="fragment" content="!">, which indicates an empty hash fragment, but a page with AJAX content.
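
Continuing the earlier sketch (again my own illustration), a page without a hash fragment would carry the meta tag in its <head>, and the server would treat the empty _escaped_fragment_ value as a request for the homepage snapshot.

    <?php
    // The homepage markup includes: <meta name="fragment" content="!">
    // so the crawler requests /?_escaped_fragment_= (an empty value).
    if (isset($_GET['_escaped_fragment_']) && $_GET['_escaped_fragment_'] === '') {
        echo render_html_snapshot('home');   // hypothetical helper, as above
        exit;
    }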

Crawlers use Sitemaps to complement their discovery crawl. Your Sitemap should include the version of your URLs that you’d prefer to have displayed in search results, so in most cases it would be http://example.com/ajax.html#!foo=123 (rather than http://example.com/ajax.html?_escaped_fragment_=foo=123), unless you have an entry page to your site, such as your homepage, that you would like displayed in search results without the #!. For instance, if you want search results to display http://example.com/, include http://example.com/ in your Sitemap with <meta name="fragment" content="!"> in the <head> of your document. For more information, check out our additional articles on Sitemaps.