Puppeteer.JS – Using Headless Chrome for Site Crawling

Puppeteer essentially allows you to automate Chrome. Headless Chrome runs the browser without opening a visible window. That may sound silly, but it has a lot of useful applications. You could, for example, write a test script that ensures your website is still working correctly.

Installation

npm i puppeteer
# or
yarn add puppeteer

Usage

We are going to look at a quick example of how to log in to a site and then perform an operation on it.

Initialize Puppeteer

You need to run it in an async function, because launching Chrome and every page operation is asynchronous: each call returns a Promise that you wait for with await.

const puppeteer = require('puppeteer');

(async () => {
    // Init Puppeteer
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage(); // New page to be manipulated

    // Automation

    // Close Browser
    await browser.close();
})();

This starts the browser. The headless flag defaults to true; for debugging purposes, you should set it to false so that you can watch what the script does.

Login

To log in to the site we need four things:

  • The URL for the Login Page
  • CSS Selector for the Username Field
  • CSS Selector for the Password Field
  • CSS Selector for the Submit Button

To obtain the selectors you can use the Chrome DevTools (F12). Right-click the HTML element in the Elements panel and choose Copy, then Copy selector.

async function logIn(page) {
    const LOGIN_URL = 'https://example.com/login';
    await page.goto(LOGIN_URL);
    await page.focus('#username');
    await page.keyboard.type(USERNAME);
    await page.focus('#password');
    await page.keyboard.type(PASSWORD);
    // Start waiting for the navigation before clicking, so it cannot be missed
    await Promise.all([
        page.waitForNavigation(),
        page.click('#form-submit'),
    ]);
    console.log("LOGIN COMPLETE");
}

Now that you are logged in to the site, you can navigate to any page and fetch all of its links.

async function analysePage(page) {
    const PAGE_URL = 'https://example.com/';
    await page.goto(PAGE_URL);
    // Runs inside the browser and returns the href of every anchor element
    const links = await page.evaluate(() => {
        return Array.from(document.querySelectorAll('a')).map((val) => val.href);
    });
    console.log(links);
}
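
When crawling, the raw list of links usually contains duplicates, in-page anchors, and links to other sites. The following is a minimal sketch of a filter that could be applied to the links array from the function above. The name sameOriginLinks and the sample URLs are illustrative assumptions and not part of Puppeteer.

```javascript
// Keeps only unique links that point to the same origin as the start page.
function sameOriginLinks(links, baseUrl) {
    const origin = new URL(baseUrl).origin;
    const seen = new Set();
    for (const link of links) {
        let url;
        try {
            url = new URL(link, baseUrl);
        } catch (err) {
            continue; // skip values that are not valid URLs
        }
        if (url.origin !== origin) continue;
        url.hash = ''; // "#section" anchors point to the same page
        seen.add(url.href);
    }
    return Array.from(seen);
}

// Hypothetical sample data
const links = [
    'https://example.com/about',
    'https://example.com/about#team',
    'https://other.example.org/',
    'mailto:info@example.com',
];
console.log(sameOriginLinks(links, 'https://example.com/'));
```

A Set is used so that each page is reported only once, no matter how often it is linked.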

Final Code

const puppeteer = require('puppeteer');
// In an ES module project use instead:
// import puppeteer from 'puppeteer';

const USERNAME = 'user';
const PASSWORD = 'user';

(async () => {
    // Init Puppeteer
    const browser = await puppeteer.launch({ headless: true });
    const page = await browser.newPage();

    // Automation: await each step so the browser is not closed too early
    await logIn(page);
    await analysePage(page);

    // Close Browser
    await browser.close();
})();

async function logIn(page) {
    const LOGIN_URL = 'https://example.com/login';
    await page.goto(LOGIN_URL);
    await page.focus('#username');
    await page.keyboard.type(USERNAME);
    await page.focus('#password');
    await page.keyboard.type(PASSWORD);
    // Start waiting for the navigation before clicking, so it cannot be missed
    await Promise.all([
        page.waitForNavigation(),
        page.click('#form-submit'),
    ]);
    console.log("LOGIN COMPLETE");
}

async function analysePage(page) {
    const PAGE_URL = 'https://example.com/';
    await page.goto(PAGE_URL);
    // Runs inside the browser and returns the href of every anchor element
    const links = await page.evaluate(() => {
        return Array.from(document.querySelectorAll('a')).map((val) => val.href);
    });
    console.log(links);
}

Image: Designed by Freepik

DEV281x: Introduction to ReactJS

I took a quick look at Microsoft's Introduction to ReactJS course, not to actually learn React, but to evaluate how good the materials are for a beginner learning React.

tl;dr Avoid the assessment. Maybe use create-react-app instead of codepen. Covers the basics quite well.

Lesson 01 | JSX and React Components

Instead of using “create-react-app” as a starting point, the course tells you to go to codepen to check out react.

I guess that makes it a little more focused on the React concepts and less focused on all the surrounding technologies that are needed in a production environment.

After that, it covers the basics of ReactJS, and JSX.

If you are a beginner you probably want to check out create-react-app.

Lesson 02 | State, Life Cycle and Event Handlers

I think the chapter covers the basics quite well. Nothing really to add here.

Lesson 03 | Lists and Forms

I think it is very nice that they cover the basics of handling multiple Form Elements with a generic handler.

handleChange(event){
  this.setState({[event.target.name]: event.target.value});
}
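
To see why one handler is enough for several inputs, here is a minimal sketch that runs without React. The FormModel class, its simplified setState, and the field names are assumptions made for illustration. Only handleChange is taken from the course.

```javascript
// Minimal stand-in for a React component; only the state handling is modelled.
class FormModel {
    constructor() {
        this.state = { firstName: '', email: '' };
    }

    // Simplified setState: merges the update into the current state.
    setState(update) {
        this.state = Object.assign({}, this.state, update);
    }

    // Same handler as above: the name of the input selects the state key.
    handleChange(event) {
        this.setState({ [event.target.name]: event.target.value });
    }
}

const form = new FormModel();
form.handleChange({ target: { name: 'firstName', value: 'Ada' } });
form.handleChange({ target: { name: 'email', value: 'ada@example.com' } });
console.log(form.state);
```

The computed property name [event.target.name] is what makes the handler generic, provided that each input's name attribute matches a key in the state.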

Assessment

The assessment is quite theoretical and asks rather complicated questions. These questions are especially hard if you have already worked with React, as you would never even try to build something in that way.

As a beginner, I would completely skip the assessment. It does not help your understanding of React. It also confronts you with bad practices, while implying that this is how you would actually do things.

Conclusion

If you use create-react-app with vscode I think you would have a much better starting experience. The course is great to get started with ReactJS if you avoid the assessments and rather just experiment with your own code base.

package.json: Updating Fixed Versions with npm-check

One of the common problems when running a larger project is that you need to pin fixed versions in your package.json file, but at the same time you need to update your packages regularly. The most elegant way to do both is npm-check. This small tool lets you interactively select which packages should be updated and then updates them accordingly.

Installation

npm i -g npm-check

Usage

To update the packages in your project, simply run npm-check -u. If you want exact versions (x.y.z) written to package.json instead of version ranges, add the optional flag -E.

npm-check -u -E

Press Space to select packages and Enter to install the selected updates.
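
To check whether a project really uses fixed versions everywhere, a small script can scan the package.json. This is only a sketch. The function name findRangedDependencies and the sample package data are assumptions, and anything that is not a plain x.y.z version (including git URLs) is reported.

```javascript
// Returns the names of dependencies whose version is not an exact x.y.z version.
function findRangedDependencies(pkg) {
    const exact = /^\d+\.\d+\.\d+(-[0-9A-Za-z.-]+)?$/;
    const sections = ['dependencies', 'devDependencies'];
    const ranged = [];
    for (const section of sections) {
        for (const [name, version] of Object.entries(pkg[section] || {})) {
            if (!exact.test(version)) ranged.push(name);
        }
    }
    return ranged;
}

// Hypothetical package.json content
const pkg = {
    dependencies: { express: '4.16.2', lodash: '^4.17.4' },
    devDependencies: { eslint: '~4.15.0' },
};
console.log(findRangedDependencies(pkg));
```

In a real project you would pass in the parsed file, for example with require('./package.json').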

Disable the package-lock.json file

With npm 5, npm started to create a “package-lock.json” file. It ensures that the dependency tree is identical in every developer's environment. See the official npm documentation for details.

Now, that may be important on some projects. In my personal projects, I do not really care. And in my professional projects, I did not have any issues with how things were done before.

With the new way of doing things, you must first remove the package-lock.json file and then run “npm update” to install newer versions of your packages. This can cause problems when another person also recreates the package-lock file, when a merge conflict occurs, and so on. In the end, this totally defeats the purpose of the file.

Disabling the Package Lock File for a Project

  1. Create a file called .npmrc
  2. Open the file and add the line package-lock=false

Global Disable

If you want to disable the behavior for all projects, run

npm config set package-lock false

In the future, the feature may become easier to use. For now, I at least will stick to the old way of doing things.

Git: Hooks run `npm install` on checkout

When working on a project you usually install various packages from npm. Of course, these packages are maintained and updated, adding features, security fixes, and stability patches.

One person on your team should run npm outdated once per week to see what has been updated and test whether the new versions can be integrated into the project. The updated package.json then causes a problem for all other developers on the project: if a package has major breaking changes, the code needs to be adjusted, and that adjusted code will not run in the other developers' environments until they run npm update to install the missing or outdated packages.

The solution to this problem is git hooks. Git can execute scripts on specific events, such as before a commit (pre-commit) or before a push (pre-push). For my use case, I would like to run npm update after a developer checks out from the git repository; this is the “post-checkout” event.

Native Git Hooks

To create a git hook you need to add a file to your project called .git/hooks/post-checkout (On Linux add the executable bit with chmod +x)

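#!/bin/sh
# Git passes three arguments to post-checkout:
# $1 = previous HEAD, $2 = new HEAD, $3 = 1 for a branch checkout, 0 for a file checkout
echo "[post-checkout hook: $1]"

# Files that differ between the previous and the new HEAD
changed_files="$(git diff-tree -r --name-only --no-commit-id "$1" "$2")"

# Runs the command in $2 if any changed file matches the pattern in $1
check_run() {
  echo "$changed_files" | grep -E --quiet "$1" && eval "$2"
}

check_run package.json "npm update"

exit 0 # Needed so Visual Studio Code does not display an error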

You will test this and say: yes, this works as intended, let's commit it to the repository. Now you will discover that you cannot commit files inside .git to the repository. Git does not allow this due to security concerns, as git hooks can execute any shell script.

The workaround for this issue is to simply add it into a folder called git-hooks/ and tell the developers to copy the file when they set up their dev environment.

Husky

As always, if there is a problem in JavaScript development, there is an npm package to solve it. Husky uses the package.json to define the scripts that are executed via git hooks. Simply install Husky:

npm install husky -D

Then edit the package.json:

{
    "name": "test-project",
    "scripts": {
        "postmerge": "npm update"
    }
}

The Husky solution would also allow you to execute your own js file, maybe also doing some cleanup of files or running tests etc.

Web Tooling and Automation using gulp 4

When working with Gulp 3 you will run into a couple of minor problems, especially when sequencing different types of tasks. Take a simple clean task: it should run before all of the other tasks. However, Gulp will run all tasks in parallel.

The team working on Gulp has been developing a solution for this problem, and in Gulp 4.0 we will get gulp.series() and gulp.parallel() to distinguish between these two types of operations.

Sadly, it looks like the release of Gulp 4.0 is currently delayed. However, you can use the current alpha version prior to its release.

Installing Gulp 4

You can simply install the next version of Gulp with npm. A git client is required, as the package is not in the npm repository and will be cloned from GitHub.

npm install gulpjs/gulp#4.0

Updating the gulpfile.js

The syntax for tasks has changed, so we need to rewrite parts of the file. Our existing Gulp 3 task

gulp.task("default", ["html", "scripts", "styles", "images"]);

needs to be rewritten with gulp.parallel() as such:

gulp.task("default", gulp.parallel(html, scripts, styles, images));

However, we should not stop there. We can improve upon this task by running the clean task before all of the other tasks. To ensure that it runs before the other tasks we use gulp.series().

gulp.task("default",
    gulp.series(cleanDist, gulp.parallel(html, scripts, styles, images))
);
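
To make the ordering guarantee concrete, here is a plain JavaScript sketch of the idea behind the two combinators. It is not gulp's implementation: the series and parallel functions below are simplified stand-ins that only support callback-style tasks and have no error handling.

```javascript
// Simplified stand-ins for gulp.series() and gulp.parallel(); not gulp's implementation.
function series(...tasks) {
    return function (done) {
        let index = 0;
        function next() {
            if (index === tasks.length) return done();
            const task = tasks[index++];
            task(next); // start the next task only after this one signals completion
        }
        next();
    };
}

function parallel(...tasks) {
    return function (done) {
        let remaining = tasks.length;
        if (remaining === 0) return done();
        for (const task of tasks) {
            // All tasks are started right away, without waiting for each other
            task(() => {
                remaining -= 1;
                if (remaining === 0) done(); // finished once every task has signalled
            });
        }
    };
}

const log = [];
const cleanDist = (done) => { log.push('clean'); done(); };
const html = (done) => { log.push('html'); done(); };
const styles = (done) => { log.push('styles'); done(); };

series(cleanDist, parallel(html, styles))(() => log.push('finished'));
console.log(log);
```

Whatever order the parallel tasks finish in, 'clean' is always logged first and 'finished' last, which is the behaviour the default task relies on.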

Improving on the production flag

With gulp.series() we can now stop using the --production flag. We simply define a production task.

As the first operation we pass in a function that simply sets the production boolean to true:

gulp.task('production', gulp.series((done) => { production = true; done(); }, 'default'));

The function uses “done” to signal to gulp that it has completed.

Hiding Tasks

Another improvement of Gulp 4.0 is that you can pass functions, as well as tasks to gulp.series and gulp.parallel.

This, in turn, lets you write normal functions that are hidden from the command line, ensuring that everybody on the project runs the default build task instead of only the HTML task.

Here is now our improved Gulpfile.js:

//Gulp Packages
const gulp = require('gulp');
const gulpIf = require('gulp-if');
const browserSync = require('browser-sync').create();
const clean = require('gulp-clean');

const pug = require("gulp-pug");

const imagemin = require('gulp-imagemin');

const sass = require("gulp-sass");
const autoprefixer = require("gulp-autoprefixer");

const ts = require("gulp-typescript");
const tslint = require("gulp-tslint");
const uglify = require('gulp-uglify');

const jasmine = require("gulp-jasmine-phantom")

//Global Variables
var production = false;

//Path Definitions
const paths = {
    html: {
        src: "src/views/**/*.pug",
        dest: "dist"
    },
    styles: {
        src: "src/styles/**/*.scss",
        dest: "dist/styles"
    },
    scripts: {
        src: "src/scripts/**/*.ts",
        dest: "dist/scripts"
    },
    images: {
        src: "src/img/*",
        dest: "dist/img"
    }
}

//Internal Tasks
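function cleanDist() {
    // allowEmpty: do not fail when dist does not exist yet; read: false skips reading file contents
    return gulp.src('dist', { read: false, allowEmpty: true })
        .pipe(clean());
}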

function html() {
    return gulp.src(paths.html.src)
        .pipe(pug())
        .pipe(gulp.dest(paths.html.dest))
}

function scripts () {
    return gulp.src(paths.scripts.src)
        .pipe(ts({
            out: "output.js"
        }))
        .pipe(tslint({
            formatter: "verbose"
        }))
        .pipe(tslint.report())
        .pipe(gulpIf(production, uglify()))
        .pipe(gulp.dest(paths.scripts.dest));
}

function styles() {
    let sassOptions = {};
    if (production) {
        sassOptions = {
            outputStyle: 'compressed'
        }
    }
    return gulp.src(paths.styles.src)
        .pipe(sass(sassOptions).on('error', sass.logError))
        .pipe(autoprefixer({
            browsers: ['last 2 versions']
        }))
        .pipe(gulp.dest(paths.styles.dest))
        .pipe(browserSync.stream())
}

function images () {
    return gulp.src(paths.images.src)
        .pipe(gulpIf(production, imagemin()))
        .pipe(gulp.dest(paths.images.dest));
}

//External Tasks

gulp.task("tests", () => {
    return gulp.src("src/tests/test.js")
        .pipe(jasmine({
            integration: true,
            vendor: '_build/**/*.js'
        }));
})

gulp.task("default",
    gulp.series(cleanDist, gulp.parallel(html, scripts, styles, images))
);

gulp.task('production', gulp.series((done) => { production = true; done(); }, 'default'));

gulp.task("serve", gulp.series('default', () => {
    browserSync.init({
        server: "./dist"
    });
    gulp.watch(paths.styles.src, gulp.series(styles));
    gulp.watch(paths.scripts.src,  gulp.series(scripts));
    gulp.watch(paths.images.src, gulp.series(images));
    gulp.watch(paths.html.src, gulp.series(html)).on('change', browserSync.reload);
}));

Package.json

{
  "name": "gulp4-test",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "devDependencies": {
    "browser-sync": "^2.18.7",
    "eslint": "^3.15.0",
    "gulp": "github:gulpjs/gulp#4.0",
    "gulp-autoprefixer": "^3.1.1",
    "gulp-clean": "^0.3.2",
    "gulp-eslint": "^3.0.1",
    "gulp-if": "^2.0.2",
    "gulp-imagemin": "^3.1.1",
    "gulp-jasmine-phantom": "^3.0.0",
    "gulp-pug": "^3.2.0",
    "gulp-sass": "^3.1.0",
    "gulp-tslint": "^7.1.0",
    "gulp-typescript": "^3.1.4",
    "gulp-uglify": "^2.1.2",
    "phantomjs": "^2.1.7",
    "tslint": "^4.4.2",
    "typescript": "^2.1.6",
    "typescript-eslint-parser": "^1.0.3"
  }
}

JSP: Passing Variable Data to JavaScript

When you try to marry old JSP technology with the modern wonders of TypeScript/ES6, you will want to expose some data provided by the backend to the JavaScript. If you have the possibility, you would use a fetch() call to receive JSON. Sometimes, however, it is not possible to do a big rewrite of the JSP to fix a minor bug. Here is a quick and dirty way to pass data from the JSP into the JS code, using HTML5 data attributes.

JSP File:

<fmt:message var="someText" key="core.someText" />

<div id="jsVariables"
     data-some-text="${someText}"
    >
</div>

JavaScript:

let dataContainer = document.getElementById('jsVariables');
let someText = dataContainer.dataset.someText;

JQuery:

let someText = $('#jsVariables').data('someText')

This is a fairly elegant solution (well, for JSPs I guess). Now you can run ESLint on your JavaScript code and will not have any unresolved variables.
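
One detail that is easy to get wrong is the naming. HTML attribute names are case-insensitive, and dataset turns hyphenated names into camelCase keys. The helper below is only an illustration of that mapping; datasetKey is a made-up name and not a browser API.

```javascript
// Converts an HTML data-* attribute name into the key used by element.dataset.
function datasetKey(attributeName) {
    const name = attributeName.toLowerCase(); // HTML attribute names are case-insensitive
    if (!name.startsWith('data-')) {
        throw new Error('Not a data attribute: ' + attributeName);
    }
    return name
        .slice('data-'.length)
        .replace(/-([a-z])/g, (match, letter) => letter.toUpperCase());
}

console.log(datasetKey('data-some-text')); // someText
console.log(datasetKey('data-someText'));  // sometext
```

So to read dataset.someText in JavaScript, the attribute in the markup has to be written as data-some-text.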

Udacity – Web Tooling and Automation

I recently took a look at the course materials for Web Tooling and Automation.

Overall the course is very well structured and introduces Gulp and a couple of common packages used in web development. Besides their main topic, they cover topics on good engineering practices, like linting and testing to ensure code quality.

While working on the project I ran into several small things that were quite annoying. Thankfully the gulp community is quite big, so somebody had already solved some of the issues I was facing.

Passing a “--production” flag

When developing, you will probably create two versions of your software: one suited for easily finding bugs and errors, and one that is minified and optimized for the end user.

You could define two different tasks in gulp, one “default” and one “production” task. This, however, would force you to duplicate your code: once with optimization and once without.

I found the package “gulp-if”, which lets you control whether a step like compression is active during a task. The remaining issue was to actually set the parameter before the tasks run (all tasks in gulp run in parallel).

To get a flag from the command line, you can use the process.argv array. However, you must add “--” before your flag name. If you do not, gulp will assume it is the name of another task that should run.

In the end, you would use something like this:

//Enable Production Flag
var production = (process.argv.indexOf("--production") !== -1);

//Use in task

gulp.task("default", () => {
    return gulp.src(path.src)
        .pipe(gulpif(production, foo()))
        .pipe(gulp.dest(path.dest));
});

Note: In Gulp 4, you can sequence tasks with gulp.series(), so you do not need to pass the flag on the command line. Instead, you define a task that sets the flag and runs before all the other tasks.

Dealing with Asset sources and destinations

When using gulp.src() and gulp.dest(), people typically pass the locations as string literals. This makes it hard to get a quick overview of which locations are used. For better maintainability, define these strings in a small block of variables. In the long run, this makes it easier to change where your files live.

//Path Definitions
const htmlPaths = {
    src: "src/views/**/*.pug",
    dest: "dist"
}

gulp.task("html", () => {
    // Return the stream so gulp knows when the task has finished
    return gulp.src(htmlPaths.src)
        .pipe(pug())
        .pipe(gulp.dest(htmlPaths.dest));
});

End Result

At the end of the course, I ended up with this gulpfile.js. It adds support for Typescript, Pug(Jade), google-closure-compiler.

The common gulp tasks to run are:

  • gulp serve: Uses browser-sync with CSS injection for live-editing
  • gulp --production: Creates an optimized build

Next steps: Depending on your web server, you would want to add a gulp deploy task.

gulpfile.js

//Gulp Packages
const gulp = require('gulp');
const gulpif = require('gulp-if');
const browserSync = require('browser-sync').create();

const pug = require("gulp-pug");

const imagemin = require('gulp-imagemin');

const sass = require("gulp-sass");
const autoprefixer = require("gulp-autoprefixer");

const ts = require("gulp-typescript");
const eslint = require("gulp-eslint");
const closureCompiler = require('google-closure-compiler').gulp();

const jasmine = require("gulp-jasmine-phantom")

//Flag Definitions
var production = (process.argv.indexOf("--production") !== -1);


//Path Definitions
const htmlPaths = {
    src: "src/views/**/*.pug",
    dest: "dist"
}

const stylesPaths = {
    src: "src/styles/**/*.scss",
    dest: "dist/styles"
}

const scriptsPaths = {
    src: "src/scripts/**/*.ts",
    dest: "dist/scripts"
}

const imgPaths = {
    src: "src/img/*",
    dest: "dist/img"  
}

//Task Definitions
gulp.task("default", ["html", "scripts", "styles", "images"]);

gulp.task("serve", ["default"], () => {
    browserSync.init({
        server: "./dist"
    });
    gulp.watch(stylesPaths.src, ["styles"]);
    gulp.watch(scriptsPaths.src, ["scripts"]);
    gulp.watch(imgPaths.src, ["images"]);
    gulp.watch(htmlPaths.src, ["html"]).on('change', browserSync.reload);
});

gulp.task("html", () => {
    return gulp.src(htmlPaths.src)
        .pipe(pug())
        .pipe(gulp.dest(htmlPaths.dest))
});

gulp.task("scripts", () => {
    return gulp.src(scriptsPaths.src)
        .pipe(ts({
            out: "output.js"
        }))
        .pipe(eslint({
            parser: "typescript-eslint-parser"
        }))
        .pipe(eslint.format())
        .pipe(gulpif(production, closureCompiler({
            compilation_level: 'SIMPLE',
            warning_level: 'VERBOSE',
            language_in: 'ECMASCRIPT6_STRICT',
            language_out: 'ECMASCRIPT5_STRICT',
            output_wrapper: '(function(){\n%output%\n}).call(this)',
            js_output_file: 'output.min.js'
        })))
        .pipe(gulp.dest(scriptsPaths.dest));
});

gulp.task("styles", () => {
    let sassOptions = {};
    if (production) {
        sassOptions = {
            outputStyle: 'compressed'
        }
    }
    return gulp.src(stylesPaths.src)
        .pipe(sass(sassOptions).on('error', sass.logError))
        .pipe(autoprefixer({
            browsers: ['last 2 versions']
        }))
        .pipe(gulp.dest(stylesPaths.dest))
        .pipe(browserSync.stream())
});

gulp.task("tests", () => {
    return gulp.src("src/tests/test.js")
        .pipe(jasmine({
            integration: true,
            vendor: '_build/**/*.js'
        }))
})


gulp.task("images", () => {
    return gulp.src(imgPaths.src)
        .pipe(imagemin())
        .pipe(gulp.dest(imgPaths.dest));
})

Package.json

{
  "name": "udacity-webtooling",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC",
  "devDependencies": {
    "browser-sync": "^2.18.7",
    "eslint": "^3.15.0",
    "eslint-config-google": "^0.7.1",
    "google-closure-compiler": "^20170124.0.0",
    "gulp": "^3.9.1",
    "gulp-autoprefixer": "^3.1.1",
    "gulp-eslint": "^3.0.1",
    "gulp-if": "^2.0.2",
    "gulp-imagemin": "^3.1.1",
    "gulp-jasmine-phantom": "^3.0.0",
    "gulp-pug": "^3.2.0",
    "gulp-sass": "^3.1.0",
    "gulp-tslint": "^7.1.0",
    "gulp-typescript": "^3.1.4",
    "phantomjs": "^2.1.7",
    "tslint": "^4.4.2",
    "typescript": "^2.1.6",
    "typescript-eslint-parser": "^1.0.3"
  }
}

Continuous Integration (CI) for Gitbook using Gitlab and Gulp

Gitbook is a static site generator that converts a collection of Markdown files into an HTML site. Alternatively, it can also convert the Markdown files into a PDF or ebook. If you are not writing a book, it is also a great tool to create quick documentation for a project you are working on.

Initial set up

We will need gitbook. Gitbook does not automatically generate a SUMMARY.md file; however, there is an existing gitbook-summary tool to take care of that. Gulp will be our task runner. I will deploy to my server via FTP. Since you are only serving HTML files, there is usually no need for a server restart. To integrate FTP into Gulp I will be using vinyl-ftp.

npm init

Install Gitbook Tools

npm install --save-dev gitbook gitbook-cli gitbook-summary

Install Gulp with Tools

npm install --save-dev gulp gulp-gitbook

Install FTP with needed Gulp Tools

npm install --save-dev vinyl-ftp gulp-util

Gulp

Create a file called gulpfile.js and define your gulp tasks. You should test the tasks locally, especially “deploy”, to check that everything works correctly.

let gulp = require("gulp");
let gitbook = require("gulp-gitbook");
let ftp = require("vinyl-ftp");
let gutil = require("gulp-util");
var summary = require("gitbook-summary/lib/summary");

gulp.task("default", ["build"]);

//Generates the Summary.md file
gulp.task('summary', ()=> {
    let options = {
        root: "."
    }
    return summary(options);
});

//Generates the Website
gulp.task('build', ["summary"], (cb) => {
  gitbook('.', cb);
});


//Deploys the output onto your server
gulp.task('deploy', ["build"], () => {
    var conn = ftp.create( {
        host:     '<url>',
        user:     '<username>',
        password: '<password>',
        parallel: 10,
        log:      gutil.log
    } );
 
    var globs = [
        '_book/**'
    ];

    return gulp.src( globs, { buffer: false } )
        .pipe( conn.newer( '.' ) ) // only upload newer files 
        .pipe( conn.dest( '<server directory>/' ) ); // Depending on your server you may have to prefix something like /home/www/
});
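
Since the gulpfile is committed to the repository, you may prefer not to hard-code the FTP credentials in it. The following is a minimal sketch that reads them from environment variables instead. The function name ftpOptionsFromEnv and the variable names FTP_HOST, FTP_USER and FTP_PASSWORD are assumptions made for this example.

```javascript
// Builds the connection options for ftp.create() from environment variables.
// The variable names FTP_HOST, FTP_USER and FTP_PASSWORD are assumptions for this sketch.
function ftpOptionsFromEnv(env) {
    const required = ['FTP_HOST', 'FTP_USER', 'FTP_PASSWORD'];
    const missing = required.filter((name) => !env[name]);
    if (missing.length > 0) {
        // Fail early with a clear message instead of a failed FTP login
        throw new Error('Missing environment variables: ' + missing.join(', '));
    }
    return {
        host: env.FTP_HOST,
        user: env.FTP_USER,
        password: env.FTP_PASSWORD,
        parallel: 10,
    };
}

// Example with made-up values; in the gulpfile you would pass process.env
const options = ftpOptionsFromEnv({
    FTP_HOST: 'ftp.example.com',
    FTP_USER: 'deploy',
    FTP_PASSWORD: 'secret',
});
console.log(options.host);
```

In the deploy task this would become ftp.create(ftpOptionsFromEnv(process.env)), with the values defined as CI/CD variables in the GitLab project settings rather than in the repository.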

Gitlab CI Integration

You need to create a YAML File called .gitlab-ci.yml. Gitlab will recognize the file and run the commands in it.

image: node:latest

cache:
  paths:
   - node_modules/

before_script:
  - npm install
  - npm install -g gulp

deploy:
  only:
   - master
  script:
   - gulp deploy

That’s it. If you push something into the master branch, it will automatically run the commands in the yaml file and deploy your static website to your server. When the build completes, you will receive an email, telling you if everything went as planned.