Parsing

Learning Outcomes

  • extract information from a document using a parsing library

Resources

Lab

Video of this lab

Get the number of computer labs from a Langara page.

Make a new project directory, langara-computer-labs.

Change to the project directory (cd langara-computer-labs).

Use npm init to initialize the project.

Install the Axios and htmlparser2 dependencies (npm install axios htmlparser2).

Create the following index.js file to get started:

const axios = require('axios');
const htmlparser2 = require("htmlparser2");

let inTitle = false;

const parser = new htmlparser2.Parser(
    {
        onopentag(name, attribs) {
            if (name === "title") {
                inTitle = true;
            }
        },
        ontext(text) {
            if (inTitle) {
                console.log( "Page title: " + text );
            }
            inTitle = false;
        },
        onclosetag(tagname) {
            if (tagname === "html") {
                console.log("That's it!");
            }
        }
    },
    { decodeEntities: true }
);

if ( process.argv.length > 2 ) {
    axios.get( process.argv[2] )
        .then( response => {
            parser.write( response.data );
            parser.end();
        })
        .catch( error => {
            // Include the underlying reason to make failures easier to diagnose.
            console.log( "Could not fetch page: " + error.message );
        });
} else {
    console.log( "Missing URL argument" );
}

Test the program as follows:

node index.js https://langara.ca/information-technology/it-services/technology-on-campus/a-building.html

You should see the title of the web page in the console.
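
The starter program prints the first piece of text it sees inside the title element. The htmlparser2 documentation notes that ontext can be called more than once for a single run of text, so a more defensive variant collects the pieces and prints them when the title closing tag arrives. The following is a minimal sketch of that idea, not part of the lab's required code. The names makeTitleCollector and report are hypothetical, and the handlers are driven by hand here so the example runs without any dependencies.

```javascript
// Sketch: collect the title text across several ontext calls.
// makeTitleCollector and report are hypothetical names, not htmlparser2 APIs.
function makeTitleCollector(report) {
    let inTitle = false; // true between <title> and </title>
    let pieces = [];     // text chunks seen while inside the title

    return {
        onopentag(name) {
            if (name === "title") {
                inTitle = true;
                pieces = [];
            }
        },
        ontext(text) {
            if (inTitle) {
                pieces.push(text);
            }
        },
        onclosetag(tagname) {
            if (tagname === "title" && inTitle) {
                inTitle = false;
                report(pieces.join("").trim());
            }
        }
    };
}

// Simulate the events a parser might emit if the title arrived in two chunks.
const titles = [];
const titleHandlers = makeTitleCollector(title => titles.push(title));
titleHandlers.onopentag("title");
titleHandlers.ontext("Computer Labs ");
titleHandlers.ontext("& Classrooms");
titleHandlers.onclosetag("title");
console.log("Page title: " + titles[0]);
```

The same object could be passed as the first argument to new htmlparser2.Parser(...) in place of the inline handlers in index.js.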

We will now modify the program to count the number of computer labs in a building. The page above has that information in a table. If we count the number of rows in that table, we will get our answer. We have to be careful to count the rows in the right table.

To do so, we will start by creating several state variables at the top of the program:

let inTitle = false;
let countRows = false;
let done = false;
let rowCount = 0;

We need to start counting rows when we are in the right table. Fortunately, the tables with the room information have the class attribute set to table-responsive table-basic hide-mobile, so we can key in on that in the onattribute parser event handler:

        onattribute(name, value) {
            if ( name === "class" && value === "table-responsive table-basic hide-mobile") {
                countRows = true;
            }
        },
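
The onattribute check above fires for any element that carries that exact class string, whatever its tag name. As an optional design alternative, the same decision could be made in onopentag, which receives the tag name together with an attribs object. The sketch below assumes a hypothetical helper named isLabTable (not part of htmlparser2) and compares class tokens rather than the whole string, so the order of the classes would not matter.

```javascript
// Hypothetical helper: decides whether an opening tag is one of the lab tables.
// The class names come from the lab text; the helper itself is an assumption.
const LAB_TABLE_CLASSES = ["table-responsive", "table-basic", "hide-mobile"];

function isLabTable(name, attribs) {
    if (name !== "table" || typeof attribs.class !== "string") {
        return false;
    }
    // Split the class attribute into individual tokens.
    const tokens = attribs.class.split(/\s+/);
    return LAB_TABLE_CLASSES.every(cls => tokens.includes(cls));
}

console.log(isLabTable("table", { class: "table-responsive table-basic hide-mobile" })); // true
console.log(isLabTable("div", { class: "table-responsive table-basic hide-mobile" }));   // false
console.log(isLabTable("table", { class: "table-basic" }));                              // false
```

Inside onopentag(name, attribs), a call such as isLabTable(name, attribs) could then set countRows to true instead of relying on onattribute.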

Once we are counting rows, we count every tr opening tag, so add this to the onopentag event handler:

            if ( name === "tr" && countRows ) {
                rowCount++;
            }

Finally, once we get to the table closing tag, we print out the result and set the done flag so we don’t repeat for other tables on the page. Note that the code prints rowCount-1, on the assumption that the first row of the table is a header row. This can all be done by adding the following to the onclosetag event handler:

            if ( tagname === "table" && countRows && !done ) {
                console.log( "There are " + (rowCount-1) + " computer labs.");
                done = true;
            }
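
To see how the three pieces cooperate, it can help to gather them into one object and feed it events by hand. The sketch below does that under a few assumptions: makeLabCounter and report are hypothetical names, the title handling from the starter file is left out, and the events are simulated rather than produced by htmlparser2, so the example runs without any dependencies.

```javascript
// Sketch: the lab's row-counting handlers gathered into one object.
// makeLabCounter and report are hypothetical names, not htmlparser2 APIs.
function makeLabCounter(report) {
    let countRows = false; // true once the target class attribute has been seen
    let done = false;      // true once the first result has been reported
    let rowCount = 0;      // <tr> opening tags seen while counting

    return {
        onattribute(name, value) {
            if (name === "class" && value === "table-responsive table-basic hide-mobile") {
                countRows = true;
            }
        },
        onopentag(name, attribs) {
            if (name === "tr" && countRows) {
                rowCount++;
            }
        },
        onclosetag(tagname) {
            if (tagname === "table" && countRows && !done) {
                report(rowCount - 1); // leave out one row, assumed to be the header
                done = true;
            }
        }
    };
}

// Simulated events for a small table: one header row followed by two lab rows.
// The attribute event comes before the table's onopentag event.
const results = [];
const handlers = makeLabCounter(count => results.push(count));
handlers.onattribute("class", "table-responsive table-basic hide-mobile");
handlers.onopentag("table", {});
for (let i = 0; i < 3; i++) {
    handlers.onopentag("tr", {});
    handlers.onclosetag("tr");
}
handlers.onclosetag("table");
console.log("There are " + results[0] + " computer labs.");
```

Because done is set after the first report, closing tags of later tables on the page produce no further output.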

Convince yourself that the program works by testing it on the five building web pages:

node index.js https://langara.ca/information-technology/it-services/technology-on-campus/a-building.html
node index.js https://langara.ca/information-technology/it-services/technology-on-campus/b-building.html
node index.js https://langara.ca/information-technology/it-services/technology-on-campus/c-building.html
node index.js https://langara.ca/information-technology/it-services/technology-on-campus/g-and-l-building.html
node index.js https://langara.ca/information-technology/it-services/technology-on-campus/t-building.html

Assignment

  1. Write a program that will print out the city, date/time, and temperature from a weather.gc.ca page.
  2. Test with the following pages:
https://weather.gc.ca/city/pages/bc-74_metric_e.html
https://weather.gc.ca/city/pages/on-143_metric_e.html
https://weather.gc.ca/city/pages/nl-24_metric_e.html