Configuring Parsers with gsqlcmd

Configuring Parsers with gsqlcmd

Contents:

Parsing HTML

gsqlcmd allows extracting data from HTML tables.

You can extract the required table data using the /rootPath option.

For example, you can download dividends from the nasdaq.com website like

gsqlcmd download http://www.nasdaq.com/symbol/aapl/dividend-history aapl.htm

Then, get available tables using the command:

gsqlcmd show-html-tables aapl.htm tables.htm /openOutput

The resulting tables.htm file contains data like this:

Example of the show-html-tables mode for dividends from nasdaq.com

You can see that the dividends are in table 3.

So, use the command like this to convert the downloaded file:

gsqlcmd convert aapl.htm dividends.csv /rootPath=3 /quoteChar=

or like this to extact data from the web page:

gsqlcmd convert http://www.nasdaq.com/symbol/aapl/dividend-history dividends.csv /rootPath=3 /quoteChar=

The resulting dividends.csv contains data like the following:

ExEffDate;Type;CashAmount;DeclarationDate;RecordDate;PaymentDate
2019-02-08;Cash;0.73;2019-01-29;2019-02-11;2019-02-14
2018-11-08;Cash;0.73;2018-11-01;2018-11-12;2018-11-15
2018-08-10;Cash;0.73;2018-07-31;2018-08-13;2018-08-16
2018-05-11;Cash;0.73;2018-05-01;2018-05-14;2018-05-17
2018-02-09;Cash;0.63;2018-02-01;2018-02-12;2018-02-15

gsqlcmd detects the correct table in most cases automatically.

So, you can try a simpler command first:

gsqlcmd convert aapl.htm dividends.csv

If the required page contains several tables with the required data, apply the convert mode for every table.

If the tables have the same structure, specify multiple indexes of the tables in the /rootPath option separated by commas like /rootPath=20,21,22.

If the table does not contain headers, use /noSourceHeaders option. In this case, gsqlcmd names columns like F1, F2, and so on.

Parsing XML

gsqlcmd allows extracting table data from hierarchical XML files.

In most cases, gsqlcmd detects the desired data automatically.

However, you can use the /rootPath and other parser options to customize the parser.

For example, test.xml contains the following contents:

<?xml version="1.0" encoding="utf-8"?>
<root>
    <parent>
        <row id="1"><f1>data11</f1><f2>data12</f2></row>
        <row id="2"><f1>data21</f1><f2>data22</f2></row>
        <row id="3"><f1>data31</f1><f2>data32</f2></row>
    </parent>
</root>

The basic convert command looks like this:

gsqlcmd convert test.xml test.csv

The resulting test.csv will have the following content:

id;"f1";"f2"
1;"data11";"data12"
2;"data21";"data22"
3;"data31";"data32"

gsqlcmd selects the "best" root of the data.

This command is equivalent to

gsqlcmd convert test.xml test.csv /rootPath=root.parent.row

You can change the root path to get data from the specified node.

For example:

gsqlcmd convert test.xml test.csv /rootPath=root.parent

The result is the following:

row_id;"row_f1";"row_f2"
1;"data11";"data12"
2;"data21";"data22"
3;"data31";"data32"

You can use the /quoteChar option to remove quotes.

For example:

gsqlcmd convert test.xml test.csv /rootPath=root.parent /quoteChar=

row_id;row_f1;row_f2
1;data11;data12
2;data21;data22
3;data31;data32

Use the show-roots mode to list applicable roots.

For example:

gsqlcmd show-roots test.xml

root
root.parent
root.parent.row
root.parent.row.id
root.parent.row.f1
root.parent.row.f2

Take a look at the following options that allows customizing the parser:

/attributedColumns=<col>.<attr>[;...]
/collapsedNodes=<node>[;...]
/ignoredTags=<tag>[;...]
/keptNodes=<node>[;...]
/requiredColumns=<name>[;...]
/skippedNodes=<node>[;...]

For example, you can skip required nodes using /skippedNodes option like:

gsqlcmd convert test.xml test.csv /quoteChar= /skippedNodes=f2

id;f1
1;data11
2;data21
3;data31

Parsing JSON

Parsing JSON is similar to the parsing XML as both document formats have hierarchical structures.

For example, test.json contains the following contents:

{"root":
  {"parent":[
    {"id":1,"f1":"data11","f2":"data12"},
    {"id":2,"f1":"data21","f2":"data22"},
    {"id":3,"f1":"data31","f2":"data32"}
  ]
}

The basic convert command returns the following:

gsqlcmd convert test.json test.csv

Id;"F1";"F2"
1;"data11";"data12"
2;"data21";"data22"
3;"data31";"data32"

The JSON parser capitalizes the first letters of column names.

Use the show-roots mode to list applicable roots.

For example:

gsqlcmd show-roots test.json

root
root.parent
root.parent.row
root.parent.row.id
root.parent.row.f1
root.parent.row.f2

You can select the required data using the /rootPath option.

For example:

gsqlcmd convert test.json test.csv /rootPath=root /quoteChar=

Parent_id;Parent_f1;Parent_f2
1;data11;data12
2;data21;data22
3;data31;data32

The JSON parser support multiple roots. For example:

gsqlcmd convert test.json test.csv /rootPath=root.parent.f1,root.parent.f2 /quoteChar=

F1;F2
data11;data12
data21;data22
data31;data32

You can use * instead of node names. For example:

gsqlcmd convert test.json test.csv /rootPath=*.*.f1,*.*.f2 /quoteChar=

You can find a lot of ready-to-use samples in the Downloading and converting to CSV folder of the downloaded gsqlcmd package.

Also, take a look at the following options that allows customizing the parser:

/keptNodes=<node>[;...]
/requiredColumns=<name>[;...]
/skippedNodes=<node>[;...]

Parsing Plain Text

gsqlcmd allows getting data from plain text files if the text contains detected columns like these:

id  f1      f2
1   data11  data12
2   data21  data22
3   data31  data32

Use the convert command to extract data:

gsqlcmd convert test.txt test.csv

The resulting test.csv file contains the following:

id;"f1";"f2"
1;"data11";"data12"
2;"data21";"data22"
3;"data31";"data32"

Note you can also use Microsoft ODBC and OLE DB text drivers to select and convert plain text data.

In the initial step, update the schema.ini file using the make-ini command:

gsqlcmd make-ini test.txt schema.ini

Then use the exec command to select data:

gsqlcmd exec test.txt "SELECT * FROM test.txt" test.csv

You can use SQL-supported features like WHERE or GROUP in such commands.

Parsing CSV

gsqlcmd allows extracting data from CSV files and converting to any supported output format.

Take a look at the useful formatting options:

/add=<header>=<value | function>[;...]
/addRowNum
/dateFormat=<format>
/dateTimeFormat=<format>
/escapeChar=<char>
/noBOM
/outputCodepage=<codepage>
/outputCulture=<name>
/[outputFormat=]asText | asTxt | asCsv | asHtm | asHtml | asXml | asJson
/[[output]Separator](formatting-options.htm#option-outputSeparator)=<separator> | Tab
/quoteChar=<char>
/rowNumBase=<integer>
/timeFormat=<format>

For example, you can use the following command to add new columns, like Symbol and Date, and convert the output file to the UTF-8 encoding.

gsqlcmd convert test.csv test-65001.csv /outputEncoding=65001 /add=Symbol=AAPL;Date=Date()

Note you can also use Microsoft ODBC and OLE DB text drivers to select and convert CSV data.

In the initial step, update the schema.ini file using the make-ini command:

gsqlcmd make-ini test.csv schema.ini

Then, use the exec command to select data:

gsqlcmd exec test.csv "SELECT 'AAPL' AS Symbol, :date AS [Date], * FROM test.csv" output.csv /set=date=Date()

You can use SQL-supported features like WHERE or GROUP in such commands.