NovaStar Program Reference / Data Import / Overview

Data import programs are run on the scheduler (cron) or manually to import data into NovaStar.

Introduction
Data Import Using the Scheduler
Importing Time Series Data
Importing Data From Log Files
Using nsdataimport, nsdataimportrpt, sendrpt, and starpt with Data Harvesting Programs
- Using TSTool as a Data Harvester
Other Data Import Programs

Introduction

Data import programs are run on the scheduler (cron) or manually to import data into NovaStar. Import programs are one option to load data:

Data import programs (this page) - typically simple programs that connect to the database and run until the data are loaded (the nsdataimport program can also perform additional actions such as FTP actions).
Data Collection programs - run as services in order to handle more complex communication protocols such as ALERT and ALERT2, cellular, and satellite.
Data Loader service - loads data from many sources and formats while optimizing database performance.

Data import programs are the traditional approach to loading third-party data into NovaStar, whereas the data loader is a new approach that provides more flexibility and can result in better system performance. The data loader is being phased in as of NovaStar 5.4.0.0.

Data import programs harvest and load data as a scheduled process, with updates typically occurring less frequently than data collection because third-party data may not be published in real time. An imported file or query may contain data for multiple stations/points. Data import programs are typically run in two modes:

Run using the scheduler to import data in near real time, for example to import data every 15 minutes. The frequency and time period for the import depend on the source data.
Run manually as needed, for example to populate data that were missed due to an outage, or to correct data in the system.

Data imports that are run on the scheduler import data for an input period relative to the current time and may import multiple stations, such as from a SHEF file that was exported from a different system.

The following sections discuss how to implement an import for various data sources and formats.

Data Import Using the Scheduler

Data imports are run using the Linux cron service. The following is the general approach used to configure an import.

Define stations and points using the NovaStar Administrator:
- The station (site, location, etc.) identifiers and data types (parameter, variable, etc.) in third-party systems generally do not match the identifiers used in NovaStar.
- See the documentation for specific import programs for additional details. For example, some import programs require that point properties are defined in the database.
Configure program files:
- If an import program is distributed with NovaStar, no further software installation is required.
- If the import program is a custom program or workflow, install it in /usr/ns/cus so that the system can find it. For example, use a folder /usr/ns/cus/data-import/somesource for a specific data source.
Configure the import:
- Use the NovaStar Administrator to configure a data import. This will update the Linux cron information.

Importing Time Series Data

Time series consist of a location identifier, date/time, and data value. The following data import programs are available for time series data.

Time Series Data Import Programs

Data Source/Format	Data Import Program
General CSV format	`nsdataimport` - requires other software to harvest and format data as CSV, which is then used by `nsdataimport`. See Using `nsdataimport`, `nsdataimportrpt`, `sendrpt`, and `starpt` with Data Harvesting Programs section below.
National Weather Service Standard Hydrologic Exchange Format (SHEF)	`nsshefimport` (see also `shefpars`)
USGS NWIS Web Service	`nspollusgswaterdata` (also can be configured for data collection).

Importing Data From Log Files

Log files from data collection programs and from stations can be loaded in cases where the NovaStar database has data gaps due to a communication, database, or other issue. A data collection program will typically continue to collect data and append to its log file even if it is unable to insert into the NovaStar database.

The following programs can be run to load data from log files.

Log File Data Import Programs

Log File Source	Data Import Program
Multiple formats	`nsdataimport` - calls the programs listed below
HydroLynx 50386	`ns50386import`
HydroLynx 50386 ALERT2	`ns50386A2import`
HydroLynx 5096	`ns5096import`
HydroLynx 5096 ALERT2	`ns5096A2import`
`nsrecdata` data collection program and service	`nsrecdatalogimport`. See also `recdatalogrefile`.
`nsautointer` data collection program and service	`nsautointerlogimport`

Using `nsdataimport`, `nsdataimportrpt`, `sendrpt`, and `starpt` with Data Harvesting Programs

Several programs load simple delimited files into NovaStar:

sendrpt - loads data for points and therefore requires using the point identifier
starpt - loads data for stations and therefore requires using the station identifier and data type for the point
nsdataimportrpt - loads data for stations and points and is called by nsdataimport
nsdataimport - provides a common command line interface and calls other data import programs to load data, including nsdataimportrpt for CSV data.

The above programs can be utilized to load data rather than writing a completely separate program. Implementing a data harvester with one of the above programs may be required for older systems where the new Data Loader is not available.

The following workflow summarizes the approach (see also Data Import Using the Scheduler section):

The data harvesting program retrieves data and saves it as a local CSV file in the format required by one of the above programs.
Run one of the above programs to load the file. The program can be run separately or can be called by the data harvesting program.
If the above two programs are run independently, a simple script can be created to run the first step and then the second step.
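
The workflow above can be sketched as a small wrapper script. This is a minimal sketch under stated assumptions, not a script distributed with NovaStar: the function names harvest, load, and run_import are hypothetical, data-import-program is a placeholder whose arguments depend on the import program that is chosen, and the command file location reuses the example folder naming from this page.

```shell
#!/bin/sh
# Sketch of a wrapper that runs the harvest step and then the load step.
# All function names are illustrative and are not part of NovaStar.

# Example command file location (an assumption, adjust for the data source).
COMMAND_FILE="${COMMAND_FILE:-/usr/ns/cus/data-import/tstool-usgs/thecommandfile.tstool}"

# Step 1: harvest data and save it as a local CSV file.
harvest() {
  tstool --commands "$COMMAND_FILE"
}

# Step 2: load the CSV file. "data-import-program" is a placeholder:
# replace it with the chosen import program and its arguments.
load() {
  data-import-program
}

# Run step 2 only if step 1 succeeded, so that a failed harvest
# does not cause an old CSV file to be loaded again.
run_import() {
  if ! harvest; then
    echo "harvest failed, skipping load" >&2
    return 1
  fi
  load
}
```

Ending the script with a call to run_import runs both steps in order, and the scheduler then only needs to run the one script.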

Data harvesting programs must meet the following requirements:

Have documentation that explains how to use the software.
Install and run on Debian Linux:
- NovaStar systems are deployed with Java and Python and additional environments can be enabled.
Run in batch mode from the scheduler.
Run as a normal user.
Create the CSV file needed by the data import program:
1. Handle time zone.
2. Format interval data with timestamp at the end of the interval.
3. Do not load missing values (NovaStar only stores actual values).
Manage runs to avoid hung programs:
1. Support a timeout so that automated processes do not hang: either use the timeout feature of one of the above programs, or use the Linux timeout program when configuring the scheduled process.
2. Optionally, implement logic to avoid running duplicate programs, for example a lock file that can be checked. This must be sophisticated enough to avoid mistaken locks.
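
The run management requirements above can be sketched with two standard Linux utilities: flock (from util-linux) for the lock and timeout (from GNU coreutils) for the time limit. This is only one possible approach, and the function name run_guarded and any lock file name are hypothetical. Because the kernel releases a flock lock when the process holding it exits, a crashed or killed run does not leave a stale lock behind, which is one way to avoid mistaken locks.

```shell
#!/bin/sh
# Sketch: run a command under a lock and a time limit.
# Usage: run_guarded LOCK_FILE TIME_LIMIT COMMAND [ARGS...]
run_guarded() {
  lock_file="$1"
  time_limit="$2"
  shift 2
  # flock -n gives up immediately (exit status 1 by default) if another
  # run still holds the lock, instead of queuing behind it.
  # timeout stops the command when the limit is reached (exit status 124).
  flock -n "$lock_file" timeout "$time_limit" "$@"
}
```

For example, run_guarded /usr/ns/cus/data-import/somesource/import.lock 5m tstool --commands CommandFile.tstool would skip the run if the previous one is still active and would stop TSTool after five minutes. The lock file name import.lock is an assumption for illustration.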

The following are known harvester programs for different data sources and formats. The TSTool software is an initial focus due to its ability to run workflows and handle many data sources and formats.

Data Sources/Formats and Available Data Harvesters

Data Source/Format	Available Data Harvester Software
Many climate and hydrology time series	TSTool (Time Series Tool) with plugins. See the Using TSTool as a Data Harvester section below.

Using TSTool as a Data Harvester

The TSTool (Time Series Tool) software, which has been developed by the Open Water Foundation for the State of Colorado, is able to read time series data from many sources and formats. Using TSTool to harvest data therefore leverages the extensive existing investment in that software. TSTool also provides many commands that can be used to create workflows, for example to perform data manipulations necessary to feed into NovaStar.

TSTool is a Java program that can have a large memory and CPU footprint, depending on which features are used. TSTool data harvesting workflows should be evaluated for performance to make sure that the workflow is completing within the time window allocated for the data import. If many TSTool workflows are run at the same time or the system load is high, it may be best to consolidate TSTool calls into fewer workflows using one of the following techniques:

Use a script to run the TSTool workflows in sequence.
Run workflows in sequence using a command file that runs each workflow using the RunCommands command.
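
The first technique can be sketched as a short shell function that runs one command file at a time, so that only one TSTool process is active at once. This is a minimal sketch: the function name run_workflows and the TSTOOL variable are hypothetical, and only the tstool --commands syntax is taken from this page.

```shell
#!/bin/sh
# Sketch: run several TSTool command files one after another.
# TSTOOL can be set to the full path of the tstool program.
TSTOOL="${TSTOOL:-tstool}"

# Usage: run_workflows COMMAND_FILE [COMMAND_FILE...]
run_workflows() {
  failed=0
  for command_file in "$@"; do
    # Each workflow finishes before the next one starts.
    if ! "$TSTOOL" --commands "$command_file"; then
      echo "workflow failed: $command_file" >&2
      failed=1
    fi
  done
  # Non-zero if any workflow failed, so the caller can detect problems.
  return "$failed"
}
```

A failed workflow is reported but does not stop the remaining workflows, which keeps one bad data source from blocking the others. Whether that is the desired behavior depends on the system.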

The basic command line syntax for TSTool is as follows (see Running TSTool in Command Line Batch Mode):

tstool --commands /path/to/CommandFile

As of TSTool 14.8.5, the --disabled-datastores=Datastore1,DataStore2,... and --enabled-datastores=Datastore1,DataStore2,... command parameters are available to enable and disable datastores at runtime and override the datastore configuration file Enabled properties. This allows a TSTool session to use only the datastores that are needed for the workflow and avoid unnecessary startup tasks. Running TSTool interactively will use the enabled datastores in the datastore configuration files, as usual.

Because a TSTool workflow to harvest data creates a simple CSV file, TSTool can be run separately from NovaStar when developing the workflow. For example, run TSTool on a Windows computer. The data harvesting workflow can be tested and managed under version control.

Once the workflow functionality is confirmed, it can be copied to a NovaStar server for use with the TSTool software installed on that server. It is recommended to use a separate folder for the workflow files, for example /usr/ns/cus/data-import/tstool-usgs/thecommandfile.tstool. The workflow can create the TSTool log file and CSV file in the same folder or a results subfolder, and the data file can be loaded by one of the import programs mentioned above.

The following examples illustrate command lines that can be used in the data import scheduler. If TSTool will run the data import program with a RunProgram command, then the scheduler can run the TSTool command.

tstool --commands CommandFile.tstool ...

timeout 5m tstool --commands CommandFile.tstool ...

If the TSTool workflow will not run the data import program, then a simple script can be written to run TSTool and the data import program and that script can be run from the scheduler:

tstool-data-import.bash

timeout 5m tstool-data-import.bash

Another option is to use a semicolon to run multiple Linux commands on the same command line, as in the following example. The full command line should be tested independently of the scheduler to make sure that it works.

tstool --commands CommandFile.tstool ... ; data-import-program ...

timeout 5m tstool --commands CommandFile.tstool ... ; timeout 5m data-import-program ...

In the above examples, appropriate paths should be used to the TSTool software and command file.

Additional considerations include:

TSTool creates a startup log file and the StartLog command can be used to create a log file for the workflow. The log files can be used for troubleshooting. The log file typically has the same name as the TSTool command file, with .log at the end.
Additional logging may need to be implemented in the data import program or run script to support troubleshooting if the TSTool and data import program log files are not sufficient.

Other Data Import Programs

The following table lists other data import programs.

Other Data Import Programs

Data Source/Format	Data Import Program
	`nscopydata`