NovaStar Program Reference / Data Import / Overview
Data import programs are run on the scheduler (cron) or manually to import data into NovaStar.
- Introduction
- Data Import Using the Scheduler
- Importing Time Series Data
- Importing Data From Log Files
- Using nsdataimport, nsdataimportrpt, sendrpt, and starpt with Data Harvesting Programs
- Other Data Import Programs
Introduction
Data import programs are run on the scheduler (cron) or manually to import data into NovaStar. Import programs are one option to load data:
- Data import programs (this page) - typically simple programs that make a connection to the database and run until the data are loaded (the nsdataimport program provides functionality to perform additional actions such as FTP actions).
- Data Collection programs - run as services in order to handle more complex communication protocols such as ALERT and ALERT2, cellular, satellite
- Data Loader service - load data from many sources and formats while optimizing database performance
Data import programs are the traditional approach to loading third-party data into NovaStar, whereas the data loader is a new approach that provides more flexibility and can result in better system performance. The data loader is being phased in as of NovaStar 5.4.0.0.
Data import programs harvest and load data as a scheduled process, with updates typically occurring less frequently than data collection because third party data may not be published in real-time. An imported file or query may contain data for multiple stations/points. Data import programs are typically run in two modes:
- Run using the scheduler to import data in near real time, for example import data every 15 minutes. The frequency and time period for the import depend on the source data.
- Run manually as needed, for example to populate data that were missed due to an outage, or to correct data in the system.
Data imports that are run on the scheduler import data for an input period relative to the current time and may import multiple stations, such as from a SHEF file that was exported from a different system.
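As an illustration of an input period relative to the current time, the following minimal sketch computes a start and end time for a scheduled run. The function name, the 6-hour lookback, and the alignment to 15-minute boundaries are assumptions made for this example and are not NovaStar settings.

```python
from datetime import datetime, timedelta, timezone


def import_window(now, lookback, step):
    """Return (start, end) for an import period relative to 'now'.

    'now' must be a timezone-aware datetime.
    'end' is 'now' rounded down to a multiple of 'step' and
    'start' is 'end' minus 'lookback'.
    """
    step_seconds = int(step.total_seconds())
    epoch_seconds = int(now.timestamp())
    aligned = epoch_seconds - epoch_seconds % step_seconds
    end = datetime.fromtimestamp(aligned, tz=now.tzinfo)
    return end - lookback, end


# Example: a run that starts at 12:07:30 UTC and imports the previous 6 hours.
now = datetime(2024, 5, 1, 12, 7, 30, tzinfo=timezone.utc)
start, end = import_window(now, lookback=timedelta(hours=6), step=timedelta(minutes=15))
print(start.isoformat(), end.isoformat())
```

A lookback that is longer than the scheduling interval lets a later run pick up values that the source published late.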
The following sections discuss how to implement an import for various data sources and formats.
Data Import Using the Scheduler
Data imports are run using the Linux cron service. The following is the general approach used to configure an import.
- Define stations and points using the NovaStar Administrator:
- The station (site, location, etc.) identifiers and data types (parameter, variable, etc.) in third party systems generally do not match the identifiers used in NovaStar
- See the documentation for specific import programs for additional details. For example, some import programs require that point properties are defined in the database.
- Configure program files:
- If an import program is distributed with NovaStar, no further software installation is required.
- If the import program is a custom program or workflow, install it in /usr/ns/cus so that the system can find it. For example, use a folder /usr/ns/cus/data-import/somesource for a specific data source.
- Configure the import:
- Use the NovaStar Administrator to configure a data import. This will update the Linux cron information.
Importing Time Series Data
Time series consist of location identifier, date/time, and data value. The following data import programs are available for time series data.
Time Series Data Import Programs
Data Source/Format | Data Import Program |
---|---|
General CSV format | nsdataimport - requires other software to harvest and format data as CSV, which is then used by nsdataimport. See the Using nsdataimport, nsdataimportrpt, sendrpt, and starpt with Data Harvesting Programs section below. |
National Weather Service Standard Hydrologic Exchange Format (SHEF) | nsshefimport (see also shefpars) |
USGS NWIS Web Service | nspollusgswaterdata (also can be configured for data collection). |
Importing Data From Log Files
Log files from data collection programs and stations can be loaded in cases where the NovaStar database has data gaps due to a communication, database, or other issue. A data collection program will typically continue to collect data and append to its log file even if it is unable to insert into the NovaStar database.
The following programs can be run to load data from log files.
Log File Data Import Programs
Log File Source | Data Import Program |
---|---|
Multiple formats | nsdataimport - calls the programs listed below |
HydroLynx 50386 | ns50386import |
HydroLynx 50386 ALERT2 | ns50386A2import |
HydroLynx 5096 | ns5096import |
HydroLynx 5096 ALERT2 | ns5096A2import |
nsrecdata data collection program and service | nsrecdatalogimport. See also recdatalogrefile. |
nsautointer data collection program and service | nsautointerlogimport |
Using nsdataimport, nsdataimportrpt, sendrpt, and starpt with Data Harvesting Programs
Several programs load simple delimited files into NovaStar:
- sendrpt - loads data for points and therefore requires using the point identifier
- starpt - loads data for stations and therefore requires using the station identifier and data type for the point
- nsdataimportrpt - loads data for stations and points and is called by nsdataimport
- nsdataimport - provides a common command line interface and calls other data import programs to load data, including nsdataimportrpt for CSV data.
The above programs can be utilized to load data rather than writing a completely separate program. Implementing a data harvester with one of the above programs may be required for older systems where the new Data Loader is not available.
The following workflow summarizes the approach (see also Data Import Using the Scheduler section):
- Data harvesting program retrieves data and saves as a local CSV file, consistent with one of the above programs.
- Run one of the above programs to load the file. The program can be run separately or can be called by the data harvesting program.
- If the above two programs are run independently, a simple script can be created to run the first step and then the second step.
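The two-step workflow above can be sketched as a small Python runner that executes the harvest step and then the load step, stopping if the first step fails. This is a minimal sketch: the function name is hypothetical and the two commands shown are stand-ins that only print a message. They would be replaced with the real harvester and import program command lines.

```python
import subprocess
import sys


def run_steps(steps, timeout_seconds):
    """Run each command in order and stop at the first failure.

    Returns 0 if every step succeeded, otherwise the exit code of the
    step that failed. subprocess.TimeoutExpired is raised if a step
    runs longer than timeout_seconds.
    """
    for command in steps:
        result = subprocess.run(command, timeout=timeout_seconds)
        if result.returncode != 0:
            return result.returncode
    return 0


# Stand-in commands (assumptions for this example only).
harvest = [sys.executable, "-c", "print('step 1: harvest data and write CSV')"]
load = [sys.executable, "-c", "print('step 2: load CSV with an import program')"]

exit_code = run_steps([harvest, load], timeout_seconds=300)
print("exit code:", exit_code)
```

Stopping at the first failure avoids loading a stale or partial CSV file when the harvest step did not complete.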
Data harvesting programs must meet the following requirements:
- Have documentation that explains how to use the software.
- Install and run on Debian Linux:
- NovaStar systems are deployed with Java and Python and additional environments can be enabled.
- Run in batch mode from the scheduler.
- Run as a normal user.
- Create the CSV file needed by the data loader:
- Handle time zone.
- Format interval data with timestamp at the end of the interval.
- Do not load missing values (NovaStar only stores actual values).
- Manage runs to avoid hung programs:
- Support timeout so that automated processes do not hang, use the timeout feature of one of the above programs, or use the Linux timeout program when configuring the scheduled process.
- Optionally, implement logic to avoid running duplicate programs, for example a lock file that can be checked. This must be sophisticated enough to avoid mistaken locks.
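The CSV requirements above (handle time zone, timestamp at the end of the interval, skip missing values) can be illustrated with the following minimal sketch. The column names and order, the timestamp format, the fixed UTC-7 source offset, and the choice of UTC for output are all assumptions made for this example. The layout and time zone actually expected by the import program must be taken from that program's documentation.

```python
import csv
import io
from datetime import datetime, timedelta, timezone


def write_harvest_csv(rows, out, source_tz, interval):
    """Write (station_id, interval_start, value) rows as CSV.

    - interval_start is a naive datetime in the source time zone
    - the written timestamp is the END of the interval, converted to UTC
    - rows with a missing value (None) are skipped, not written
    Returns the number of data rows written.
    """
    writer = csv.writer(out)
    writer.writerow(["station_id", "datetime", "value"])  # hypothetical columns
    written = 0
    for station_id, interval_start, value in rows:
        if value is None:
            continue  # do not load missing values
        start_utc = interval_start.replace(tzinfo=source_tz).astimezone(timezone.utc)
        interval_end = start_utc + interval
        writer.writerow([station_id, interval_end.strftime("%Y-%m-%d %H:%M:%S"), value])
        written += 1
    return written


source_tz = timezone(timedelta(hours=-7))  # assumed fixed offset of the source data
rows = [
    ("STA1", datetime(2024, 5, 1, 0, 0), 1.5),
    ("STA1", datetime(2024, 5, 1, 0, 15), None),  # missing value, skipped
    ("STA2", datetime(2024, 5, 1, 0, 0), 0.0),    # zero is an actual value, kept
]
buffer = io.StringIO()  # when writing to a real file, open it with newline=""
count = write_harvest_csv(rows, buffer, source_tz, timedelta(minutes=15))
print(buffer.getvalue())
```

Note that a value of zero is written because it is an actual observation, whereas None represents a missing value and is dropped.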
The following are known harvester programs for different data sources and formats. The TSTool software is an initial focus due to its ability to run workflows and handle many data sources and formats.
Data Sources/Formats and Available Data Harvesters
Data Source/Format | Available Data Harvester Software |
---|---|
Many climate and hydrology time series | TSTool (Time Series Tool) with plugins. See the Using TSTool as a Data Harvester section below. |
Using TSTool as a Data Harvester
The TSTool (Time Series Tool) software, which has been developed by the Open Water Foundation for the State of Colorado, is able to read time series data from many sources and formats. Therefore, using TSTool to harvest data leverages extensive investment. TSTool also provides many commands that can be used to create workflows, for example to perform data manipulations necessary to feed into NovaStar.
TSTool is a Java program that can have a large memory and CPU footprint, depending on which features are used. TSTool data harvesting workflows should be evaluated for performance to make sure that the workflow is completing within the time window allocated for the data import. If many TSTool workflows are run at the same time or the system load is high, it may be best to consolidate TSTool calls into fewer workflows using one of the following techniques:
- Use a script to run the TSTool workflows in sequence.
- Run workflows in sequence using a command file that runs each workflow using the RunCommands command.
The basic command line syntax for TSTool is as follows (see Running TSTool in Command Line Batch Mode):
tstool --commands /path/to/CommandFile
As of TSTool 14.8.5, the --disabled-datastores=Datastore1,DataStore2,... and --enabled-datastores=Datastore1,DataStore2,... command parameters are available to enable and disable datastores at runtime and override the datastore configuration file Enabled properties. This allows a TSTool session to use only the datastores that are needed for the workflow and avoid unnecessary startup tasks. Running TSTool interactively will use the enabled datastores in the datastore configuration files, as usual.
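When a script launches TSTool, the command line described above can be assembled as an argument list. The following is a minimal sketch: the function and its parameter names are hypothetical, and Datastore1 and Datastore2 are the placeholder names used above. Only the --commands and --enabled-datastores parameters and the Linux timeout program come from this documentation.

```python
import shlex


def build_tstool_command(command_file, enabled_datastores=None, timeout=None):
    """Return the argument list for a batch TSTool run.

    enabled_datastores: optional list of datastore names to enable
    timeout: optional duration for the Linux timeout program, such as "5m"
    """
    command = ["tstool", "--commands", command_file]
    if enabled_datastores:
        command.append("--enabled-datastores=" + ",".join(enabled_datastores))
    if timeout:
        command = ["timeout", timeout] + command
    return command


command = build_tstool_command(
    "/path/to/CommandFile",
    enabled_datastores=["Datastore1", "Datastore2"],
    timeout="5m",
)
# Print the command as it could be typed in a shell or scheduler entry.
print(shlex.join(command))
```

Building a list rather than a single string avoids quoting problems when the command is later passed to subprocess.run.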
Because a TSTool workflow to harvest data creates a simple CSV file, TSTool can be run separately from NovaStar when developing the workflow. For example, run TSTool on a Windows computer. The data harvesting workflow can be tested and managed under version control.
Once the workflow functionality is confirmed, it can be copied to a NovaStar server for use with the TSTool software installed on that server. It is recommended to use a separate folder for the workflow files, for example /usr/ns/cus/data-import/tstool-usgs/thecommandfile.tstool. The workflow can create the TSTool log file and CSV file in the same folder or a results subfolder, and the data file can be loaded by one of the import programs mentioned above.
The following examples illustrate command lines that can be used in the data import scheduler.
If TSTool will run the data import program with a RunProgram command, then the scheduler can run the TSTool command.
tstool --commands CommandFile.tstool ...
timeout 5m tstool --commands CommandFile.tstool ...
If the TSTool workflow will not run the data import program, then a simple script can be written to run TSTool and the data import program and that script can be run from the scheduler:
tstool-data-import.bash
timeout 5m tstool-data-import.bash
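Wrapper logic of this kind can also be written in Python. The following minimal sketch shows one way to combine a timeout with a lock file so that a scheduled run is skipped while a previous run is still active. The function name and lock file path are hypothetical. The sketch uses an operating system file lock (flock), which is released automatically when the process exits, so a crashed run does not leave a mistaken lock behind. It is Linux-specific.

```python
import fcntl
import os
import subprocess
import sys
import tempfile


def run_once(command, lock_path, timeout_seconds):
    """Run command unless another run currently holds the lock.

    Returns the command's exit code, 124 if the command timed out
    (the exit code that the Linux timeout program also uses), or
    None if the run was skipped because the lock is already held.
    """
    with open(lock_path, "w") as lock_file:
        try:
            # Non-blocking exclusive lock: fails immediately if already held.
            fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return None
        try:
            return subprocess.run(command, timeout=timeout_seconds).returncode
        except subprocess.TimeoutExpired:
            return 124
    # The lock is released when lock_file is closed at the end of the with block.


# Stand-in command (assumption): replace with the harvest and import command line.
lock_path = os.path.join(tempfile.mkdtemp(), "data-import.lock")
status = run_once([sys.executable, "-c", "print('import run')"], lock_path, 300)
print("status:", status)
```

A skipped run returns None so that the caller can log the event and decide whether it indicates a hung process.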
Another option is to use a semicolon to run multiple Linux commands on the same command line, as in the following example. The full command line should be tested independently of the scheduler to make sure that it works.
tstool-data-import.bash --commands CommandFile.tstool ... ; data-import-program ...
timeout 5m tstool-data-import.bash --commands CommandFile.tstool ... ; timeout 5m data-import-program ...
In the above examples, appropriate paths should be used to the TSTool software and command file.
Additional considerations include:
- TSTool creates a startup log file and the StartLog command can be used to create a log file for the workflow. The log files can be used for troubleshooting. The log file typically has the same name as the TSTool command file, with .log at the end.
- Additional logging may need to be implemented in the data import program or run script to support troubleshooting, if the TSTool and data import program log files are not sufficient.
Other Data Import Programs
The following table lists other data import programs.
Other Data Import Programs
Data Source/Format | Data Import Program |
---|---|
nscopydata |