NovaStar Program Reference / Data Loader / Overview

The data loader is an option for ingesting data into NovaStar and is being phased in starting with NovaStar 5.4.0.0.

Introduction
File Reader Service
Data Filer Service
Data Loader Workflow
Data Harvesting Programs

Introduction

NovaStar data can be ingested using several approaches. This documentation focuses on the data loader approach, which is suitable for running workflows that harvest and process data from third-party sources and load the data into NovaStar. See also:

Data Collection - for real-time event-driven data
Data Import - for traditional NovaStar data imports, limited to specific sources

Traditional NovaStar data import programs typically involve writing a program that links with the NovaStar software libraries. This requires programming skills and often requires help from TriLynx Systems staff. Some general tools are available to facilitate data imports, such as the nsdataimport program, but they have limitations. Limitations of traditional data import programs include:

compiled programs require knowledge about NovaStar libraries and database design
each program makes a connection to the database, which can result in contention for database resources
single purpose data loading programs may be difficult to maintain
simple programs do not provide workflow features to manipulate data
programs that are interrupted may cause data to not be ingested

In contrast, the data loader relies on a simple CSV file for data and optimizes data ingestion by using a single service. Benefits of this approach are:

data harvesting programs can be written in any programming language (Bash, Python, Java, C, C++, etc.)
the CSV file for the data loader is a simple format
the data loader service optimizes performance by handling database interactions
programs that are interrupted will not lose data if the load file has been written
third-party programs are maintained and tested by other entities, reducing the maintenance cost for TriLynx Systems and its clients

TriLynx Systems will continue to add data harvester programs. System-specific tools can be developed as needed.

The data loader consists of two parts: the file reader and the data filer.

The file reader reads the output files of data harvesting programs and inserts preliminary data into an intermediate table in the database. When it has finished processing a file, it moves the file to an archival folder.

The data filer reads from the intermediate table, performs the appropriate processing for alarms, calibrations, and validation, and places the data into standard Novastar5 data tables.

File Reader Service

The file reader process is run by the file reader service (nsdataloader-filereader-service). The service runs continuously and reads data from files in the specified hot folder, by default /usr/ns/dataloader/. The data is then put into an intermediate table, and the file is moved to a specified archival folder, by default /var/log/novastar/dataloader-archive.
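
The hot-folder flow described above can be sketched roughly as follows. This is an illustration only, not the service's actual implementation: the function name `stage_hot_folder`, the `staging` table and its columns, the `*.csv` file pattern, and the use of SQLite are all assumptions made for the example.

```python
import shutil
import sqlite3
from pathlib import Path


def stage_hot_folder(hot_dir: Path, archive_dir: Path, conn: sqlite3.Connection) -> int:
    """Stage every CSV file found in hot_dir, then move each file to archive_dir.

    Returns the number of lines staged. The table and column names are
    hypothetical and do not reflect the NovaStar database design.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging (source_file TEXT, line_no INTEGER, raw TEXT)"
    )
    archive_dir.mkdir(parents=True, exist_ok=True)
    staged = 0
    for path in sorted(hot_dir.glob("*.csv")):
        with path.open(newline="") as handle:
            rows = [
                (path.name, line_no, line.rstrip("\r\n"))
                for line_no, line in enumerate(handle, start=1)
                if line.strip()  # skip blank lines
            ]
        # One transaction per file: the file is archived only after its rows commit.
        with conn:
            conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", rows)
        shutil.move(str(path), str(archive_dir / path.name))
        staged += len(rows)
    return staged
```

Archiving a file only after its rows are committed means that an interruption leaves the file in the hot folder, where it can be picked up again on the next pass.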

See the service documentation for information about starting and stopping the service.

Data Filer Service

The data filer process is run by the data filer service (nsdataloader-datafiler-service). The service runs continuously and is optimized to ingest data while minimizing load on the database server. This occurs by limiting the number of processes that write data to the database. In contrast, the older data import approach can result in many processes attempting to load data at the same time, which leads to higher contention for database resources.
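
The idea of limiting the number of database writers can be illustrated with a single-writer queue. The source does not describe how the data filer service achieves this internally, so the following is only a generic sketch of the pattern: the names `writer`, `producer`, and `load_concurrently`, the `readings` table, and the use of SQLite and threads are all assumptions.

```python
import queue
import sqlite3
import threading

STOP = object()  # sentinel telling the writer to finish


def writer(db_path: str, rows: queue.Queue) -> None:
    """Sole owner of the database connection; drains the queue until STOP."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS readings (station TEXT, value REAL)")
    while True:
        item = rows.get()
        if item is STOP:
            break
        with conn:  # commit each insert
            conn.execute("INSERT INTO readings VALUES (?, ?)", item)
    conn.close()


def producer(station: str, values: list, rows: queue.Queue) -> None:
    """Hand values to the writer instead of opening a database connection."""
    for value in values:
        rows.put((station, value))


def load_concurrently(db_path: str, batches: dict) -> None:
    """Run one producer per station while a single writer touches the database."""
    rows = queue.Queue()
    writer_thread = threading.Thread(target=writer, args=(db_path, rows))
    writer_thread.start()
    producers = [
        threading.Thread(target=producer, args=(station, values, rows))
        for station, values in batches.items()
    ]
    for thread in producers:
        thread.start()
    for thread in producers:
        thread.join()
    rows.put(STOP)  # all producers are done, so the writer can stop
    writer_thread.join()
```

However many producers run, only one connection writes to the database, which is the contrast with many import programs each opening their own connection.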

See the service documentation for information about starting and stopping the service.

Data Loader Workflow

The nsdataloader-filereader documentation provides details for implementing data harvesting software.

Data Harvesting Programs

Data harvesting programs must meet the following requirements:

Install and run on Debian Linux:
    - NovaStar systems are deployed with Java and Python, and additional environments can be enabled.
Run in batch mode from the scheduler.
Support a timeout so that automated processes do not hang.
Run as a normal user.
Create the CSV file needed by the data loader:
    1. Handle the time zone.
    2. Format interval data with the timestamp at the end of the interval.
    3. Do not load missing values (NovaStar only stores actual values).
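
The CSV-related requirements above can be sketched as a small Python helper. The column layout used here (point identifier, timestamp, value), the target time zone of UTC, the timestamp format, and the names `build_rows` and `write_load_file` are hypothetical; the actual file format is described in the nsdataloader-filereader documentation.

```python
import csv
import math
from datetime import datetime, timedelta, timezone
from pathlib import Path


def build_rows(point_id, readings, interval, source_tz, target_tz=timezone.utc):
    """Convert (naive interval-start datetime, value) pairs into load rows.

    The timestamp is shifted to the end of the interval, converted from
    source_tz to target_tz, and missing values (None or NaN) are skipped.
    """
    rows = []
    for start, value in readings:
        if value is None or (isinstance(value, float) and math.isnan(value)):
            continue  # only actual values are loaded
        end = start.replace(tzinfo=source_tz) + interval
        stamp = end.astimezone(target_tz).strftime("%Y-%m-%d %H:%M:%S")
        rows.append((point_id, stamp, value))
    return rows


def write_load_file(path: Path, rows) -> None:
    """Write rows to a CSV file (hypothetical column layout)."""
    with path.open("w", newline="") as handle:
        csv.writer(handle).writerows(rows)
```

For example, an hourly reading that starts at 00:00 in a UTC-7 source zone is written with the end-of-interval time 08:00 UTC. The timeout requirement is separate from formatting; one option in Python is to pass `timeout=` to calls such as `subprocess.run` or `urllib.request.urlopen`.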

The following are known harvester programs for different data sources and formats. The TSTool software is an initial focus due to its ability to run workflows and handle many data sources and formats.

Data Sources/Formats and Available Data Harvesters

Data Source/Format	Available Data Harvester Software
Many climate and hydrology time series	TSTool (Time Series Tool) with plugins.