Database concept for MiniSEED data files

On archive systems containing tens of thousands of (MiniSEED) data files or more, the file referencing system based on sfdfiles (see reading MiniSEED data into SH/SHM) is no longer efficient. For such cases a database concept for accessing waveform data has been developed. The location of all files of the archive, together with supplementary information, is stored in a database; SH and SHM then retrieve the necessary information from the database instead of using sfdfiles. With a large number of entries, database requests are much faster than reading from plain files. As an alternative, lookup files (directory files for sfdfiles) may help to some extent, but they tend to make things complicated and confusing.

Such a database has been implemented at the SZGRF. Its name is sfdb (seed file database). It is implemented on a MySQL database system, but it should run on any standard database system, since no API (application programming interface) is used. Instead, all calls from SH/SHM to the database are issued as shell commands, which makes the interface quite flexible.
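
For illustration, a request to sfdb can be issued as an ordinary shell command; the following one-liner is only a sketch (the real queries are generated by SH/SHM):

# Query the sfdb database in batch mode (-B) with a command-line statement (-e).
mysql -B -e "SELECT COUNT(*) FROM sftab;" sfdb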

The main idea of the database concept is that the data files are divided into three classes:

  1. (A)-type: files from online data transmission protocols such as SeedLink, which are usually available for a limited time in a specific directory tree;
  2. (B)-type: intermediate-term archive files, collected from various data sources and usually available with a time delay of a day or a few days;
  3. (C)-type: permanently archived files which have been analysed, quality checked and put to a final location on a RAID system or CD/DVD jukebox. These data are usually split into pieces of moderate size (e.g. DVD capacity) and become available after some weeks.

To keep the database up to date, the tables have to be updated every minute or every few minutes for (A)-type files and once a day for (B)-type files; for (C)-type files the database should be updated as soon as a new archive path appears. Once the data paths of an archive have been categorised in this way, the database can be set up as described in the following sections.

Based on this concept of (A)-, (B)- and (C)-type files, an archiving system has been implemented at the SZGRF.

Tables used in sfdb

The sfdb database uses the following tables:

  • pathtab (rootpaths for data directories)
  • sftab (pointing to MiniSEED data files)
  • newfilescmd (finding new data files)
  • surveytab (archive directories to survey)
  • integrity (data integrity check via md5 checksums)
  • qualtab (data quality statistics)
  • backup (backup administration)
  • rsynctab (backup of paths using rsync)

The tables pathtab and sftab contain the information about where the data are stored. The other tables are used for maintenance of the database.
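
As an illustration of the concept, the two central tables might look roughly as follows. This is a sketch only: the column names are assumptions, and the authoritative definitions are in $SH_UTIL/sfdb/create_tables.s.

# Sketch of the two central tables; column names are assumptions,
# the real definitions are in $SH_UTIL/sfdb/create_tables.s.
mysql -B -e "
CREATE TABLE pathtab (
  id       INT PRIMARY KEY,  -- pathid: 1-99 (A), 101-999 (B), above 1000 (C)
  rootpath VARCHAR(255)      -- root directory of the archive path
);
CREATE TABLE sftab (
  pathid   INT,              -- references pathtab.id
  relpath  VARCHAR(255),     -- data file location below the root path
  stream   VARCHAR(32),      -- station/channel identification
  tstart   DATETIME,         -- start time of the data in the file
  tend     DATETIME,         -- end time of the data in the file
  priority INT               -- file priority (see remarks below)
);" sfdb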

Setting up the database

After creating the tables (see $SH_UTIL/sfdb/create_tables.s, which contains valid commands for a mysql database), the main archive paths must be inserted manually into the tables pathtab and surveytab.

Please use pathid numbers from 1 to 99 for (A)-type files, from 101 to 999 for (B)-type files and numbers above 1000 for permanent archives ((C)-type files). Define an empty entry for pathid zero and a dummy entry for pathid 1000 as a separator between temporary and permanent archive paths.

Paths to (C)-type files may initially be omitted from the table sftab, as they can be managed automatically via the table surveytab. Examples of data entries can be found in $SH_UTIL/sfdb/create_tables.s.
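
As a hedged illustration of these numbering conventions (the paths and the column layout are made-up examples, not the SZGRF setup; see create_tables.s for real entries):

# Illustrative entries only; paths and column layout are assumptions.
mysql -B -e "
INSERT INTO pathtab VALUES (0,    '');                -- empty entry for pathid zero
INSERT INTO pathtab VALUES (1,    '/data/online');    -- (A)-type, pathid 1-99
INSERT INTO pathtab VALUES (101,  '/data/midterm');   -- (B)-type, pathid 101-999
INSERT INTO pathtab VALUES (1000, 'dummy');           -- separator entry
INSERT INTO pathtab VALUES (1001, '/raid/arch2006');  -- (C)-type, pathid above 1000
INSERT INTO surveytab VALUES ('/raid', 1, 9);         -- rootpath, subdirlev, defpri
" sfdb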

New or modified data files of (A)- and (B)-type are detected using the commands in the table newfilescmd. For each data path with an id between 1 and 999, this table should hold commands for finding such files. In the current setup, only the command m_cmd is used for the ids 1 to 99, and only the command d_cmd for the ids 101 to 999. The root paths of all permanent archive directories should be put into the table surveytab. All directories one level below these paths (i.e. subdirlev=1; other values are currently ignored) are regarded as permanent archive paths and are inserted into the table sftab automatically (once the cron jobs have been started).
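
The command strings stored in m_cmd and d_cmd could plausibly wrap a find call over the path in question, along the following lines (a sketch with assumed paths, not the actual SZGRF entries):

# Possible m_cmd for an (A)-path: files modified within the last <backtime> minutes.
find /data/online -name '*.??[zne]' -mmin -3
# Possible d_cmd for a (B)-path: files modified within the last <backtime> days.
find /data/midterm -name '*.??[zne]' -mtime -3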

The table integrity contains information about data integrity, i.e. the results of checksum tests. All archive paths with ids above 1000 ((C)-type files) must have an md5 checksum file, named checksum.md5, in their top-level directory. This checksum file is checked regularly to verify that the data are still readable. The time of the last check and its result are stored in the table entries. If such a checksum file does not exist, it is created (this requires write permission on the data directories for the sfdb cron job).
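
In terms of standard tools, creating and verifying such a checksum file amounts to something like the following (a sketch of the essential steps, using an assumed archive path):

# Create the checksum file in the top-level directory of a (C)-type path.
cd /raid/arch2006
find . -name '*.??[zne]' -exec md5sum {} + > checksum.md5
# Verify later that the data are still readable and unchanged.
md5sum -c checksum.md5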

Cron Jobs

The cron jobs of the sfdb system keep the tables of the database up to date and perform checks for data integrity. The jobs to be set up are:

  • update_sfdb.py <backtime> <unit>: Finds new files on the temporary archive paths. Should run every few minutes with <unit> 'm' and once a day with <unit> 'd'.
  • sfdb_void_entry_check.py <maxpathid>: Checks for outdated (void) entries in the temporary archive paths. Should run at least once a day.
  • sfdb_survey_paths.csh: Looks for new data directories in the permanent archive. Should run once per hour.
  • integrity_check.csh: Performs the data integrity check. Should run several times per day.
  • new_tmpfiles.csh, clean_tmpfiles.csh: Manage temporary data files.

A typical example of a crontab file would be:

*/2 * * * * /usr/local/SH/sh/util/sfdb/update_sfdb.py 3 m >/dev/null 2>/dev/null
1,4,7,10,13,16,19,22,25,28,31,34,37,40,43,46,49,52,55,58 * * * * /usr/local/SH/sh/util/sfdb/new_tmpfiles.csh >/dev/null 2>/dev/null 
8 7,9 * * * /usr/local/SH/sh/util/sfdb/update_sfdb.py 3 d >/dev/null 2>/dev/null
9 3 * * * /usr/local/SH/sh/util/sfdb/sfdb_void_entry_check.py 1000 >>log/sfdb_void_entry_check.log 2>>log/sfdb_void_entry_check.err
29 * * * * /usr/local/SH/sh/util/sfdb/sfdb_void_entry_check.py 100 >>log/sfdb_void_entry_check_x.log 2>>log/sfdb_void_entry_check_x.err
13 * * * * /usr/local/SH/sh/util/sfdb/sfdb_survey_paths.csh >>log/sfdb_survey_paths.log 2>>log/sfdb_survey_paths.err
09 0,10,21 * * * /usr/local/SH/sh/util/sfdb/integrity_check.csh >>log/integrity_check.log 2>&1
15 18 * * * /usr/local/SH/sh/util/sfdb/clean_tmpfiles.csh >/dev/null 2>/dev/null

A more detailed description of the cron jobs and utility routines located in $SH_UTIL/sfdb follows:

  • integrity_check.csh: Creates missing checksum files (checksum.md5) on the paths from pathtab, one path at a time, and checks one existing checksum file per run, putting the result into the table integrity.
  • sfdb_check_lost_files.csh <rootpath> [<priority>] [<interactive>]: Finds all files matching the wildcard '*.??[zne]' under a given rootpath and checks for the existence of a corresponding entry in the table sftab. If no file entry is found and a priority is given, the file is added to the table. An entry for the given rootpath must exist in the table pathtab, otherwise the program exits. Called by sfdb_survey_paths.csh; see the example after this list.
  • sfdb_list.csh: Gives a summary of all (C)-type archive paths of the sfdb database (pathid > 1000).
  • sfdb_survey_paths.csh: Loops over all paths of the table surveytab with subdirlev=1. Data files found in new directories one level below these paths are inserted into the database. Insertion is done using an sfdfile if one containing $ROOT text elements is found. If no sfdfile is found, or the sfdfile does not contain $ROOT, the data files are inserted without using a possibly existing sfdfile, provided the directory is older than one day. This job does not check for new or changed data files in old (already existing) permanent archive paths! The priority of paths with ids lower than 100 is set to 3; paths with ids between 101 and 999 are set to priority 6.
  • sfdb_void_entry_check.py <maxpathid>: Loops over all paths of pathtab with a pathid between 0 and <maxpathid> and removes any outdated sftab entries found. Outdated means that an sftab entry points to a non-existing file.
  • update_sfdb.py <backtime> <unit>: Finds modified files on all archive paths with low path ids and updates their database entries. <unit> may be 'm' (minutes) or 'd' (days). With unit 'm', archive paths with ids lower than 100 are searched for files modified within the last <backtime> minutes; with unit 'd', archive paths with ids lower than 1000 are searched for files modified within the last <backtime> days. The commands for finding files are taken from the table newfilescmd. Usually run from cron.
  • new_tmpfiles.csh, clean_tmpfiles.csh: Take SEED or MiniSEED data files from an input directory and move them to a temporary directory (pointed to by sftab entry 1000). The content of this directory is surveyed by the database routines and is automatically deleted after 4 days by clean_tmpfiles.csh. In this way data files may be inserted temporarily into the sfdb system.
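
For example, to register the files of a manually copied permanent archive directory with priority 9, sfdb_check_lost_files.csh could be invoked as follows (the rootpath is a hypothetical example and must already have an entry in pathtab):

# Hypothetical invocation: add missing sftab entries below /raid/arch2006.
$SH_UTIL/sfdb/sfdb_check_lost_files.csh /raid/arch2006 9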

Remarks on priority entries

The table sftab has a column priority and the table surveytab has a column defpri; these control the file priority, i.e. which file is used for data input if more than one is available. Usually there will be some overlap in the time intervals covered by the (A)-, (B)- and (C)-type files. For example, if a time window is contained both in the online directory ((A)-type) and in the intermediate-term archive ((B)-type), where the data have already been quality checked, then the latter should have the higher priority, in order to get the best possible data and to make use of the higher access speed of the RAID systems on which the (B)- and (C)-type data usually reside.

In the setup of the SZGRF, the data paths with ids between 1 and 99 have priority 3, the path ids 101 to 999 have priority 6, and the ids above 1000 have priority 9. If the data file with the highest priority is not available, the read request fails; a read retry on files with lower priority is currently not implemented.
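
Conceptually, the selection corresponds to a query of the following kind, ordered by descending priority (an illustrative sketch only; the stream name, time window and column names are assumptions, and the real query text is generated internally by SH/SHM):

# Illustrative only: list candidate files for a time window, best priority first.
mysql -B -e "
SELECT pathid, relpath, priority
  FROM sftab
 WHERE stream = 'GRA1-BH-Z'
   AND tstart <= '2006-12-28 12:10:00'
   AND tend   >= '2006-12-28 12:00:00'
 ORDER BY priority DESC;
" sfdb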

SH/SHM interface

The database interface has been implemented since versions 5.0c (SH) and 2.4g (SHM) of 28-Dec-2006. To access the database from SH or SHM, enter DB or DB: as the sfd path. SH/SHM will launch a shell process to retrieve possible data file candidates for the read command. The result of the database request is stored in a temporary file, currently /tmp/sql_sfdb_*.000. A new read time or read length issues a new database request, which overwrites this file. The name resolutions of the root paths (entries of the table pathtab) are cached in SH/SHM to minimise the number of calls to this table; for this reason the first read procedure in SH/SHM usually takes a little longer than all following ones. SH and SHM access only the tables pathtab and sftab; the other tables are not seen by SH/SHM.

The command used to read from the sfdb tables has to be specified in the configuration file of SHM. Two parameters are responsible for this interface: sfdb_command and sfdb_exec_qual. sfdb_command specifies the access command; for a local mysql installation this would be mysql sfdb, for a remote installation e.g. mysql -h szpc35 sfdb. The execution qualifiers go into sfdb_exec_qual; for mysql these are -B -e. On Sun/Solaris the interface is currently implemented with a small script, since mysql is not installed on a standard system; the settings for sfdb_command and sfdb_exec_qual would then be e.g. $SH_UTIL/sol_sql_call.csh szpc35 and <NULL>, respectively, if szpc35 is the database server.
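
Putting the two parameters together, the shell command launched by SH/SHM has essentially the following form (a sketch: the SELECT statement and the output file name stand in for the ones generated internally):

# sfdb_command ("mysql -h szpc35 sfdb") plus sfdb_exec_qual ("-B -e");
# query text and output file name are placeholders for the generated ones.
mysql -h szpc35 sfdb -B -e "SELECT pathid, relpath FROM sftab LIMIT 5;" > /tmp/sql_sfdb_example.000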

