Manual Reference Pages  - extract (1)

NAME

extract - extract character ranges or tokens from text files.

CONTENTS

Synopsis
Description
Options
Examples
See Also
License
Copyright
Acknowledgements
Authors

SYNOPSIS

extract [ -h -? -help --help --? ]
extract [options...] <inputfile >outputfile

DESCRIPTION

extract reads a text file from stdin, extracts a range of rows and columns (character positions), and sends them to stdout. It can also select tokens instead of character columns, and it can remove the selected range instead of emitting it.
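The selection described above can be sketched in Python. This is only an illustration of the documented behavior, not the program's source: the function name extract_chars and its keyword arguments are hypothetical stand-ins for -sr, -er, -sc, -ec and -rm, and the handling of lines shorter than the requested range is an assumption.

```python
def extract_chars(lines, sr=1, er=None, sc=1, ec=-1, remove=False):
    """Select (or remove) columns sc..ec of rows sr..er, all 1-based and inclusive."""
    out = []
    for row_number, line in enumerate(lines, start=1):
        if row_number < sr or (er is not None and row_number > er):
            continue  # rows outside the range are not emitted (no -all)
        length = len(line)
        # Negative positions count from the end of the line; -1 is the last column.
        start = sc if sc > 0 else length + sc + 1
        end = ec if ec > 0 else length + ec + 1
        start = max(start, 1)
        end = max(end, start - 1)  # assumption: an unsatisfiable range selects nothing
        if remove:
            out.append(line[:start - 1] + line[end:])
        else:
            out.append(line[start - 1:end])
    return out


if __name__ == "__main__":
    rows = ["abcdefghij", "0123456789", "short"]
    print(extract_chars(rows, sr=2, sc=5, ec=10))
    print(extract_chars(rows, er=1, sc=2, ec=3, remove=True))
```

The first call mirrors a command such as extract -sr 2 -sc 5 -ec 10, the second mirrors -er 1 -sc 2 -ec 3 -rm.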

extract is much simpler to use than awk or perl and is sufficient for most column/row extraction tasks.

extract may be obtained from:

Use of extract is subject to the License terms.

OPTIONS

-all
  Emit unprocessed the text rows outside of the range specified with -sr,-er,-nr. (Default is not to emit these rows.)
-bs
  Add backslashes (unix escape characters) before any character other than an alphabetic character, a digit, underscore, period, or slash. Note that this applies only within a field. For instance, if the program is running in token mode, a token range [1,3] would apply the backslashes to characters within each token but not to the delimiters between tokens. To work around that limitation use [dv\\:1,3]. (Default is not to add backslashes.)
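As a rough illustration of the -bs rule, the following Python sketch escapes every character outside the safe set listed above. The function name add_backslashes is hypothetical, and treating "alphabetic" as ASCII letters only is an assumption.

```python
import re


def add_backslashes(field):
    """Put a backslash before each character that is not a letter, digit, '_', '.', or '/'."""
    # Assumption: only ASCII letters and digits count as safe characters.
    return re.sub(r"([^A-Za-z0-9_./])", r"\\\1", field)


if __name__ == "__main__":
    print(add_backslashes("my file (1).txt"))
```

The function works on a single field, which matches the note that delimiters between tokens are not escaped.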
-cols format
  Specify the format of the output line in detail. Without -cols, the other command line options single out one column and are applied to it (subject to the logical changes indicated by -rm or -ins). With -cols, the other command line options supply the default values for all column fields, and multiple column fields (indicated by [] brackets within format) may be specified.
  Static strings may be placed between column fields. They may contain any symbol and escaped characters (\char), and may use [[ and ]] to represent [ and ] (which would otherwise be interpreted as the limits of a column field). Within a column field, a colon-separated (:) set of options is allowed. The characters [ and ] are not allowed within a column field, but all other characters are, and escapes may be used to include colons.
  Arbitrary combinations of static strings and column fields may be employed, freely mixing token and character mode columns and emitting columns in any order, including emitting a single column multiple times. Typically format must be quoted or escaped on the command line so that the shell does not mangle it before passing it to the program. The options for a column field are:

    + set_as = match command line specifications
    p default = match program defaults (overrides -pd,-lj,-uc,etc.)
    - disable = disable options
    If employed as a single character it applies to all settings and must be the first option within a column field. As a suffix these may be applied singly to each of the -cols options.

    mt/mc/m-/mp/m+ token mode/character mode/disable/default/set_as. Also sets the delimit state in some instances to match the command line, but this may be overridden again by a subsequent :d*: clause in the same column field. (overrides -mt/-mc)

    jl/jr/jc/j-/jp/j+ justify left/right/center/disable/default/set_as (overrides -j*)

    cu/cl/c-/cp/c+ case upper/lower/disable/default/set_as (overrides -c*)

    bs/c-/cp/c+ backslashes apply(as needed)/disable/default/set_as (overrides -c*)

    dt/dvN/d-/dp/d+ emit delimit from token/with char N/disable/default/set_as. Restriction: the delimit character N must be escaped if it is a colon or a backslash, i.e., \: and \\. (overrides -d*)

    pd###/pd-/pdp/pd+ pad with ### spaces/disable/default/set_as (overrides -pd or -fw)

    fw###/fw-/fwp/fwd field width to ### spaces/disable/default/set_as (overrides -pd or -fw)

    rsSTR/rs-/rsp/rsd replacement string is STR/disable/default/set_as. Restriction: STR may not contain a colon. (overrides -rs)

    [c] [s,e] [s,] [,e] range values for single column, column range (start,end), open ended(start ,or range), and tail (offset, count) ranges. The single range for each column field is employed instead of that specified by -sc. The range values must be the final option in a column field. Both the s and e values may be positive or negative. If positive, they are column/token positions measured from the front of the line. If negative, they are column/token positions measured from the end of the line. Mixing modes like [10,-10] is possible but can generate fatal errors if the line is too short or has too few tokens to satisfy the range.
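The rule for positive and negative range values can be sketched as below. The helper name resolve_range is hypothetical, and the ValueError merely stands in for the fatal error mentioned above; the exact conditions under which the program aborts are an assumption.

```python
def resolve_range(start, end, length):
    """Convert a [start,end] pair into absolute 1-based positions for a line of the given length."""

    def absolute(position):
        # Positive values count from the front; negative values count from the end (-1 is last).
        return position if position > 0 else length + position + 1

    first, last = absolute(start), absolute(end)
    if first < 1 or last > length or first > last:
        # Stand-in for the fatal error raised when the line cannot satisfy the range.
        raise ValueError(f"range [{start},{end}] cannot be satisfied by {length} positions")
    return first, last


if __name__ == "__main__":
    print(resolve_range(10, -10, 30))  # a 30-column line satisfies [10,-10]
    print(resolve_range(-3, -1, 8))    # the last three positions of an 8-column line
```

With a 15-column line the same [10,-10] range resolves to start 10 and end 6, which is the kind of mixed-mode failure the text warns about.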

-cu -cl
  In selected characters/tokens change case to upper or lower. (Default is to leave case unmodified.)
-dbg
  Emit state and parsing information as each input line is processed. (Default is to not emit this information.)
-dl delimiter_string
  Change the delimiters used to define tokens. Typically delimiter_string must be quoted or escaped on the command line so that the shell does not interpret it. Use \t for tab, \\ for \, and \19 for the character with value 19. (Default string is space,colon,tab.)
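The escape forms accepted in delimiter_string might be decoded along these lines. This is a sketch only: the function name decode_delimiters is hypothetical, and reading the digits after a backslash as a decimal character code of any length is an assumption based on the \19 example.

```python
import re


def decode_delimiters(spec):
    """Expand \\t, \\\\ and \\NN escapes in a delimiter specification."""

    def replace(match):
        digits, char = match.groups()
        if digits is not None:
            return chr(int(digits))  # assumption: the value is decimal
        return "\t" if char == "t" else char  # \t is tab; any other escaped char is itself

    return re.sub(r"\\(?:(\d+)|(.))", replace, spec, flags=re.ASCII)


if __name__ == "__main__":
    print(repr(decode_delimiters(r":,;")))
    print(repr(decode_delimiters(r"\t\\\19")))
```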
-dt
  When a token is emitted followed by a delimiter, use the delimiter that ended that token in the input. (Default.) See also -d- and -dv.
-dq -dqs
  While parsing tokens ignore delimiters within double quotes. -dq returns the token with the surrounding double quotes, -dqs returns the token without the quotes. (Default is to recognize delimiters no matter where they occur.)
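A minimal sketch of quote-aware token parsing, assuming the default delimiters (space, colon, tab) and the default collapsing of delimiter runs. The function name split_tokens and the strip_quotes flag (standing in for -dqs versus -dq) are hypothetical.

```python
def split_tokens(line, delimiters=" :\t", strip_quotes=False):
    """Split a line into tokens, ignoring delimiters that appear inside double quotes."""
    tokens, current, in_quotes = [], [], False
    for ch in line:
        if ch == '"':
            in_quotes = not in_quotes
            if not strip_quotes:
                current.append(ch)  # -dq keeps the surrounding quotes
        elif ch in delimiters and not in_quotes:
            if current:  # a run of delimiters ends at most one token
                tokens.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


if __name__ == "__main__":
    print(split_tokens('a "b c:d" e'))
    print(split_tokens('a "b c:d" e', strip_quotes=True))
```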
-dv delimit_character
  When a token is emitted followed by a delimiter, use delimit_character as that delimiter. (Default is -dt.)
-d-
  Do not emit a delimiter following a token. This is most often used in combination with the -s, -pd/fw, and -j* switches. (Default is -dt, see also -dv).
-ec end_column
  The last character column to select. (Defaults to -1, the last column.)
-er end_row
  The last text row to process. (Defaults to the final row in the file.)
-fw number_of_characters
  Specifies in number_of_characters the field width. The input field is either padded or truncated as required. When fields are processed they are padded, then justified, then the character cases adjusted. See also -pd. (Default is 0 - no change to field sizes.)
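The stated processing order (pad, then justify, then adjust case) can be illustrated with a short Python sketch. The function name format_field and its arguments are hypothetical, and the way justification redistributes the spaces inside the sized field is an assumption.

```python
def format_field(text, fw=0, pd=0, justify=None, case=None):
    """Size a field (-fw or -pd), then justify it (-jl/-jr/-jc), then change case (-cu/-cl)."""
    # Step 1: size the field. -fw pads or truncates; -pd appends spaces on the right.
    if fw > 0:
        text = text[:fw].ljust(fw)
    elif pd > 0:
        text = text + " " * pd
    # Step 2: justify within the current width (assumed behavior).
    if justify in ("l", "r", "c"):
        width = len(text)
        core = text.strip(" ")
        text = {"l": core.ljust, "r": core.rjust, "c": core.center}[justify](width)
    # Step 3: adjust case last.
    if case == "u":
        text = text.upper()
    elif case == "l":
        text = text.lower()
    return text


if __name__ == "__main__":
    print(repr(format_field("abc", fw=6, justify="r", case="u")))
    print(repr(format_field("abcdefgh", fw=4)))
```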
-h -help --help -? --?
  Print the help message. (Default - do not print help message.)
-i
  Emit version, copyright, license and contact information. (Default - do not emit information.)
-in input_file
  Read input from the specified file. (Default is to read from stdin.)
-is
  Modify the indicated character or token range in situ and emit it together with the unmodified surrounding region. This option may not be used with -rm or -cols. (Default is to emit only the selected character/token range.)
-jl -jc -jr
  Justify field left, center, or right. (Default is to not change justification.)
-mc
  Process lines as character columns. See also -mt. (Default.)
-mt
  Process lines as tokens. In this mode -sc, -ec, and -nc values refer to token numbers. (Default is character columns = -mc.)

If a single token is emitted then no delimiter is emitted with it. However, two or more tokens are emitted as:
    token1 delim1 token2 delim2 token3 etc. tokenN
where delim1 is the first delimiter following token1. When -s is also used delim1 will be the only delimiter after token1 but if -s is not specified there may be other delimiters after delim1 and these will not be emitted. The last token emitted is not followed by a delimiter.
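The emission pattern above, together with the -dt, -dv and -d- choices, can be sketched as follows. Both function names are hypothetical, and the sketch assumes the default behavior in which a run of delimiters separates two tokens (no -s).

```python
def tokenize_with_delims(line, delimiters=" :\t"):
    """Return (token, first_delimiter_after_token) pairs for one line."""
    pairs, token, pending = [], "", None
    for ch in line:
        if ch in delimiters:
            if token and pending is None:
                pending = ch  # remember only the first delimiter after the token
        else:
            if pending is not None:
                pairs.append((token, pending))
                token, pending = "", None
            token += ch
    if token:
        pairs.append((token, pending or ""))
    return pairs


def emit_tokens(pairs, sc, ec, delimit="dt", dv=" "):
    """Emit tokens sc..ec (1-based); the last token is never followed by a delimiter."""
    selected = pairs[sc - 1:ec]
    out = ""
    for index, (token, delim) in enumerate(selected):
        out += token
        if index < len(selected) - 1:
            if delimit == "dt":
                out += delim  # delimiter taken from the input
            elif delimit == "dv":
                out += dv     # fixed delimiter character
            # "d-": emit no delimiter
    return out


if __name__ == "__main__":
    pairs = tokenize_with_delims("a: b\tc")
    print(repr(emit_tokens(pairs, 1, 3)))
    print(repr(emit_tokens(pairs, 1, 3, delimit="dv", dv=",")))
```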
-nc number_of_columns
  Number of columns to select. Do not specify both -nc and -ec.
-nr number_of_rows
  Number of text rows to process starting from sr. Do not specify both -nr and -er.
-out output_file
  Write output to the specified file. (Default is to write to stdout.)
-pd number_of_characters
  Specifies the number_of_characters (spaces) to be added to the right side of the field. When fields are processed they are padded, then justified, then the character cases adjusted. See also -fw. (Default is 0 - no padding.)
-rm
  Remove the selected character columns/tokens instead of emitting them. This option may not be used with -is or -cols. (Default is to emit only the selected character/token range.)
-rs replacement_string
  replacement_string substitutes for empty fields. Typically employed to insert NA or 0 in a tab delimited file which left unspecified values as empty fields. (Default is to leave empty fields empty.)
-s
  Emit a token for each delimiter encountered. When -s is specified tokens may consist of empty strings. This mode is for use with delimited data as from a spreadsheet. (Default is to emit one token for each run of delimiters.)
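The difference between the default parsing and -s, and the effect of -rs on the resulting empty fields, can be sketched as below. The function name split_fields and its parameters are hypothetical; how the program treats a delimiter at the very end of a line is an assumption.

```python
def split_fields(line, delimiters="\t", strict=False, replacement=None):
    """Split on delimiters; strict=True keeps empty fields (-s), replacement fills them (-rs)."""
    fields, current = [], ""
    for ch in line:
        if ch in delimiters:
            fields.append(current)
            current = ""
        else:
            current += ch
    fields.append(current)
    if not strict:
        # Default: a run of delimiters separates two tokens, so no empty tokens survive.
        return [field for field in fields if field]
    if replacement is not None:
        fields = [field if field else replacement for field in fields]
    return fields


if __name__ == "__main__":
    row = "id1\t\t42"
    print(split_fields(row))
    print(split_fields(row, strict=True))
    print(split_fields(row, strict=True, replacement="NA"))
```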
-sc start_column
  The first character column to select. Columns are numbered from 1. Negative values are allowed and represent columns measured from the end of the line, where -1 is the last column. (Default start_column=1.)
-sr start_row
  The first text row (line of text) to process. Rows are numbered from 1. (Default start_row=1.)
-wl widest_line
  Widest input line in characters. (Default widest_line=16000.)
-xc maXimum_Columns
  Maximum number of column fields ([] in -cols) and/or tokens that may be referenced. (Default maXimum_columns=8192.)

EXAMPLES

% extract
% extract -h
% extract -sc 50 <infile.txt >outfile.txt
% extract -sr 4 -sc 5 -ec 10 <infile.txt >outfile.txt
% extract -sc 5 -nc 10 <infile.txt >outfile.txt
% extract -sc 2 -ec 3 -mt -dl ':,;' <infile.txt >outfile.txt
% extract -sr 4 -er 40 -sc 2 -ec 3 -mt -dl ':,;' -s -all -rm <infile.txt >outfile.txt
% cd / ; du -k | extract -cols '[jr:fw14:1] [2]' -mt
% ls -al | extract -cols '[ch:1,32][fw14:jr:5] [6] [fw2:7] [jr:fw5:8] [9]' -mt -dl ' '
% extract -cols 'foo[cu:lj:fw20:3,5]blah[-:ch:10,30]er[1]' -mt -fw30 <infile.txt

SEE ALSO

none

LICENSE

You may run this program on any platform. You may redistribute the source code of this program subject to the condition that you do not first modify it in any way. You may distribute binary versions of this program so long as they were compiled from unmodified source code. There is no charge for using this software. You may not charge others for the use of this software.

COPYRIGHT

Copyright (C) 2002 David Mathog and Caltech.

ACKNOWLEDGEMENTS

This program was inspired by Pat Rankin's EXTRACT utility for VMS.

AUTHORS

David Mathog, Biology Division, Caltech <mathog@caltech.edu>


extract (1) 21 Feb 2002