Searching and Extracting Data from Files


Piping sends the output of one command to the input of another command. Redirection is similar but works with files rather than commands. Along with these we take the time to visit a little of the menagerie of file reporting tools that Linux supplies. We visit pipes first.

Piping


As mentioned in Linux Essentials objective 2.4, there are two types of pipes: unnamed and named. We mainly see unnamed pipes, but named pipes are commonly used between processes on your PC, with one application talking to another. To make use of an unnamed pipe we place the vertical bar between two commands, as shown below.

ls -l | wc -l

The command above is a command-line pipe: the output of ls is sent to the input of the command wc, which in this case counts the lines of output from ls. This is known as an unnamed pipe as it is created on the fly without the existence of a pipe file. This is convenient for us on the command line but not so convenient for applications that need to communicate with each other. This is where named pipes can be used: applications can create pipe files and connect to each other by means of these pipes. A pipe file never stores data; it marshals the data (controls the data movement) from one application to another. We can create our own named pipes using mkfifo (/usr/bin/mkfifo). These are known as named pipes as they are represented by files of type PIPE in the file system and as such have a name. As an example, the LastPass password manager uses a named pipe to communicate with Firefox on my Ubuntu system. This can be seen by using the find (/usr/bin/find) command to search for files of type p:

find /home -type p 2> /dev/null
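
As a minimal sketch of a named pipe in action, the following creates a pipe with mkfifo, passes one line through it and then locates it with find. The temporary directory from mktemp and the name demo.pipe are assumptions made for illustration only.

```shell
# Work in a throwaway directory so nothing else on the system is touched.
workdir=$(mktemp -d)
fifo="$workdir/demo.pipe"      # hypothetical pipe name

mkfifo "$fifo"

# The writer blocks until a reader opens the pipe, so run it in the background.
echo "hello through the pipe" > "$fifo" &

# The reader receives the data; nothing is stored in the pipe file itself.
received=$(cat "$fifo")
wait

# find reports the pipe as a file of type p.
found=$(find "$workdir" -type p)

echo "$received"
echo "$found"

rm -r "$workdir"
```

The writer and the reader here are two ordinary shell commands, but they could equally be two separate applications that agree on the path of the pipe.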


Redirection


Unlike piping, redirection takes the output of a command and sends it to a text file. Alternatively, a command may have a text file redirected to its input, using the file as input.

Each command has three channels that can be used for redirection:

◉ Standard Input : Channel 0
◉ Standard Output : Channel 1
◉ Error Output : Channel 2

We only need to use the channel number when redirecting error output. The symbol < indicates standard input when used without a number, and > represents standard output when used without a number. As such:

◉ cat < file1 : file1 is read into standard input for the command cat
◉ ls /etc > file1 : the standard output of ls is sent to file1, errors are shown on the screen and not redirected
◉ ls /etc 2> file1 : Standard output is shown to the screen but errors are written to file1
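
The three channels can be seen side by side in a rough sketch like the one below. The function report and the file names out.txt and err.txt are hypothetical and exist only to produce one line on each output channel.

```shell
workdir=$(mktemp -d)

# A hypothetical helper that writes one line to each output channel.
report() {
    echo "normal output"        # channel 1, standard output
    echo "error output" >&2     # channel 2, error output
}

# Send each channel to its own file.
report > "$workdir/out.txt" 2> "$workdir/err.txt"

# Read each file back through standard input (channel 0).
out=$(cat < "$workdir/out.txt")
err=$(cat < "$workdir/err.txt")

echo "stdout file: $out"
echo "stderr file: $err"

rm -r "$workdir"
```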

We can use the >> symbols to append to files, or to create the files if they do not exist. When using the single greater than symbol > we create the file, or overwrite the file if it exists. If you are concerned about overwriting existing files in error you may set the shell option noclobber. When set, new files can be created, but if the file exists normal operation will not permit you to overwrite it. Using >| allows the file to be overwritten. The noclobber option is usually set in a login script or from the command line.

set -o noclobber

The -o option sets the option to on

set +o noclobber

The option +o turns the option off. To view the current setting you can use the command

set -o

The above command will show all settings and their current state. Taking what we have learned about piping, we can pipe the output of set (a shell built-in command) to grep (/bin/grep), which can then search for the particular option we wish to view.

set -o | grep noclobber


Currently the option is disabled on my system, as we can see from the above graphic. If you want this permanently added to your environment, consider turning the option on in your personal login script: .bashrc in your home directory.
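
The behaviour of noclobber can be checked safely with a sketch such as the following. It works on a temporary file (the name notes.txt is an assumption) and switches the option off again when it finishes.

```shell
workdir=$(mktemp -d)
target="$workdir/notes.txt"     # hypothetical file name

echo "first" > "$target"        # creates the file

set -o noclobber

# With noclobber on, a plain > must not overwrite the existing file.
# The attempt runs in a subshell so the failure cannot stop the script.
if (echo "second" > "$target") 2> /dev/null; then
    blocked="no"
else
    blocked="yes"
fi

echo "third" >| "$target"       # >| overrides noclobber for this one command
echo "fourth" >> "$target"      # appending is still allowed

set +o noclobber

content=$(cat "$target")
echo "overwrite blocked: $blocked"
echo "$content"

rm -r "$workdir"
```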

Grep and Regular Expressions


The command grep (/bin/grep) is a simple tool that we can make use of both practically in everyday Linux usage and here in the course to help demonstrate regular expressions. To test regular expressions fully we may want to use egrep (/bin/egrep), or more simply grep -E, to allow for extended regular expression matches.

In the previous graphic we can see that we search for the text string noclobber in the output of the set command. We literally search for the string noclobber. We may think we are looking for the word noclobber, but computers think differently to us. Consider the following text file, test.txt, shown in the following graphic using the command cat (/bin/cat).


If we use grep to search for the string color, we return 3 matching lines that contain color:

1. no color
2. color
3. colored


The command grep will always return complete lines that match, but people are often surprised that the line colored is returned. We have not specified a search for words, so the string match does apply. If we need to search for the word color in the line then we can use the \b operator in the regular expression to mark word boundaries. The boundaries must surround the word, so we place \b on either side of it. We use the -E option with grep to allow for the extended regular expression looking for the word boundaries.

grep -E '\bcolor\b' test.txt

Using the above command we can view the output in the following graphic:
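
The difference between a string match and a word match can also be counted. This is only a sketch: the sample file below is an assumption built for the example and is not the author's test.txt, and the \b operator relies on GNU grep.

```shell
sample=$(mktemp)                # stand-in for test.txt, contents are assumed
printf '%s\n' 'no color' 'color' 'colored' 'colour' > "$sample"

# String match: every line containing the characters c-o-l-o-r.
string_hits=$(grep -c 'color' "$sample")

# Word match: color must be bounded on both sides, so colored drops out.
word_hits=$(grep -c -E '\bcolor\b' "$sample")

echo "string matches: $string_hits"   # 3
echo "word matches:   $word_hits"     # 2

rm "$sample"
```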


To search for color at the end of the line we could use the character $, designating the end of line marker. Note that we do not need the extended regular expressions with this search.

grep color$ test.txt


Reversing this a little, we could use the caret, ^, to search for lines beginning with color:

grep ^color test.txt


Note that the matched text in the results is again shown coloured; this is an option that is turned on in my version of grep with an alias.

Again we can use the word boundary with an extended search to exclude the additional line:

grep -E '^color\b' test.txt

We only use the one word boundary in this case as the word starts at the beginning of the line. We can see the output showing this in the following screen-shot:
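
The three anchored searches can be compared in one short sketch. As before, the sample file is an assumption made for the example rather than the author's test.txt.

```shell
sample=$(mktemp)                # stand-in for test.txt, contents are assumed
printf '%s\n' 'no color' 'color' 'colored' 'colour' > "$sample"

ends_with=$(grep -c 'color$' "$sample")        # no color, color
starts_with=$(grep -c '^color' "$sample")      # color, colored
whole_word=$(grep -E '^color\b' "$sample")     # color only

echo "lines ending with color:   $ends_with"   # 2
echo "lines starting with color: $starts_with" # 2
echo "starting with the word:    $whole_word"  # color

rm "$sample"
```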


You may need characters to be optional in the grep search with regular expressions. Using the ? character in a regular expression, we instruct grep to match zero or one of the immediately preceding character. In the following we look for occurrences of the word color or colour. Using the ? after the u we make it optional, testing for zero occurrences or one occurrence of the letter u.

grep -E '\bcolou?r\b' test.txt
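
A quick way to see the optional u at work is sketched below, again using an assumed sample file in place of test.txt.

```shell
sample=$(mktemp)                # stand-in for test.txt, contents are assumed
printf '%s\n' 'no color' 'color' 'colored' 'colour' 'coloured' > "$sample"

# The u may appear zero times or once; the word boundaries still
# exclude colored and coloured.
either_spelling=$(grep -c -E '\bcolou?r\b' "$sample")

echo "color or colour as a word: $either_spelling"   # 3

rm "$sample"
```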


Should we want to search for ranges of characters in the regular expression we can use square brackets. If we want to search for lines that begin with n or N then we could use the -i option with grep for a case insensitive search; alternatively:

grep '^[nN]' test.txt


But be careful, a misplaced ^ can easily reverse the search. A caret inside the brackets negates the set, so [^nN] matches any character other than n or N. We could also use the -v option to grep to invert the complete search. In the example, note that we use the ^ outside the brackets as before to denote that the line starts with what follows, and then the caret inside the brackets to denote not n or N. So the line must start with a character other than n or N.

grep '^[^nN]' test.txt
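
The bracket searches can be compared with -i and -v in a small sketch. The sample file is an assumption and deliberately includes one empty line, because that is where the negated bracket and -v give different answers.

```shell
sample=$(mktemp)                # stand-in for test.txt, contents are assumed
printf '%s\n' 'no color' 'Nothing here' '' 'color' 'colour' > "$sample"

bracket=$(grep -c '^[nN]' "$sample")       # starts with n or N
insensitive=$(grep -c -i '^n' "$sample")   # same lines, found with -i
negated=$(grep -c '^[^nN]' "$sample")      # starts with some other character
inverted=$(grep -c -v '^[nN]' "$sample")   # every line not starting with n or N

echo "$bracket $insensitive $negated $inverted"   # 2 2 2 3

rm "$sample"
```

The empty line has no first character at all, so it cannot match the negated bracket, but it is counted by -v because it does not start with n or N.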


Find

If you have not already found the command find (/usr/bin/find) then you will need to find it soon. We can use find in a similar way to ls. If used on its own, just the word find, find will list all files in the current directory and below. The behavior of find is to recurse automatically, listing subdirectory content. The output can be extensive especially if run higher up in the file-system. So we can run find with options to set criteria for the search, we can also control the recursion, limiting the depth of directories searched.

find -type d

The command above will list only directories (-type d) within the current directory and below.
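
As a small sketch of how the recursion can be limited, the following builds a throwaway directory tree (the names a, b and c are assumptions) and counts the directories found with and without -maxdepth.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/a/b/c"
touch "$workdir/file.txt" "$workdir/a/inner.txt"

# Full recursion: the starting directory, a, a/b and a/b/c.
all_dirs=$(find "$workdir" -type d | wc -l)

# One level only: the starting directory and a.
top_dirs=$(find "$workdir" -maxdepth 1 -type d | wc -l)

echo "all directories: $all_dirs"        # 4
echo "top level only:  $top_dirs"        # 2

rm -r "$workdir"
```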

find /var -maxdepth 1 -type d -perm /2000

The find syntax above will search the /var directory for directories. The -maxdepth option limits the search to 1 level down, so only /var itself and the entries directly below it are examined. The additional criterion, -perm, tests the file permissions, including the special group permission. The criteria are ANDed together by default; we could use -o to OR them together. We are then looking for directories directly below the /var directory that have the SGID bit set. Older versions of find accepted -perm +2000 for this test, but GNU find no longer supports that form, so use /2000.

The search for permissions can be written in octal as above, or symbolically as in the following form, which is the preferred method:

find /var -maxdepth 1 -type d -perm /g+s
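
The symbolic and octal forms can be checked against each other without touching /var. The sketch below assumes GNU find and uses two made-up directories, shared and plain, in a temporary location.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/shared" "$workdir/plain"
chmod g+s "$workdir/shared"     # set the SGID bit on one directory
chmod g-s "$workdir/plain"      # make sure the other one does not have it

# -mindepth 1 leaves the starting directory itself out of the results.
by_symbol=$(find "$workdir" -mindepth 1 -maxdepth 1 -type d -perm /g+s)
by_octal=$(find "$workdir" -mindepth 1 -maxdepth 1 -type d -perm /2000)

echo "$by_symbol"
echo "$by_octal"

rm -r "$workdir"
```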


To demonstrate the use of ORing criteria together, we could use find to search for files whose size is greater than 50M, -size +50M, or which have not been accessed in the last 30 days, -atime +30.

find $HOME/Downloads -size +50M -o -atime +30
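
A scaled-down sketch of the same idea is shown below. The file names and the 1k threshold are assumptions chosen to keep the test files small, and touch is used to push one access time back to the year 2000. Because -type f has been added here, the ORed tests are grouped in escaped parentheses; without them -type f would be ANDed with the -size test only.

```shell
workdir=$(mktemp -d)

echo "tiny" > "$workdir/small_recent"
echo "tiny" > "$workdir/small_old"
dd if=/dev/zero of="$workdir/big_recent" bs=1024 count=4 2> /dev/null

# Set the access time of one file to 1 January 2000.
touch -a -t 200001010000 "$workdir/small_old"

# Regular files that are larger than 1k OR not accessed in the last 30 days.
matches=$(find "$workdir" -type f \( -size +1k -o -atime +30 \) | sort)

echo "$matches"

rm -r "$workdir"
```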


The criteria that we can use in our searches include:
  • -type: file type being a value of:
    • f for regular files
    • l for symbolic links
    • d for directories
    • c for character devices
    • b for block devices
    • p for pipes
    • s for sockets
  • -perm: files with certain permissions
  • -atime: last accessed time
  • -mtime: last modified time
  • -size: file size
  • -inum: find files based on the inode number
  • And many more. The man page for find is very good with lots of examples
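
One of the less familiar criteria, -inum, can be tried out with a hard link, since a hard link is a second name for the same inode. This is only a sketch; the file names original and hardlink are assumptions.

```shell
workdir=$(mktemp -d)
echo "data" > "$workdir/original"
ln "$workdir/original" "$workdir/hardlink"    # second name, same inode

# ls -i prints the inode number in front of the file name.
inode=$(ls -i "$workdir/original" | awk '{print $1}')

# Both names are found because they share the inode number.
names=$(find "$workdir" -inum "$inode" | wc -l)

echo "names sharing inode $inode: $names"     # 2 names

rm -r "$workdir"
```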


Find also has actions; the default action for find is to print to the screen. Specifying it is optional, so the following two commands are the same, displaying symbolic links from the /etc directory down:

find /etc -type l -print
find /etc -type l

Another simple action is -delete. You are not prompted before deletion; the files meeting the criteria are simply deleted:

find $HOME/Documents/ -type f -atime +365 -delete

The Documents directory in the current user's home directory is searched for files that have not been accessed in the last 365 days, and those files are deleted.
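
Since -delete does not prompt, it can be worth previewing the matches with the default print action first. The sketch below does this in a temporary directory; the file names are assumptions and one access time is set back to the year 2000 with touch.

```shell
workdir=$(mktemp -d)
echo "keep" > "$workdir/recent.txt"
echo "stale" > "$workdir/stale.txt"
touch -a -t 200001010000 "$workdir/stale.txt"

# Preview: with no action given, find only prints what matches.
preview=$(find "$workdir" -type f -atime +365)

# The same criteria with -delete removes those files without prompting.
find "$workdir" -type f -atime +365 -delete

remaining=$(ls "$workdir")

echo "would delete: $preview"
echo "left behind:  $remaining"

rm -r "$workdir"
```
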
Very powerfully, we can use -exec or -ok to run any command against the found files. The -exec action will proceed without any prompts, whereas the -ok action will prompt for each file before taking any action.

find $HOME/Documents/ -type f -atime +365 -ok rm {} \;
find $HOME/Documents/ -type f -atime +365 -exec rm {} \;

The two commands above are similar: the first will prompt and the second will not. Both will run the action to delete (rm) the file named in the placeholder {}. Each file located is placed in the braces, awaiting its imminent expunging. Any command can be used in place of rm; this is just an example. The next example removes the execute permissions only from files and not from directories or links:

find $HOME/Documents/ -type f -exec chmod -x {} \;
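
The effect of restricting -exec to regular files can be seen in a short sketch. The names script.sh and subdir are assumptions, and everything happens in a temporary directory.

```shell
workdir=$(mktemp -d)
mkdir "$workdir/subdir"
echo "echo hi" > "$workdir/script.sh"
chmod +x "$workdir/script.sh"

# Only regular files are passed to chmod; the directory is left alone.
find "$workdir" -type f -exec chmod -x {} \;

if [ -x "$workdir/script.sh" ]; then file_exec="yes"; else file_exec="no"; fi
if [ -x "$workdir/subdir" ]; then dir_exec="yes"; else dir_exec="no"; fi

echo "file still executable:      $file_exec"   # no
echo "directory still executable: $dir_exec"    # yes

rm -r "$workdir"
```

Leaving the execute permission on directories matters, because without it the directory could no longer be entered.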

Heads or Tails?


If we need to view the top of a file we can use head (/usr/bin/head) and should we need to view the end of a file we can use tail (/usr/bin/tail). The following command will list the first two lines of the file test.txt.

head -n 2 test.txt

Using the similar command tail we can display the last two lines:

tail -n 2 test.txt
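
With a file whose lines are numbered, it is easy to see which lines each command returns. The five-line sample file here is an assumption made for the example.

```shell
sample=$(mktemp)                # stand-in for test.txt, contents are assumed
printf '%s\n' 'line 1' 'line 2' 'line 3' 'line 4' 'line 5' > "$sample"

first_two=$(head -n 2 "$sample")
last_two=$(tail -n 2 "$sample")

echo "$first_two"     # line 1 and line 2
echo "$last_two"      # line 4 and line 5

rm "$sample"
```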


When reading log files it is common to follow the end of the log. The -f option to tail displays the last 10 lines of the log and then continues to display new lines as they are written. Use Ctrl+C to stop following the file.

tail -f /var/log/syslog


Note: The log file used here is on Ubuntu; on other systems the more general log file is /var/log/messages.

Similarly we can use cat (/bin/cat) and tac (/usr/bin/tac): cat lists or concatenates the file from top to bottom and tac from bottom to top. If your focus is on the bottom of the file use cat, as you will be left at the bottom of the file. If your focus is on the top use tac, as you will be left at the top of the file.

Wc

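The command wc (/usr/bin/wc) can be used to count the lines, words and characters in a file. Used without options and with the filename as the argument, wc will show all three. Strictly, the -c option counts bytes, which is the same as characters for plain ASCII text; the -m option counts characters.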

wc test.txt
wc -l test.txt
wc -w test.txt
wc -c test.txt

The first command counts the lines, words and characters. The second counts just lines, the third just words and the fourth just characters. The output from wc -l is shown below; note that it shows both the line count and the file name.
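
The counts are easy to verify on a file with known contents. In this sketch the two-line sample is an assumption, and the file is fed to wc through standard input so that only the number is printed, without the file name.

```shell
sample=$(mktemp)                # stand-in for test.txt, contents are assumed
printf 'one two\nthree\n' > "$sample"

lines=$(wc -l < "$sample")      # 2 lines
words=$(wc -w < "$sample")      # 3 words
bytes=$(wc -c < "$sample")      # 14 bytes: 8 on the first line, 6 on the second

echo "lines=$lines words=$words bytes=$bytes"

rm "$sample"
```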


Cut

The command cut (/usr/bin/cut) can be useful where viewing every field in a file is not required. We may only want to see certain fields. We can also pipe the output of a command to cut. Suppose we only need the line count from wc and not the file name:

wc -l test.txt | cut -d " " -f1

With cut we use the -d option to say that the output is space delimited and the -f option to display only the first field.
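
The same pattern works with other delimiters. The sketch below repeats the wc example on an assumed three-line file and then picks fields out of a made-up colon-delimited record. The wc part assumes GNU wc, which prints the count without leading spaces.

```shell
sample=$(mktemp)                # stand-in for test.txt, contents are assumed
printf 'one\ntwo\nthree\n' > "$sample"

# Space delimited: keep the count, drop the file name.
count=$(wc -l "$sample" | cut -d " " -f1)

# Colon delimited: a made-up record with four fields.
record="alice:x:1000:/home/alice"
name=$(echo "$record" | cut -d ":" -f1)
home=$(echo "$record" | cut -d ":" -f4)

echo "count=$count name=$name home=$home"

rm "$sample"
```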


