CLI byte

read log files in portions by bytes

You may want to read a log file only for newly appended data. You could do that by remembering how many lines you read before and counting how many lines there are “now”. But with large log files this approach becomes very slow – try running “wc -l” on a file with tens of millions of lines and you’ll see how slowly it counts. And you would have to do that every time you read a new portion of data. What to do? Read the log in bytes instead. Here is how.

Say you run your script every minute to get new lines from some log file. First, you have to remember how much data you had previously and check how much data there is now. Then all you have to do is get the portion of data between “previous” and “now”. Just don’t forget log rotation! If “now” is smaller than “previous”, your log must have been rotated and you have to reset “previous” to zero.
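The bookkeeping above can be sketched in a few lines of bash – the /tmp file names here are just for the demo, and the appended line simulates data arriving between two runs:

```shell
#!/bin/bash
# Sketch of the byte-offset bookkeeping (demo paths, not the real script).

log=/tmp/demo.log
state=/tmp/demo.state

printf 'first line\n' > "$log"
wc -c < "$log" > "$state"            # pretend we already read everything so far

printf 'second line\n' >> "$log"     # new data arrives between runs

prev=$(cat "$state")
now=$(wc -c < "$log")

(( prev > now )) && prev=0           # file got smaller: it was rotated

# bytes prev+1 .. now are exactly the new portion
new_data=$(tail -c +$((prev+1)) "$log" | head -c $((now-prev)))
echo "$now" > "$state"               # remember "now" for the next run

echo "$new_data"
```

Running it prints only “second line” – the “first line” bytes are before the remembered offset, so they are skipped.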

Let’s first create a small script that will simulate writing to a log:

[root@linux ~]# cat random.sh
#!/bin/bash

now=$(date "+%Y-%m-%d %H:%M:%S")                                     # timestamp
random=$(tr -dc 'a-zA-Z0-9 ' < /dev/urandom | fold -w 64 | head -1)  # 64 random characters

echo "${now} ${random}"
[root@linux ~]# ./random.sh
2023-10-06 13:29:19 IKdlgDTNYxmVx2EN3kFsu0gLX4Y2rSFvzK1pZcLDk1AKGksIhtAnXYys48ORO9et
[root@linux ~]# ./random.sh
2023-10-06 13:29:20 RC1bOv6h01UEwXa0HFD33gCLyiLuKHG9sgIPu73ELkMKpyBU9xVIq AipE2ro7Lw
[root@linux ~]# ./random.sh
2023-10-06 13:29:21 MuD9OqJ9cRDxx0Zvr7Q7b6O57SWpDlMXLrd44u92YDHFQjScLPpqouN853TH83vp
[root@linux ~]# 

Now, let’s write our simple log reading script:

#!/bin/bash

my_log="${1}"

[[ "${my_log}" == "" || ! -f "${my_log}" ]] && exit 1

log_read=$(dirname "${0}")/.$(basename "${my_log}").read

# get current file size in bytes

current_size=$(wc -c < "${my_log}")

# remember how many bytes there are now, for the next read
# on the first run, the previous value is not known yet

[[ ! -f "${log_read}" ]] && echo "${current_size}" > "${log_read}"

bytes_read=$(cat "${log_read}")
echo "${current_size}" > "${log_read}"

# if rotated, let's read from the beginning

[[ ${bytes_read} -gt ${current_size} ]] && bytes_read=0

# get the portion

tail -c +$((bytes_read+1)) "${my_log}" | head -c $((current_size-bytes_read))

exit 0

You may wonder why “head” is combined with “tail” – it might seem excessive at first glance. However, it is needed if your file grows really quickly. Don’t forget our goal: we want to read the file portion by portion, with exactly unique (or no) data each time. If you used “tail” alone, it would take everything since the last byte you read, meaning that any lines appended to the log after the script measured the file size with “wc -c” would be displayed both this time and the next time the script runs.
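A quick way to see this race – again with hypothetical /tmp paths – is to append a line after the size has been measured and compare the two reads:

```shell
#!/bin/bash
# Demo of the race between measuring the size and reading the file.

log=/tmp/race.log
printf 'old\n' > "$log"

size=$(wc -c < "$log")          # the script measures the size here...
printf 'new\n' >> "$log"        # ...and a line arrives right after

# plain tail also grabs "new"; the next run, whose window starts
# at $size, would print "new" a second time:
without_head=$(tail -c +1 "$log")

# head caps the read at the measured size, so "new" waits for the next run:
with_head=$(tail -c +1 "$log" | head -c "$size")

echo "$without_head"
echo "$with_head"
```

With “head” the output stops at “old”; without it, “new” leaks into the current portion even though it will also be part of the next one.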

Let’s try it out! When you first run the script on a new log file, it will not write anything to output, as it doesn’t yet know which portion you want to read. You will just notice that it memorizes the current size of the file:

[root@linux ~]# ./random.sh >> /tmp/my.log
[root@linux ~]# cat /tmp/my.log
2023-11-28 14:32:18 4VyImPqmYHgJRRatl9Y0rGFJP98eABwJPY84lr Tc6jrY8ptvUtNJM0o981AZXAA
[root@linux ~]# ./read.sh /tmp/my.log
[root@linux ~]# cat .my.log.read
85
[root@linux ~]#

During the next run, however, it will output any new lines that appeared since the last run and again remember the current size of the file being read:

[root@linux ~]# ./random.sh >> /tmp/my.log
[root@linux ~]# ./random.sh >> /tmp/my.log
[root@linux ~]# ./random.sh >> /tmp/my.log
[root@linux ~]# ./read.sh /tmp/my.log
2023-11-28 14:36:38 fQbPkoEqJ5Nc58daypLMg31FfzbaEF217W9BiFimne5AbZ 4a5ipcOVnKVzOSW0i
2023-11-28 14:36:38 IqAIV9hkGemhhM0Elez3U8Nq6jRSlDBc68vXdBMBvY3eTL FMIXYiUwO6VD6UIUy
2023-11-28 14:36:39 p5crLu4sUitWdbSdkcD6kDQzBH9Cq6 x1XzuNXd7DkcCioXH3JlxssoNO3JxAUb5
[root@linux ~]# cat .my.log.read
340
[root@linux ~]#

You can also see that it works when the log is rotated:

[root@linux ~]# > /tmp/my.log
[root@linux ~]# ./random.sh >> /tmp/my.log
[root@linux ~]# ./read.sh /tmp/my.log
2023-11-28 14:42:50 WZ4Qh2m1VUgsTHTKz1lVr8nN7XnJl3FlVX k8G0lfYL44hPdF6MewbHVeCupdVrX
[root@linux ~]#

This is extremely fast even with large files – the “wc -l” bottleneck is gone!

The use cases for this script are extensive – you can use it to track any log file for desired patterns (say, errors) by having it run periodically by your monitoring system or a simple cron job.
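For example, a crontab entry along these lines (the paths and the pattern are assumptions – adjust them to your setup) would collect any freshly logged errors every minute; since grep prints nothing when there is no match, nothing is appended on quiet minutes:

```shell
# m h dom mon dow  command
* * * * * /root/read.sh /tmp/my.log | grep -i 'error' >> /root/my.log.errors
```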