Professional programmer; amateur home handyman (on our home only); tinkerer; husband; father of 3; attempting to be a renaissance guy (to know at least a little about a lot of subjects, a doomed pursuit in an information age); geek-arts-and-sciences enthusiast. Interest areas: Science fiction, wind turbines, electric cars, renewable energy, making things.
Thursday, August 28, 2014
SOLVED: diff 2 files, get results without markup
First, I create a file with all the files listed (here ${d} is a single log directory, graphdb_1j/rs-47-a):
$ fullLoglist="/tmp/full_loglist"
$ keepLoglist="/tmp/keep_loglist"
$ rmLoglist="/tmp/rm_loglist"
$ ls ${d}/rs*log* > ${fullLoglist}
$ cat ${fullLoglist}
graphdb_1j/rs-47-a/rs-47-a.log
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T13-25-13
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-46-46
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21
I only want to keep the first 3 files in the list, though, and to rm the rest. So, I extract the ones I want to keep:
$ cat ${fullLoglist} | head -3 > $keepLoglist
$ cat $keepLoglist
graphdb_1j/rs-47-a/rs-47-a.log
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T13-25-13
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-46-46
Now, I need to get a diff of the rest of them so I know what to call rm on. But how? If I just do a diff, I get:
$ diff /tmp/full_loglist /tmp/keep_loglist
4,8d3
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21
I DON'T WANT the extra markup. I want diff without markup. I try to google:
linux diff without markup
linux diff without greater-than less-than
linux diff line format markup
linux diff supress markup
linux diff only different lines
I try various options, like:
$ diff --line-format "%L" --suppress-common-lines /tmp/full_loglist /tmp/keep_loglist
graphdb_1j/rs-47-a/rs-47-a.log
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T13-25-13
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-46-46
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21
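This is WRONG: it prints all the lines, not just the different ones. Ug. (In GNU diff, --line-format formats every input line, common ones included, and --suppress-common-lines only applies to the side-by-side -y output.)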
SOLUTION ONE: comm -3
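I found two solutions. The first is to use comm with the -3 option, which suppresses the lines common to both files and prints only the ones that differ. comm expects both files to be sorted, which the ls output already is. Lines found only in the second file would come out indented in a second column, but the keep list is a subset of the full list, so there are none here. I hadn't ever heard of comm before, but it's nice: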
$ comm -3 /tmp/full_loglist /tmp/keep_loglist
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21
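To make clear what comm -3 computes, here is a minimal Python sketch of the same walk over two sorted lists. The function name comm3 and the shortened sample file names are made up for illustration, and this only approximates comm's behavior on input sorted in plain string order; it is not how comm itself is implemented.

```python
def comm3(left, right):
    """Return (only_in_left, only_in_right) for two sorted lists of lines."""
    only_left, only_right = [], []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            # Common line: this is what comm -3 suppresses.
            i += 1
            j += 1
        elif left[i] < right[j]:
            only_left.append(left[i])
            i += 1
        else:
            only_right.append(right[j])
            j += 1
    # Whatever remains once one list runs out is unique to the other.
    only_left.extend(left[i:])
    only_right.extend(right[j:])
    return only_left, only_right


# Hypothetical, shortened file names standing in for the real log list.
full = ["rs.log", "rs.log.1", "rs.log.2", "rs.log.3", "rs.log.4"]
keep = full[:3]
to_remove, unexpected = comm3(full, keep)
print(to_remove)    # what comm would print in column 1
print(unexpected)   # what comm would print in column 2 (empty here)
```

Because the keep list is a subset of the full list, the second result stays empty, which is why the comm output above has no indented lines.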
SOLUTION TWO: sort | uniq -u
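I don't care about the ordering of these things, so I can use the simple solution of sort and uniq -u. The uniq command normally collapses adjacent duplicate lines into one, which is why the input gets sorted first. With the -u option, it prints only the lines that occur once-and-only-once. Every file in the keep list shows up in both inputs, so only the files to remove survive.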
$ cat /tmp/full_loglist /tmp/keep_loglist | sort | uniq -u
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21
SOLVED! No >, no < signs with diff and no line number markup.
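For comparison, the once-and-only-once filter can be sketched in a few lines of Python. The function name once_only and the shortened file names are assumptions for illustration; it counts lines instead of sorting and comparing neighbors, so it is an equivalent of the pipeline's result, not of how uniq works internally.

```python
from collections import Counter


def once_only(*line_lists):
    """Return, sorted, the lines that occur exactly once across all inputs."""
    counts = Counter(line for lines in line_lists for line in lines)
    return sorted(line for line, n in counts.items() if n == 1)


# Hypothetical, shortened file names standing in for the real log list.
full = ["rs.log", "rs.log.1", "rs.log.2", "rs.log.3", "rs.log.4"]
keep = full[:3]
print(once_only(full, keep))
```

Like sort | uniq -u, this is symmetric: a line that appeared only in the keep list would be reported too, and a file listed twice in the full list would be silently dropped. It works here because the keep list is a subset of a duplicate-free full list.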
Script to implement this process, for your edification:
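#!/bin/bash
# Keep the first 3 log files (in ls order) in each rs-* directory; remove the rest.
cd /opt/storage || exit 1   # never run the rm below from the wrong directory
dirlist=$(ls -d graphdb_[1234][ij]/rs-*)
fullLoglist="/tmp/full_loglist"
keepLoglist="/tmp/keep_loglist"
rmLoglist="/tmp/rm_loglist"
for d in $dirlist
do
    echo "---------------------------"
    echo "dir: ${d}"
    # Every log file in this directory, one per line.
    ls ${d}/rs*log* > ${fullLoglist}
    ls -al ${fullLoglist}
    echo "full files: $(cat ${fullLoglist})"
    # The first 3 entries are the ones to keep.
    head -3 ${fullLoglist} > ${keepLoglist}
    echo "keep files: $(cat ${keepLoglist})"
    # Lines that occur only once are in the full list but not in the keep list.
    cat ${keepLoglist} ${fullLoglist} | sort | uniq -u > ${rmLoglist}
    echo "rm files: $(cat ${rmLoglist})"
    if [ -s ${rmLoglist} ]
    then
        echo "Removing files...."
        xargs rm -v < ${rmLoglist}
    else
        echo "No files to remove this time."
    fi
done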
Done.
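The same keep-the-first-three process can also be sketched in Python without the temporary list files. Everything named here (prune_logs, its keep, pattern, and dry_run parameters) is hypothetical; it mirrors the script's choice of keeping the first names in sorted order, and Python's plain string sort may differ from ls in locales with unusual collation.

```python
from pathlib import Path


def prune_logs(directory, keep=3, pattern="rs*log*", dry_run=True):
    """Keep the first `keep` matching files in name order; remove the rest."""
    files = sorted(
        (p for p in Path(directory).glob(pattern) if p.is_file()),
        key=lambda p: p.name,
    )
    doomed = files[keep:]   # everything after the first `keep` entries
    for path in doomed:
        if dry_run:
            print(f"would remove {path}")
        else:
            path.unlink()
            print(f"removed {path}")
    return doomed
```

Calling prune_logs("graphdb_1j/rs-47-a") only prints what would go; passing dry_run=False actually deletes. Slicing the sorted list gives the remove set directly, so no diff step is needed at all.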
Tuesday, August 26, 2014
SOLVED: Multiprocess updates of shared dict-of-dicts
I've run into my very first bug in Python. It's a known, existing bug, but it's still the first time I've actually encountered one spontaneously.
I should caveat that I occasionally forget things, and I've been using Python for 8 years, so it could be this has happened before but I've forgotten it.
The problem is as follows:
* I'm working in multiple processes, so I'm sharing a data structure across processes. It's a dict. So, I do
from multiprocessing import Manager
manager = Manager()
d = manager.dict()
and share the sucker around. Then, I do:
self.ddict.setdefault(mname, {})
#self.ddict[mname][ts] = val
self.log.error("updating mname: %s w/ ts: %s, val: %s" % (mname, ts, val))
self.ddict[mname].update({ts:val})
self.log.error("ddict %s, mname at %s" % (self.ddict, self.ddict[mname]))
output:
2014-08-26 11:20:57,221 MainThread ERROR updating mname: a.b.c.d w/ ts: 1409070057, val: 1.0
2014-08-26 11:20:57,221 MainThread ERROR ddict {'a.b.c.d': {}}, mname at {}
2014-08-26 11:20:57,222 MainThread ERROR ddict: {'a.b.c.d' {}}
According to http://bugs.python.org/issue6766, this is a known bug. I'm using Python 2.6 and cannot upgrade.
Damnit. Looking for a workaround.
SOLVED:
The problem is that the manager owns the shared data structure and is what makes updates visible to everyone using it. This means the manager has to know when the data structure is updated in one process, and it only finds out when the change goes through the shared object itself.
If you're using a dict-of-dicts (DOD), the inner dict just looks like a blob to the manager. Reading dod['a'] hands the process its own local copy of the inner dict, so changing that copy in place never informs the DOD of the change. To fix this, change the copy and then assign it back into the outer dict:
dod = Manager().dict()
process ONE: dod['a'] = { 'x' : 12 }   # process TWO sees this update of a = x:12
process TWO: dod['a']['x'] = 13        # process ONE has no idea this happened, manager doesn't bring in the update.
# to fix:
process TWO: y = dod['a']; y['x'] = 13; dod['a'] = y   # assigning back to dod['a'] informs manager that dod is different.
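The read, modify, assign-back fix can be tried out end to end with the minimal sketch below. It assumes Python 3 (the with Manager() form does not exist in 2.6), and the names set_nested and demo are made up for illustration. It runs in a single process for brevity, since the copy-on-read behavior of the proxy is the same no matter which process touches it.

```python
from multiprocessing import Manager


def set_nested(shared, outer_key, inner_key, value):
    """Update shared[outer_key][inner_key] so that the manager sees the change."""
    inner = shared[outer_key]   # a local copy of the inner dict
    inner[inner_key] = value    # changes only the local copy
    shared[outer_key] = inner   # assigning back goes through the proxy


def demo():
    with Manager() as manager:
        dod = manager.dict()
        dod["a"] = {"x": 12}

        dod["a"]["x"] = 13              # lost: modifies a temporary copy
        after_in_place = dod["a"]["x"]

        set_nested(dod, "a", "x", 13)   # kept: the inner dict is reassigned
        after_reassign = dod["a"]["x"]
        return after_in_place, after_reassign


if __name__ == "__main__":
    print(demo())   # (12, 13)
```

The read-modify-write in set_nested is not atomic, so if several processes may update the same outer key at once, it would need to be wrapped in a shared lock such as manager.Lock().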