Thursday, August 28, 2014

SOLVED: diff 2 files, get results without markup

So, I want to do a logrotate-style cleanup: remove old log files, keeping only the latest 3.  The script starts with:

$ fullLoglist="/tmp/full_loglist"
$ keepLoglist="/tmp/keep_loglist"
$ rmLoglist="/tmp/rm_loglist"

$ ls ${d}/rs*log* > ${fullLoglist}

 
First, I create a file with all the files listed:


$ cat ${fullLoglist}
graphdb_1j/rs-47-a/rs-47-a.log
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T13-25-13
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-46-46
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21


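I only want to keep the top 3 entries of this list and rm the rest.  (Watch the ordering: in the list above, the top 3 are the live log plus the two oldest rotations.  To keep the most recent files instead, list them newest-first with ls -t.)  So, I extract the ones I want to keep: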

$ cat ${fullLoglist} | head -3 > $keepLoglist
$ cat $keepLoglist
graphdb_1j/rs-47-a/rs-47-a.log
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T13-25-13
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-46-46



Now, I need to get a diff of the rest of them so I know what to call rm on.  But how?  If I just do a diff, I get:

$ diff /tmp/full_loglist /tmp/keep_loglist
4,8d3
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21



I DON'T WANT the extra markup.  I want diff without markup.  I try to google:

linux diff without markup
linux diff without greater-than less-than
linux diff line format markup
linux diff suppress markup
linux diff only different lines

I try various options, like:

$ diff --line-format "%L" --suppress-common-lines /tmp/full_loglist /tmp/keep_loglist
graphdb_1j/rs-47-a/rs-47-a.log
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T13-25-13
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-46-46
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21

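This is WRONG, it prints all the lines, not just the different ones. Ugh.  (The --line-format option applies to every line, common ones included, and --suppress-common-lines only affects side-by-side output.)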
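
For what it's worth, GNU diff can apparently be coaxed into this by giving each kind of line its own format and leaving the unwanted ones empty.  A minimal sketch, assuming GNU diffutils; the short file names are stand-ins I made up, not the real log paths:

```shell
# Stand-in data (hypothetical names, not the real log paths).
tmpdir=$(mktemp -d)
printf '%s\n' a.log a.log.1 a.log.2 a.log.3 a.log.4 > "$tmpdir/full"
head -3 "$tmpdir/full" > "$tmpdir/keep"

# Print only the lines that are in the first file but not the second:
# old lines are printed as-is, new and unchanged lines print nothing.
# diff exits with status 1 when the files differ, hence the "|| true".
only_in_full=$(diff --old-line-format='%L' \
                    --new-line-format='' \
                    --unchanged-line-format='' \
                    "$tmpdir/full" "$tmpdir/keep" || true)

printf '%s\n' "$only_in_full"
# prints:
# a.log.3
# a.log.4

rm -rf "$tmpdir"
```

These three format options are GNU extensions, so this probably won't work with other diff implementations.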

SOLUTION ONE:  comm -3

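I found two solutions.  The first is to use comm with the -3 option, which suppresses the lines common to both files and prints just the different ones.  (comm expects both files to be sorted, which this list already is.  Lines found only in the second file would show up indented in a second column, but there are none here because the keep list is a subset of the full list.)  I hadn't ever heard of comm before, but it's nice: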

$ comm -3 /tmp/full_loglist /tmp/keep_loglist
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21
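
If the lists are not already sorted, or if only the lines from the first file are wanted, something like the following should work.  This is a sketch with made-up file names: it sorts both inputs first and adds -2 so that lines unique to the second file are dropped too.

```shell
# Stand-in data (hypothetical names), deliberately out of order.
tmpdir=$(mktemp -d)
printf '%s\n' b.log a.log d.log c.log > "$tmpdir/full"
printf '%s\n' b.log a.log > "$tmpdir/keep"

# comm needs sorted input, so sort both lists first.
sort "$tmpdir/full" > "$tmpdir/full.sorted"
sort "$tmpdir/keep" > "$tmpdir/keep.sorted"

# -2 drops lines unique to the second file, -3 drops lines common to both,
# leaving only the lines unique to the first file.
to_remove=$(comm -23 "$tmpdir/full.sorted" "$tmpdir/keep.sorted")

printf '%s\n' "$to_remove"
# prints:
# c.log
# d.log

rm -rf "$tmpdir"
```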

SOLUTION TWO:  sort | uniq -u

I don't care about the ordering of these things.  So, I can use the simple solution of sort and uniq -u.  The uniq command normally collapses adjacent duplicate lines down to one copy, which is why the input gets sorted first.  But with the -u option, it prints only the lines that occur once-and-only-once.

$ cat /tmp/full_loglist /tmp/keep_loglist | sort  | uniq -u
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21
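
One thing to keep in mind: sort | uniq -u is symmetric, so it reports lines that appear in either file alone.  That is fine here because the keep list is always a subset of the full list.  A small sketch with made-up names (stray.log is a hypothetical entry that exists only in the keep list) shows what happens otherwise:

```shell
# Stand-in data (hypothetical names); stray.log is only in the keep list.
tmpdir=$(mktemp -d)
printf '%s\n' a.log b.log c.log > "$tmpdir/full"
printf '%s\n' a.log b.log stray.log > "$tmpdir/keep"

# Lines present in both files appear twice after sorting and are dropped;
# lines present in only one file, whichever one, are printed.
once_only=$(cat "$tmpdir/full" "$tmpdir/keep" | sort | uniq -u)

printf '%s\n' "$once_only"
# prints:
# c.log
# stray.log

rm -rf "$tmpdir"
```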


SOLVED!   No >, no < signs with diff and no line number markup.


Script to implement this process, for your edification:

#!/bin/bash
# For each rs-* directory, keep the first 3 entries of the log listing
# and remove the rest.

cd /opt/storage || exit 1
dirlist=$(ls -d graphdb_[1234][ij]/rs-*)
fullLoglist="/tmp/full_loglist"
keepLoglist="/tmp/keep_loglist"
rmLoglist="/tmp/rm_loglist"

for d in $dirlist
do
  echo "---------------------------"
  echo "dir: ${d}"
  # ls sorts by name: the live log first, then rotations oldest to newest
  ls "${d}"/rs*log* > "${fullLoglist}"
  ls -al "${fullLoglist}"
  echo "full files: $(cat "${fullLoglist}")"
  head -3 "${fullLoglist}" > "${keepLoglist}"
  echo "keep files: $(cat "${keepLoglist}")"
  # lines that appear in only one of the two lists are the ones to remove
  cat "${keepLoglist}" "${fullLoglist}" | sort | uniq -u > "${rmLoglist}"
  echo "rm   files: $(cat "${rmLoglist}")"
  if [ -s "${rmLoglist}" ]
  then
    echo "Removing files...."
    xargs rm -v < "${rmLoglist}"
  else
    echo "No files to remove this time."
  fi
done


Done.








Tuesday, August 26, 2014

SOLVED: Multiprocess updates of shared dict-of-dicts

Earlier, I posted the following:

I've run into my very first bug in Python.  It's a known, existing bug, but it's still the first time I've actually encountered one spontaneously.

I should caveat that I occasionally forget things, and I've been using Python for 8 years, so it could be that this has happened before and I've forgotten it.

The problem is as follows:

* I'm working in multiple processes, so I'm sharing a data structure across processes.  It's a dict.  So, I do

from multiprocessing import Manager
manager = Manager()
d = manager.dict()

and share the sucker around.  Then, I do:

self.ddict.setdefault(mname, {})
#self.ddict[mname][ts] = val
self.log.error("updating mname: %s w/ ts: %s, val: %s" % (mname, ts, val))
self.ddict[mname].update({ts:val})
self.log.error("ddict %s, mname at %s" % (self.ddict, self.ddict[mname]))


output: 

2014-08-26 11:20:57,221 MainThread ERROR updating mname: a.b.c.d w/ ts: 1409070057, val: 1.0
2014-08-26 11:20:57,221 MainThread ERROR ddict {'a.b.c.d': {}}, mname at {}
2014-08-26 11:20:57,222 MainThread ERROR ddict: {'a.b.c.d' {}}


According to http://bugs.python.org/issue6766, this is a known bug.  I'm using Python 2.6 and cannot upgrade.

Damnit.  Looking for a workaround.


SOLVED:

The problem is that the manager runs a separate server process that owns the shared data structure, and every other process talks to it through a proxy object. The manager only learns about a change when that change is made through the proxy.

If you're using a dict-of-dicts (DOD), reading dod['a'] hands process A a plain local copy of the inner dict. Changing that copy directly never goes through the proxy, so the DOD is never informed of the change. So, to fix this, do:

from multiprocessing import Manager

manager = Manager()
dod = manager.dict()
# process ONE:
dod['a'] = {'x': 12}
# process TWO sees this update of a = x:12
# process TWO:
dod['a']['x'] = 13
# this only changes a local copy of the inner dict; the manager never sees it,
# so process ONE still reads x:12.
# to fix, in process TWO:
y = dod['a']
y['x'] = 13
dod['a'] = y
# assigning through the proxy informs the manager that dod['a'] is different.