Professional programmer; amateur home handyman (on our home only); tinkerer; husband; father of 3; attempting to be a renaissance guy (to know at least a little about a lot of subjects, a doomed pursuit in an information age); geek-arts-and-sciences enthusiast. Interest areas: Science fiction, wind turbines, electric cars, renewable energy, making things.
Thursday, August 28, 2014
SOLVED: diff 2 files, get results without markup
First, I set up some working file names and create a file with all the log files listed:
$ fullLoglist="/tmp/full_loglist"
$ keepLoglist="/tmp/keep_loglist"
$ rmLoglist="/tmp/rm_loglist"
$ ls ${d}/rs*log* > ${fullLoglist}
$ cat ${fullLoglist}
graphdb_1j/rs-47-a/rs-47-a.log
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T13-25-13
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-46-46
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21
I only want the top 3 files, though, and to rm the older ones. So, I extract the ones I want to keep:
$ cat ${fullLoglist} | head -3 > $keepLoglist
$ cat $keepLoglist
graphdb_1j/rs-47-a/rs-47-a.log
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T13-25-13
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-46-46
Now, I need to get a diff of the rest of them so I know what to call rm on. But how? If I just do a diff, I get:
$ diff /tmp/full_loglist /tmp/keep_loglist
4,8d3
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21
I DON'T WANT the extra markup. I want diff without markup. I try to google:
linux diff without markup
linux diff without greater-than less-than
linux diff line format markup
linux diff supress markup
linux diff only different lines
I try various options, like:
$ diff --line-format "%L" --suppress-common-lines /tmp/full_loglist /tmp/keep_loglist
graphdb_1j/rs-47-a/rs-47-a.log
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T13-25-13
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-46-46
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21
This is WRONG: it prints all the lines, not just the different ones. Ugh. (It turns out --suppress-common-lines only applies to side-by-side output, and --line-format "%L" prints every line.)
SOLUTION ONE: comm -3
I found two solutions. The first is to use comm with the -3 option, which suppresses the lines common to both files and prints only the lines unique to each. (Note that comm expects its inputs to be sorted.) I hadn't ever heard of comm before, but it's nice:
$ comm -3 /tmp/full_loglist /tmp/keep_loglist
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21
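One caveat worth spelling out: comm's output is columnar - lines unique to the first file in column 1, lines unique to the second file indented in column 2, common lines in column 3 - and -3 suppresses the common column. Since my keep-list is a strict subset of the full list, only column 1 survives. A tiny demo with made-up file names:

```shell
printf 'a\nb\nc\nd\n' > /tmp/all_demo
printf 'a\nb\n'       > /tmp/keep_demo
comm -3 /tmp/all_demo /tmp/keep_demo
# keep_demo is a subset, so only column-1 lines remain:
# c
# d
```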
SOLUTION TWO: sort | uniq -u
I don't care about the ordering of these things. So, I can use the simple solution of sort and uniq -u. The uniq command normally removes adjacent duplicate lines (which is why the sort comes first). Its -u option goes further: it prints only lines that occur once-and-only-once.
$ cat /tmp/full_loglist /tmp/keep_loglist | sort | uniq -u
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21
SOLVED! No > or < signs, and no line-number markup from diff.
Script to implement this process, for your edification:
#!/bin/bash
cd /opt/storage
dirlist=`ls -d graphdb_[1234][ij]/rs-*`
fullLoglist="/tmp/full_loglist"
keepLoglist="/tmp/keep_loglist"
rmLoglist="/tmp/rm_loglist"
for d in $dirlist
do
    echo "---------------------------"
    echo "dir: ${d}"
    ls ${d}/rs*log* > ${fullLoglist}
    ls -al ${fullLoglist}
    echo "full files: `cat ${fullLoglist}`"
    cat ${fullLoglist} | head -3 > $keepLoglist
    echo "keep files: `cat ${keepLoglist}`"
    cat $keepLoglist $fullLoglist | sort | uniq -u > $rmLoglist
    echo "rm files: `cat ${rmLoglist}`"
    if [ -s $rmLoglist ]
    then
        echo "Removing files...."
        cat $rmLoglist | xargs rm -v
    else
        echo "No files to remove this time."
    fi
done
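As an aside (my variation, not part of the original script): the keep-3/remove-the-rest split can also be done in one pipeline with tail -n +4, which emits everything from the 4th line onward; with GNU xargs, -r skips the rm when the list is empty. A demo in a scratch directory with hypothetical file names:

```shell
mkdir -p /tmp/loglist_demo && cd /tmp/loglist_demo
rm -f rs-demo.log*
touch rs-demo.log rs-demo.log.1 rs-demo.log.2 rs-demo.log.3 rs-demo.log.4
ls rs-demo.log* | tail -n +4 | xargs -r rm -v   # removes .3 and .4
ls rs-demo.log*
```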
Done.
Tuesday, August 26, 2014
SOLVED: Multiprocess updates of shared dict-of-dicts
I've run into my very first bug in Python. It's a known, existing bug, but it's still the first time I've actually encountered one spontaneously.
I should caveat that I occasionally forget things, and I've been using Python for 8 years, so it could be this has happened before and I've forgotten it.
The problem is as follows:
* I'm working in multiple processes, so I'm sharing a data structure across processes. It's a dict. So, I do:
from multiprocessing import Manager
manager = Manager()
d = manager.dict()
and share the sucker around. Then, I do:
self.ddict.setdefault(mname, {})
#self.ddict[mname][ts] = val
self.log.error("updating mname: %s w/ ts: %s, val: %s" % (mname, ts, val))
self.ddict[mname].update({ts:val})
self.log.error("ddict %s, mname at %s" % (self.ddict, self.ddict[mname]))
output:
2014-08-26 11:20:57,221 MainThread ERROR updating mname: a.b.c.d w/ ts: 1409070057, val: 1.0
2014-08-26 11:20:57,221 MainThread ERROR ddict {'a.b.c.d': {}}, mname at {}
2014-08-26 11:20:57,222 MainThread ERROR ddict: {'a.b.c.d': {}}
According to http://bugs.python.org/issue6766, this is a known bug. I'm using Python2.6 and cannot upgrade.
Dammit. Looking for a workaround.
SOLVED:
The problem is that the shared-memory manager owns the data structure and distributes updates to everyone using it. That means the manager has to know when the data structure is updated in one process so it can tell the other processes to pull in the change.
If you're using a dict-of-dicts (DOD), the inner dict just looks like an opaque blob that process A changes by direct access to the object, without informing the manager of the change. So, to fix this, do:
manager = Manager()
dod = manager.dict()
process ONE: dod['a'] = { 'x' : 12 }
# process TWO sees this update of a = x:12
process TWO: dod['a']['x'] = 13
# process ONE has no idea this happened; the manager doesn't propagate the update.
# to fix:
process TWO: y = dod['a']; y['x'] = 13; dod['a'] = y
# reassigning the key informs the manager that dod changed.
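Here's a minimal runnable version of that workaround (the names are mine): the worker copies the inner dict out, mutates the copy, and reassigns the key so the Manager notices:

```python
from multiprocessing import Manager, Process

def worker(dod):
    # dod['a']['x'] = 13 would mutate a plain dict the manager never sees.
    inner = dod['a']   # copy the inner dict out of the managed dict
    inner['x'] = 13    # mutate the local copy
    dod['a'] = inner   # reassign the key so the manager propagates the change

if __name__ == '__main__':
    manager = Manager()
    dod = manager.dict()
    dod['a'] = {'x': 12}
    p = Process(target=worker, args=(dod,))
    p.start()
    p.join()
    print(dod['a']['x'])  # 13 in the parent process too
```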
Wednesday, January 15, 2014
WANTED: Complete List of WhiteHouse.Gov Petitions
But, there aren't enough signatures yet, so I tried advertising it to my FB friends. I have to get to 150 signatures before the petition becomes visible on the list of open petitions on the website.
That got me thinking: how do I find all the other petitions that aren't available for viewing yet due to being not-advertised enough?
They provide a short URL for these petitions, and the shortener is probably incremental. Mine is http://wh.gov/lInfx (PLEASE SIGN IT), and I figure the other petitions might have URLs near it.
So, here's my program to find them:
#!/usr/bin/python
# My petition: http://wh.gov/lInfx
# Try every 4-character suffix made of A-Z (65-90) and a-z (97-122), inclusive.
nums = range(65, 91) + range(97, 123)
print nums
with open("/tmp/urls.wh", "w") as OFH:
    for i in nums:
        for j in nums:
            for k in nums:
                for m in nums:
                    OFH.write("http://wh.gov/l%c%c%c%c\n" % (i, j, k, m))
# wget -a out.log -t 1 --read-timeout=3 --max-redirect 1 --save-headers
No luck yet; it takes a long time to run. Some petition URLs that have turned up so far:
https://petitions.whitehouse.gov/petition/holidays-muslim/6T6csRph
https://petitions.whitehouse.gov/petition//95WplFSK
https://petitions.whitehouse.gov/petition/muslim-should-have-holiday-their-holiday/dH8ZTMSf
https://petitions.whitehouse.gov/petition/please-protect-peace-monument-nassau-county-new-york-eisenhower-park/hkXX3082
Tuesday, January 07, 2014
Solved: How to install Python ibm_db DB2 driver on RHEL
Problem 1: I kept getting a message about include files not being installed. First, I had to get a sysadmin to install them, but he didn't quite finish the job: the include files are installed by default under /opt/ibm/db2/V10.1, but they're expected to be reachable from /opt/db2inst/sqllib, so I had to create the soft link myself:
$ sudo ln -s /opt/ibm/db2/V10.1/include /opt/db2inst/sqllib/include
That solved the include problem.
Problem 2: I was installing IBM's DB2 Python driver on a RHEL 6.2 Linux box, but I kept getting the message:
Detected 64-bit Python
Environment variable IBM_DB_HOME is not set. Set it to your DB2/IBM_Data_Server_Driver installation directory and retry ibm_db module install.
This was despite executing the userprofile script in /opt/db2inst/sqllib/userprofile, which set the environment vars properly:
$ env | egrep -i ibm
IBM_DB_LIB=/opt/db2inst1/sqllib/lib
IBM_DB_DIR=/opt/db2inst1/sqllib
IBM_DB_HOME=/home/db2inst1/sqllib
IBM_DB_INCLUDE=/opt/db2inst1/sqllib/include
Dammit! I couldn't get past this. I tried both:
- sudo easy_install ibm_db
- sudo pip install ibm_db
Finally, I downloaded the source and ran python setup.py build, which worked, so I was halfway there. Then I had to run sudo python setup.py install (since it installs into system directories): failure, with the same message about the missing IBM_DB_HOME environment variable. But I could run env and see the vars right there! Since I was building from source, I edited setup.py and put a pprint in, showing os.environ. That made it obvious: I was executing under sudo, so as root, and root didn't have the userprofile being executed, so no vars were being set.
Quick: "man sudo" !!
SOLUTION: It shows that to do this right, you must invoke sudo with -E to keep the environment variables from the current environment.
$ sudo -E python setup.py install
Hurray! Installed!
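The debugging trick that cracked it - dumping the environment from inside setup.py - is only a couple of lines, something like:

```python
import os
from pprint import pprint

# Temporarily drop this near the top of setup.py: it prints exactly the
# environment the install step sees, so a missing IBM_DB_HOME is obvious.
pprint(dict(os.environ))
```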
Friday, December 13, 2013
Python version of iostat.c
The constraints:
- Runs on lots of machines;
- machines may be heavily loaded;
- sometimes the kick-off time is delayed beyond the 1-minute standard (typically due to load);
- I don't like having long-running subprocesses: they might stop/fail and I'd have to restart them, they use memory, etc.;
- I'm doing this in Python;
- I can store state between runs in a pickle file;
- I want to replicate the existing fields coming out of vmstat -s and iostat -x -D n (where n is the sample size in seconds);
- I want the values of these fields to match likewise.
Problem 1: Where is the source code for iostat.c? In Ubuntu at least (hoping RHEL/CentOS is similar), it turns out it's in the sysstat package. I found the sysstat source at: http://freecode.com/projects/sysstat.
Inside this package, there's source code for iostat as a file named iostat.c; I have yet to find it standalone online.
General plan:
- read current values from /proc/stat, /proc/diskstats, and vmstat -s;
- read the values from the previous run from disk;
- find the diffs;
- use the diffs to compute the values needed;
- write the current values to disk for next time.
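A minimal sketch of the "diff against last run" core of that plan (the file location and function name are my own invention, not from a finished tool): cumulative counters go into a pickle each run, and rates come from the delta against the previous snapshot:

```python
import os
import pickle

STATE = "/tmp/iostat_state.pkl"  # my choice of location

def compute_rates(counters, now, state_path=STATE):
    """counters: {name: cumulative count} read this run (e.g. parsed from
    /proc/diskstats). Diffs against the previous run's pickled snapshot,
    then saves the current snapshot for next time. Returns per-second rates."""
    rates = {}
    if os.path.exists(state_path):
        with open(state_path, "rb") as fh:
            prev_time, prev = pickle.load(fh)
        dt = now - prev_time
        if dt > 0:
            for name, val in counters.items():
                if name in prev:
                    rates[name] = (val - prev[name]) / dt
    with open(state_path, "wb") as fh:
        pickle.dump((now, counters), fh)
    return rates
```

The first run returns an empty dict (there's no previous snapshot to diff against); every later run returns the per-second rate over the interval since the last run, however delayed the kick-off was.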
Sunday, October 27, 2013
SOLVED: Ubuntu installation of Canon MX452 Inkjet Printer
- Unbox printer.
- Remove orange packing tape.
- Unbox power cord and USB cable; install them.
- Open the front pull-down cover, and the one under it - push down gently on the grey loops and put in the inkjet cartridges.
- Put paper in at the bottom; it only holds 50 sheets or so.
- Turn on, wait.
- Open a terminal window, type sudo ls, and enter your password.
- Open a browser and get the download from http://support-sg.canon-asia.com/contents/SG/EN/0100515301.html
- In the terminal: cd ~/Downloads
- tar -xzvf cnijfilter*
- cd cnijfilter*
- On the printer, turn it off and on again, just in case.
- sudo ./install.sh
- Follow the prompts, accepting defaults.
- In the browser, open google.com and print the page as a test. You should hear the printer working.
Tuesday, October 22, 2013
Optimizing Python - getting data out of memcache with struct.unpack
So, I have this Memcache data store that holds timestamps and values from a monitoring application. Since each memcache key corresponds to an hour's data, I only need to store 2 bytes for the number of seconds past the hour. I don't care about duplicate data being stored, but on retrieval I'd like to eliminate it if it exists.
Input data is: (ts, val), (ts, val), ... encoded using Python's struct.pack. The ts (timestamp) is (as noted) packed with format h (a signed 2-byte short - plenty for 0-3599 seconds past the hour). The val (value) is a 4-byte floating point number, packed with format f.
The original version of this encoding was:
def OLD_rawDataToTsVals(self, timeOffset, raw):
    tsVals = []
    while raw:
        ts, val, raw = raw[:2], raw[2:6], raw[6:]
        ts = timeOffset + struct.unpack('h', ts)[0]
        val = struct.unpack('f', val)[0]
        tsVals.append((ts, val))
    return tsVals
I found that this version:
- ran really slowly;
- didn't eliminate duplicate values;
- choked harder the longer the input data got (as in 33,000 datapoints in an hour).
For the second version, I knew I had to stop with the copying of the data string over and over again, which I knew was eating major cycles.
Doing some math, I figured out I could iterate over the string, extracting each element and converting the two parts to python numbers.
def rawDataToTsVals(self, timeOffset, raw):
    tsVals = []
    for i in range(0, len(raw), 6):
        rawtime = raw[i:i+2]
        ts = timeOffset + struct.unpack('h', rawtime)[0]
        val = struct.unpack('f', raw[i+2:i+6])[0]
        tsVals.append((ts, val))
    return tsVals
This was better timewise, but didn't remove duplicate data. I made the 'seen it yet' test occur even before the conversion to (int, float), which saved a bit of time doing useless conversions.
def rawDataToTsVals(self, timeOffset, raw):
    tsVals = []
    seenTimes = set()
    for i in range(0, len(raw), 6):
        rawtime = raw[i:i+2]
        if rawtime in seenTimes:
            continue
        seenTimes.add(rawtime)
        ts = timeOffset + struct.unpack('h', rawtime)[0]
        val = struct.unpack('f', raw[i+2:i+6])[0]
        tsVals.append((ts, val))
    return tsVals
Yet, it was STILL TOO SLOW. Where was the time going? I timed the various parts and found the slow bit was the conversion to int/float. That unpack was happening a lot and the time added up.
I tried the following but FAILED.
# BAD DON'T USE ** BAD DON'T USE **
elems = rawLen / 6.0  # 6 bytes per entry: 2 = time + 4 = data.
intElems = int(elems)
if elems != intElems:
    self.log.warning("elems non-integer: len: %s" % (rawLen))
    return []
unp = struct.unpack("hf"*intElems, raw)
# BAD DON'T USE ** BAD DON'T USE **
The above fails because when the fields were packed individually there's no padding between them, but the combined format (in native mode) expects each float to be aligned on a word boundary - the data would have to look like (short, padding, float) for that unpack to cope.
But I couldn't give up; this had to work better. So I extract all the shorts, string those together and unpack them in one call, then do the same thing with the floats.
HERE IS THE FINAL VERSION:
def rawDataToTsVals(self, timeOffset, raw):
    tsVals = []
    seenTimes = set()
    try:
        rawLen = len(raw)
        times = ""
        vals = ""
        for i in range(0, rawLen, 6):
            rawtime = raw[i:i+2]
            if rawtime in seenTimes:
                continue
            seenTimes.add(rawtime)
            times += rawtime
            vals += raw[i+2:i+6]
        timesList = struct.unpack('h'*(len(times)/2), times)
        valsList = struct.unpack('f'*(len(vals)/4), vals)
        assert len(timesList) == len(valsList), "Lens of times and vals unequal, t=%s, v=%s" % (len(timesList), len(valsList))
        for i in range(0, len(timesList)):
            tsVals.append((timeOffset+timesList[i], valsList[i]))
        #self.log.debug("unpacked %d vals, len ts %s." % (len(timesList), len(tsVals)))
    except:
        tb = traceback.format_exc()
        self.log.debug("tb in rawDataToTsVals(): rawSize: %s, %s" % (rawLen, tb))
    if 0: # debugging
        self.log.debug("tsvals: %s" % (tsVals))
    return tsVals
Unpacking all the h's at once, and likewise all the f's, sidesteps the alignment problem, and since it's two struct calls instead of thousands, it's very fast.
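One footnote on the failed combined unpack (my observation, not from the original write-up): struct only inserts alignment padding in native mode. Prefixing the format with = or < selects standard sizes with no padding, so a single unpack over the interleaved 6-byte records can work after all:

```python
import struct

# Pack three (ts, val) pairs the same way as above:
# a 2-byte short followed by a 4-byte float, no padding between them.
pairs_in = [(1, 1.0), (2, 2.0), (3, 3.0)]
raw = b"".join(struct.pack('h', t) + struct.pack('f', v) for t, v in pairs_in)
assert len(raw) == 18  # 6 bytes per pair

# '=' means native byte order but standard sizes and NO alignment padding,
# so 'hf' is 6 bytes and one call unpacks the whole buffer.
n = len(raw) // 6
flat = struct.unpack('=' + 'hf' * n, raw)
pairs_out = list(zip(flat[0::2], flat[1::2]))
print(pairs_out)  # [(1, 1.0), (2, 2.0), (3, 3.0)]
```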
Enjoy!