Professional programmer; amateur home handyman (on our home only); tinkerer; husband; father of 3; attempting to be a renaissance guy (to know at least a little about a lot of subjects, a doomed pursuit in an information age); geek-arts-and-sciences enthusiast. Interest areas: Science fiction, wind turbines, electric cars, renewable energy, making things.
Thursday, August 28, 2014
SOLVED: diff 2 files, get results without markup
First, I set up some working file names and create a file with all the log files listed:
$ fullLoglist="/tmp/full_loglist"
$ keepLoglist="/tmp/keep_loglist"
$ rmLoglist="/tmp/rm_loglist"
$ ls ${d}/rs*log* > ${fullLoglist}
$ cat ${fullLoglist}
graphdb_1j/rs-47-a/rs-47-a.log
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T13-25-13
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-46-46
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21
I only want the top 3 files, though, and to rm the older ones. So, I extract the ones I want to keep:
$ cat ${fullLoglist} | head -3 > $keepLoglist
$ cat $keepLoglist
graphdb_1j/rs-47-a/rs-47-a.log
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T13-25-13
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-46-46
Now, I need to get a diff of the rest of them so I know what to call rm on. But how? If I just do a diff, I get:
$ diff /tmp/full_loglist /tmp/keep_loglist
4,8d3
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
< graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21
I DON'T WANT the extra markup. I want diff without markup. I try to google:
linux diff without markup
linux diff without greater-than less-than
linux diff line format markup
linux diff supress markup
linux diff only different lines
I try various options, like:
$ diff --line-format "%L" --suppress-common-lines /tmp/full_loglist /tmp/keep_loglist
graphdb_1j/rs-47-a/rs-47-a.log
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T13-25-13
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-46-46
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21
This is WRONG: it prints all the lines, not just the different ones. Ugh. (It turns out --suppress-common-lines only applies to side-by-side output, and --line-format "%L" prints every line.)
SOLUTION ONE: comm -3
I found two solutions. The first is to use comm with the -3 option, which suppresses the lines common to both files and prints only the lines unique to each. (Note that comm expects its inputs to be sorted.) I hadn't ever heard of comm before, but it's nice:
$ comm -3 /tmp/full_loglist /tmp/keep_loglist
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21
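One caveat worth spelling out: comm's output is columnar - lines unique to the first file in column 1, lines unique to the second file indented in column 2, common lines in column 3 - and -3 suppresses the common column. Since my keep-list is a strict subset of the full list, only column 1 survives. A tiny demo with made-up file names:

```shell
printf 'a\nb\nc\nd\n' > /tmp/all_demo
printf 'a\nb\n'       > /tmp/keep_demo
comm -3 /tmp/all_demo /tmp/keep_demo
# keep_demo is a subset, so only column-1 lines remain:
# c
# d
```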
SOLUTION TWO: sort | uniq -u
I don't care about the ordering of these things. So, I can use the simple solution of sort and uniq -u. The uniq command normally removes adjacent duplicate lines (which is why the sort comes first). Its -u option goes further: it prints only lines that occur once-and-only-once.
$ cat /tmp/full_loglist /tmp/keep_loglist | sort | uniq -u
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T14-48-44
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-14-35
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T15-35-11
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-27T19-48-56
graphdb_1j/rs-47-a/rs-47-a.log.2014-08-28T13-44-21
SOLVED! No > or < signs, and no line-number markup from diff.
Script to implement this process, for your edification:
#!/bin/bash
cd /opt/storage
dirlist=`ls -d graphdb_[1234][ij]/rs-*`
fullLoglist="/tmp/full_loglist"
keepLoglist="/tmp/keep_loglist"
rmLoglist="/tmp/rm_loglist"
for d in $dirlist
do
    echo "---------------------------"
    echo "dir: ${d}"
    ls ${d}/rs*log* > ${fullLoglist}
    ls -al ${fullLoglist}
    echo "full files: `cat ${fullLoglist}`"
    cat ${fullLoglist} | head -3 > $keepLoglist
    echo "keep files: `cat ${keepLoglist}`"
    cat $keepLoglist $fullLoglist | sort | uniq -u > $rmLoglist
    echo "rm files: `cat ${rmLoglist}`"
    if [ -s $rmLoglist ]
    then
        echo "Removing files...."
        cat $rmLoglist | xargs rm -v
    else
        echo "No files to remove this time."
    fi
done
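As an aside (my variation, not part of the original script): the keep-3/remove-the-rest split can also be done in one pipeline with tail -n +4, which emits everything from the 4th line onward; with GNU xargs, -r skips the rm when the list is empty. A demo in a scratch directory with hypothetical file names:

```shell
mkdir -p /tmp/loglist_demo && cd /tmp/loglist_demo
rm -f rs-demo.log*
touch rs-demo.log rs-demo.log.1 rs-demo.log.2 rs-demo.log.3 rs-demo.log.4
ls rs-demo.log* | tail -n +4 | xargs -r rm -v   # removes .3 and .4
ls rs-demo.log*
```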
Done.
Tuesday, August 26, 2014
SOLVED: Multiprocess updates of shared dict-of-dicts
I've run into my very first bug in Python. It's a known, existing bug, but it's still the first time I've actually encountered one spontaneously.
I should caveat that I occasionally forget things, and I've been using Python for 8 years, so it could be this has happened before and I've forgotten it.
The problem is as follows:
* I'm working in multiple processes, so I'm sharing a data structure across processes. It's a dict. So, I do:
from multiprocessing import Manager
manager = Manager()
d = manager.dict()
and share the sucker around. Then, I do:
self.ddict.setdefault(mname, {})
#self.ddict[mname][ts] = val
self.log.error("updating mname: %s w/ ts: %s, val: %s" % (mname, ts, val))
self.ddict[mname].update({ts:val})
self.log.error("ddict %s, mname at %s" % (self.ddict, self.ddict[mname]))
output:
2014-08-26 11:20:57,221 MainThread ERROR updating mname: a.b.c.d w/ ts: 1409070057, val: 1.0
2014-08-26 11:20:57,221 MainThread ERROR ddict {'a.b.c.d': {}}, mname at {}
2014-08-26 11:20:57,222 MainThread ERROR ddict: {'a.b.c.d': {}}
According to http://bugs.python.org/issue6766, this is a known bug. I'm using Python2.6 and cannot upgrade.
Dammit. Looking for a workaround.
SOLVED:
The problem is that the shared-memory manager owns the data structure and distributes updates to everyone using it. That means the manager has to know when the data structure is updated in one process so it can tell the other processes to pull in the change.
If you're using a dict-of-dicts (DOD), the inner dict just looks like an opaque blob that process A changes by direct access to the object, without informing the manager of the change. So, to fix this, do:
manager = Manager()
dod = manager.dict()
process ONE: dod['a'] = { 'x' : 12 }
# process TWO sees this update of a = x:12
process TWO: dod['a']['x'] = 13
# process ONE has no idea this happened; the manager doesn't propagate the update.
# to fix:
process TWO: y = dod['a']; y['x'] = 13; dod['a'] = y
# reassigning the key informs the manager that dod changed.
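Here's a minimal runnable version of that workaround (the names are mine): the worker copies the inner dict out, mutates the copy, and reassigns the key so the Manager notices:

```python
from multiprocessing import Manager, Process

def worker(dod):
    # dod['a']['x'] = 13 would mutate a plain dict the manager never sees.
    inner = dod['a']   # copy the inner dict out of the managed dict
    inner['x'] = 13    # mutate the local copy
    dod['a'] = inner   # reassign the key so the manager propagates the change

if __name__ == '__main__':
    manager = Manager()
    dod = manager.dict()
    dod['a'] = {'x': 12}
    p = Process(target=worker, args=(dod,))
    p.start()
    p.join()
    print(dod['a']['x'])  # 13 in the parent process too
```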
Wednesday, January 15, 2014
WANTED: Complete List of WhiteHouse.Gov Petitions
But, there aren't enough signatures yet, so I tried advertising it to my FB friends. I have to get to 150 signatures before the petition becomes visible on the list of open petitions on the website.
That got me thinking: how do I find all the other petitions that aren't available for viewing yet due to being not-advertised enough?
They provide a short URL for these petitions, and the shortener is probably incremental. Mine is http://wh.gov/lInfx (PLEASE SIGN IT), and I figure the other petitions might have URLs near it.
So, here's my program to find them:
#!/usr/bin/python
# My petition: http://wh.gov/lInfx
# Try every 4-character suffix made of A-Z (65-90) and a-z (97-122), inclusive.
nums = range(65, 91) + range(97, 123)
print nums
with open("/tmp/urls.wh", "w") as OFH:
    for i in nums:
        for j in nums:
            for k in nums:
                for m in nums:
                    OFH.write("http://wh.gov/l%c%c%c%c\n" % (i, j, k, m))
# wget -a out.log -t 1 --read-timeout=3 --max-redirect 1 --save-headers
No luck yet; it takes a long time to run. Some petition URLs that have turned up so far:
https://petitions.whitehouse.gov/petition/holidays-muslim/6T6csRph
https://petitions.whitehouse.gov/petition//95WplFSK
https://petitions.whitehouse.gov/petition/muslim-should-have-holiday-their-holiday/dH8ZTMSf
https://petitions.whitehouse.gov/petition/please-protect-peace-monument-nassau-county-new-york-eisenhower-park/hkXX3082
Tuesday, January 07, 2014
Solved: How to install Python ibm_db DB2 driver on RHEL
Problem 1: I kept getting a message about include files not being installed. First, I had to get a sysadmin to install them, but he didn't quite finish the job: the include files are installed by default under /opt/ibm/db2/V10.1, but they're expected to be reachable from /opt/db2inst/sqllib, so I had to create the soft link myself:
$ sudo ln -s /opt/ibm/db2/V10.1/include /opt/db2inst/sqllib/include
That solved the include problem.
Problem 2: I was installing IBM's DB2 Python driver on a RHEL 6.2 Linux box, but I kept getting the message:
Detected 64-bit Python
Environment variable IBM_DB_HOME is not set. Set it to your DB2/IBM_Data_Server_Driver installation directory and retry ibm_db module install.
This was despite executing the userprofile script in /opt/db2inst/sqllib/userprofile, which set the environment vars properly:
$ env | egrep -i ibm
IBM_DB_LIB=/opt/db2inst1/sqllib/lib
IBM_DB_DIR=/opt/db2inst1/sqllib
IBM_DB_HOME=/home/db2inst1/sqllib
IBM_DB_INCLUDE=/opt/db2inst1/sqllib/include
Dammit! I couldn't get past this. I tried both:
- sudo easy_install ibm_db
- sudo pip install ibm_db
Finally, I downloaded the source and ran python setup.py build, which worked, so I was halfway there. Then I had to run sudo python setup.py install (since it installs into system directories): failure, with the same message about the missing IBM_DB_HOME environment variable. But I could run env and see the vars right there! Since I was building from source, I edited setup.py and put a pprint in, showing os.environ. That made it obvious: I was executing under sudo, so as root, and root didn't have the userprofile being executed, so no vars were being set.
Quick: "man sudo" !!
SOLUTION: It shows that to do this right, you must invoke sudo with -E to keep the environment variables from the current environment.
$ sudo -E python setup.py install
Hurray! Installed!
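The debugging trick that cracked it - dumping the environment from inside setup.py - is only a couple of lines, something like:

```python
import os
from pprint import pprint

# Temporarily drop this near the top of setup.py: it prints exactly the
# environment the install step sees, so a missing IBM_DB_HOME is obvious.
pprint(dict(os.environ))
```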
Friday, December 13, 2013
Python version of iostat.c
The constraints:
- Runs on lots of machines;
- machines may be heavily loaded;
- sometimes the kick-off time is delayed beyond the 1-minute standard (typically due to load);
- I don't like having long-running subprocesses: they might stop/fail and I'd have to restart them, they use memory, etc.;
- I'm doing this in Python;
- I can store state between runs in a pickle file;
- I want to replicate the existing fields coming out of vmstat -s and iostat -x -D n (where n is the sample size in seconds);
- I want the values of these fields to match likewise.
Problem 1: Where is the source code for iostat.c? In Ubuntu at least (hoping RHEL/CentOS is similar), it turns out it's in the sysstat package. I found the sysstat source at: http://freecode.com/projects/sysstat.
Inside this package, there's source code for iostat as a file named iostat.c; I have yet to find it standalone online.
General plan:
- read current values from /proc/stat, /proc/diskstats, and vmstat -s;
- read the values from the previous run from disk;
- find the diffs;
- use the diffs to compute the values needed;
- write the current values to disk for next time.
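A minimal sketch of the "diff against last run" core of that plan (the file location and function name are my own invention, not from a finished tool): cumulative counters go into a pickle each run, and rates come from the delta against the previous snapshot:

```python
import os
import pickle

STATE = "/tmp/iostat_state.pkl"  # my choice of location

def compute_rates(counters, now, state_path=STATE):
    """counters: {name: cumulative count} read this run (e.g. parsed from
    /proc/diskstats). Diffs against the previous run's pickled snapshot,
    then saves the current snapshot for next time. Returns per-second rates."""
    rates = {}
    if os.path.exists(state_path):
        with open(state_path, "rb") as fh:
            prev_time, prev = pickle.load(fh)
        dt = now - prev_time
        if dt > 0:
            for name, val in counters.items():
                if name in prev:
                    rates[name] = (val - prev[name]) / dt
    with open(state_path, "wb") as fh:
        pickle.dump((now, counters), fh)
    return rates
```

The first run returns an empty dict (there's no previous snapshot to diff against); every later run returns the per-second rate over the interval since the last run, however delayed the kick-off was.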
Sunday, October 27, 2013
SOLVED: Ubuntu installation of Canon MX452 Inkjet Printer
- Unbox printer.
- Remove orange packing tape.
- Unbox power cord and USB cable; install them.
- Open the front pull-down cover, and the one under it - push down gently on the grey loops and put in the inkjet cartridges.
- Put paper in at the bottom; it only holds 50 sheets or so.
- Turn on, wait.
- Open a terminal window, type sudo ls, and enter your password.
- Open a browser and get the download from http://support-sg.canon-asia.com/contents/SG/EN/0100515301.html
- In the terminal: cd ~/Downloads
- tar -xzvf cnijfilter*
- cd cnijfilter*
- On the printer, turn it off and on again, just in case.
- sudo ./install.sh
- Follow the prompts, accepting defaults.
- In the browser, open google.com and print the page as a test. You should hear the printer working.
Tuesday, October 22, 2013
Optimizing Python - getting data out of memcache with struct.unpack
So, I have this Memcache data store that holds timestamps and values from a monitoring application. Since each memcache key corresponds to an hour's data, I only need to store 2 bytes for the number of seconds past the hour. I don't care about duplicate data being stored, but on retrieval I'd like to eliminate it if it exists.
Input data is: (ts, val), (ts, val), ... encoded using Python's struct.pack. The ts (timestamp) is (as noted) packed with format h (a signed 2-byte short - plenty for 0-3599 seconds past the hour). The val (value) is a 4-byte floating point number, packed with format f.
The original version of this encoding was:
def OLD_rawDataToTsVals(self, timeOffset, raw):
    tsVals = []
    while raw:
        ts, val, raw = raw[:2], raw[2:6], raw[6:]
        ts = timeOffset + struct.unpack('h', ts)[0]
        val = struct.unpack('f', val)[0]
        tsVals.append((ts, val))
    return tsVals
I found that this version:
- ran really slowly;
- didn't eliminate duplicate values;
- choked harder the longer the input data got (as in 33,000 datapoints in an hour).
For the second version, I knew I had to stop with the copying of the data string over and over again, which I knew was eating major cycles.
Doing some math, I figured out I could iterate over the string, extracting each element and converting the two parts to python numbers.
def rawDataToTsVals(self, timeOffset, raw):
    tsVals = []
    for i in range(0, len(raw), 6):
        rawtime = raw[i:i+2]
        ts = timeOffset + struct.unpack('h', rawtime)[0]
        val = struct.unpack('f', raw[i+2:i+6])[0]
        tsVals.append((ts, val))
    return tsVals
This was better timewise, but didn't remove duplicate data. I made the 'seen it yet' test occur even before the conversion to (int, float), which saved a bit of time doing useless conversions.
def rawDataToTsVals(self, timeOffset, raw):
    tsVals = []
    seenTimes = set()
    for i in range(0, len(raw), 6):
        rawtime = raw[i:i+2]
        if rawtime in seenTimes:
            continue
        seenTimes.add(rawtime)
        ts = timeOffset + struct.unpack('h', rawtime)[0]
        val = struct.unpack('f', raw[i+2:i+6])[0]
        tsVals.append((ts, val))
    return tsVals
Yet, it was STILL TOO SLOW. Where was the time going? I timed the various parts and found the slow bit was the conversion to int/float. That unpack was happening a lot and the time added up.
I tried the following but FAILED.
# BAD DON'T USE ** BAD DON'T USE **
elems = rawLen / 6.0  # 6 bytes per entry: 2 = time + 4 = data.
intElems = int(elems)
if elems != intElems:
    self.log.warning("elems non-integer: len: %s" % (rawLen))
    return []
unp = struct.unpack("hf"*intElems, raw)
# BAD DON'T USE ** BAD DON'T USE **
The above fails because when the fields were packed individually there's no padding between them, but the combined format (in native mode) expects each float to be aligned on a word boundary - the data would have to look like (short, padding, float) for that unpack to cope.
But I couldn't give up; this had to work better. So I extract all the shorts, string those together and unpack them in one call, then do the same thing with the floats.
HERE IS THE FINAL VERSION:
def rawDataToTsVals(self, timeOffset, raw):
    tsVals = []
    seenTimes = set()
    try:
        rawLen = len(raw)
        times = ""
        vals = ""
        for i in range(0, rawLen, 6):
            rawtime = raw[i:i+2]
            if rawtime in seenTimes:
                continue
            seenTimes.add(rawtime)
            times += rawtime
            vals += raw[i+2:i+6]
        timesList = struct.unpack('h'*(len(times)/2), times)
        valsList = struct.unpack('f'*(len(vals)/4), vals)
        assert len(timesList) == len(valsList), "Lens of times and vals unequal, t=%s, v=%s" % (len(timesList), len(valsList))
        for i in range(0, len(timesList)):
            tsVals.append((timeOffset+timesList[i], valsList[i]))
        #self.log.debug("unpacked %d vals, len ts %s." % (len(timesList), len(tsVals)))
    except:
        tb = traceback.format_exc()
        self.log.debug("tb in rawDataToTsVals(): rawSize: %s, %s" % (rawLen, tb))
    if 0: # debugging
        self.log.debug("tsvals: %s" % (tsVals))
    return tsVals
Unpacking all the h's at once, and likewise all the f's, sidesteps the alignment problem, and since it's two struct calls instead of thousands, it's very fast.
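One footnote on the failed combined unpack (my observation, not from the original write-up): struct only inserts alignment padding in native mode. Prefixing the format with = or < selects standard sizes with no padding, so a single unpack over the interleaved 6-byte records can work after all:

```python
import struct

# Pack three (ts, val) pairs the same way as above:
# a 2-byte short followed by a 4-byte float, no padding between them.
pairs_in = [(1, 1.0), (2, 2.0), (3, 3.0)]
raw = b"".join(struct.pack('h', t) + struct.pack('f', v) for t, v in pairs_in)
assert len(raw) == 18  # 6 bytes per pair

# '=' means native byte order but standard sizes and NO alignment padding,
# so 'hf' is 6 bytes and one call unpacks the whole buffer.
n = len(raw) // 6
flat = struct.unpack('=' + 'hf' * n, raw)
pairs_out = list(zip(flat[0::2], flat[1::2]))
print(pairs_out)  # [(1, 1.0), (2, 2.0), (3, 3.0)]
```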
Enjoy!