Friday, June 16, 2017

Merge A Data Set With A Template File To Generate Output Files

For something I'm working on I need to be able to create a large number of files by filling in fields in a template file with entries from a data set. You'd think that would be easy with Linux but I couldn't find a way to do it. (This will be where people tell me a thousand different ways to do it) I didn't think what I wanted was complicated so I wrote SimpleMerge to take care of it. It is a basic Python script that takes data from a tab delimited data file and fills in data fields in a template file.

The first row of the data file are the field identifiers to find and replace and the other rows are just data. This file can be easily generated from a spreadsheet program. The template file contains the structure of the file you intend to create, just with field identifiers in the place of real data.

I haven't done extensive testing on the program but it seems to work fine.  It handles UTF-8 file encoding and maintains the line endings of the template file for both UNIX and Windows systems. The following command generates the two files File1.txt and File2.txt as seen in the block diagram below. template.txt Data.txt

Block Diagram
Simple Merge Block Diagram
You can use this method on any file really, even SVG files.  Hint hint wink wink.  You can go from this template file.....

Periodic Table Symbols
SVG Template

to this in a matter of minutes. Just by replacing colour and three text fields.

Periodic Table Symbols
Generated Images

I make no guarantee as to how well this works. So my advice is to back things up before using it. Have fun.
Get The Code!

Saturday, June 3, 2017

Generating Stippled Images with Stiptacular

A few weeks ago I posted about how to generate stippled images from regular input images. The code was garbage at the time so I've improved it and posted it for people to use or learn from.  I only just remembered why I started this project. I found the StippleGen program from EvilMadScientist (EMS) but it was written in the processing development environment which I didn't have much luck with. I thought it'd be great to have a Python version out there as well and along the way added my own tweaks.

  • Better initial distribution of seed points via a PseudoHilbert Curve
  • Dithering to make points distribute more evenly
  • A term that sets how much points are attracted to darker areas

Since I'm using trying to replicate the work of EMS It thought I'd use their test image of Grace Kelly. Their interface is beautiful and has a few bells and whistles that mine doesn't because I figured that I could do any pre-processing in something like GIMP.

Grace Kelly
Grace Kelly test image 943x834
For an initial test I'll use 2000 points with 5 rounds of dithering and 5 rounds without dithering.  The simplest way to think of the adjustment parameter is like a contrast setting.  In this case it is three. This will cube the value of each pixel and re-scale all the data to a 0-255 range.

number_of_points = 2000
dot_radius = 2.5
non_dithering_iterations = 5
dithering_iterations = 5
adjustment_parameter = 3

Processing the image with the above settings took about 90 seconds and produces the following SVG file.

Grace Kelly
Large Grace Kelly test image - 2000 points

To demonstrate another quirk I noticed, take a look at the smaller image below.  It contains the same number of dots as the image above.  As a matter of fact, it's the exact same image just scaled down. However the dots start to look more like a face. So scale is important.  If I were to do an 8 foot print of this for a wall in my house it wouldn't look that good because I couldn't get far enough away from it for the image to emerge. I would need to use more points.
Grace Kelly
Small Grace Kelly test image - 2000 points
The image below uses 30000 points and takes about 25 minutes to generate. Writing this in something like C would help a lot..  Calculating the centroids of the Voronoi regions can be done in parallel to speed things up as well.
Grace Kelly
Grace Kelly test image - 30000 points
You may notice that some of the darker regions look a little strange. This is caused by points aligning, as can be seen in the close up below. This can be solved by reducing the number of points, dropping the adjustment parameter so the area isn't so crowded, performing more dithering steps, or just enlarging the source image before processing.

Stipple Error
Alignment artefacts
Get the Code!
You'll also need to install the PseudoHilbert Curve Module for Stiptacular to work.  It's a dependency that I'd like to eventually remove, but for now it's needed.

The posts below are my train of thought while developing this script.  It may help if you get a bit lost.

Voronoi Stippling
Entry and Exit Points for Space Filling Paths on a Grid
Hilbert Curve Generation With Lookup Tables
Converting Binary to Gray Code with XOR
Calculating Hilbert Curve Coordinates
Pseudo Hilbert Curve for Arbitrary Rectangular Regions - Part 1
Pseudo Hilbert Curve for Arbitrary Rectangular Regions - Part 2
Efficient Centroid Calculation for Discrete Areas
Generating Seed Points For Voronoi Stippling
Generating Stippled Images with Stiptacular

Here's a picture of a Atlantis during the STS-132 shuttle launch made of 30000 points. To infinity and beyond!
I wish I had the time to do an even deeper dive on this type of problem.

Saturday, May 20, 2017

Linux Text To Speech with Saved Audio

In my last blog post I described a procedure to find a forgotton PIN for 10 digit mechanical lock boxes where you enter a specific sequence of button presses to efficiently test all the combinations. The generated sequence was supplied in the form of a text file, and although this works, it's a little cumbersome moving your eyes between the buttons and paper all the time. It occurred to me that this would be a lot easier if the numbers were read to me. I then imagined how easy it would be if I had a pair of headphones and the instruction in an audio file on my phone. This seemed like the perfect application for a text to speech application and Linux.

After a little bit of research I decided to use the eSpeak speech synthesizer. It has many options for different voices for different languages and countries and allows quite a bit of customisation of the the way the text is read.

The command below that converts the text in "lockbox.txt" to audio in "lockbox.wav" uses the english voice (-ven), pronounces capital letters in a certain way (-k20), leaves a certain gap between words (-g4), reads back at a certain words per minutes (-s90), and  has a certain pitch (-p29). It's that easy!

espeak -ven -k20 -g4 -s90 -p20 -f lockbox.txt -w lockbox.wav

Before processing the file I made some slight alterations to it by replacing some of the commands in the lock box opening sequence. Originally the commands were zero thought nine, open, and clear. I replaced open with test as it was only one syllable and easier to hear.  It's also important to leave spaces between numbers otherwise it will read 11 as "eleven" instead of "one one".

Here is the instructional WAV file converted to an MP3.  It goes for 30 minutes or so and with a little bit of practice you should be able to follow along at that speed.  If you can't, that's fine, just slow the speed down in your music player.  If you screw up just go back 15 or twenty seconds to catch up.

For the YouTube fans out there, here is another version. It might possibly be the most boring and monotonous video on YouTube. That's my speciality though :-)

To be serious though I'd like to try eSpeak on a Raspberry Pi.  I think it'd be great to read out status updates and events.  Compared to some of the other synthesized voices I've heard it's actually pretty good.

Friday, May 19, 2017

Testing All The PINs On A Lock Box With A Forgotten Code

A long time ago I did a blog post on PIN coded lock boxes that you put on your house to hold a spare set of keys. Their main function is to give easy access for emergency services in case they need keys to get in if an elderly relative has a fall and can't get up. The point of my post was to demonstrate that although they have 10 buttons, look secure, and you can choose how many digits are in the PIN, they only really have 1024 PIN combinations. This is because the numbers can only be used once in each PIN.

PIN coded key lock box
The math is explained in my initial blog post but the results are below. I opined it would take about 4 hours to try all the combinations, and in this case it would require pressing the open button 1024 times, the clear button 1024 times, and pressing digits (0*1 + 1*10 + 2*45 + 3*120 + 4*210 + 5*252 + 6*210 + 7*120 + 8*45 + 9*10 + 10*1) = 5120 times.  In total there are 7168 button presses. That works out at about one button press every 2 seconds.
Number of PINs vs PIN length
At some point in the 5 years since I first wrote about these locks (the time flies doesn't it) it occurred to me that I was being inefficient. Testing if the box opens doesn't clear the current code. So I  can test multiple PINs at once by just chaining them together. It's similar to the De Bruijn sequence, a mathematical tool that can be used to brute force another type of PIN. In this case however you have to reset the lock after a few buttons are pressed. To illustrate the process I'm trying to explain, imagine that the clear button has been pressed and I then enter the numbers 0 through 9, testing if the lock will open in between numbers. I've actually just tested the codes (Null) (0) (01) (012) (0123) (01234) (012345) (0123456) (01234567) (012345678) (0123456789). I've tested 11 codes with one press of the clear button, 11 presses of the open button, and just 10 digit presses. The whole process won't be this efficient but it'll be better than nothing.

But where do we start? In a perfect world, you may see by looking at the chart below and table above, that the best we can do is 1 sequence testing PINs of length 0-10,  followed by 9 sequences testing PINs of length 1-9, 35 sequences testing PINs of length 2-8, 75 sequences testing PINs of length 3-7, 90 sequences testing PINs of length 4-6, and 42 sequences testing PINs of length 5.

This will still result in 1024 presses of the test button, only 252 presses of the clear button, and (1*10)+(9*9)+(35*8)+(75*7)+(90*6)+(42*5) = 1646 presses of the number buttons. For a new total of 2922 button presses, or about 41% of the original estimate.
Distribution of PIN lengths
There's no guarantee that these sequences exist though and I had no deep understanding of how to create them, so I tried the first thing that came to mind and got lucky.

First, generate all possible combinations for each PIN length and then sort the numbers in each PIN. Then sort the PINs for each length comparing them element by element.  Start in the middle with the PINs of length 5 as there are more of these than the others. Then take the PINs of length 4 and go through them one by one from the start and place them to the left of first 5 digit PIN that could follow it. For instance (1, 5, 6, 7) might go to the left of (1, 2, 5, 6, 7) as only a 2 would have to be pressed to get to the 5 digit PIN. Do the same for the 6 digit PINs and place them on the right of the 5 digits. In this case (1, 2, 3, 5, 6, 7) might go to the right of (1, 2, 5, 6, 7) as only a 3 needs to be pressed. Repeat this left right procedure until all PINs are processed. It will look like a mess, but if you sort the sequences by length you will get a spreadsheet that looks like the one below (zoomed and rotated to fit). Look familiar? It's reflects the distribution graph above, and it shows that the sequences can be generated.

Rotated spreadsheet of the sequences
I've placed my code in a Github Gist for you to generate your own sequences in case there are a different number of buttons on your lock. If you have a lock with 10 buttons I've already generated the file for you. It looks a little different from the output of the Python file because I've done some some find and replace operations in Notepad++. I've used the word test instead of unlock as well.

Update1 There is a MP3 file of the instructions in this blog post, Linux Text To Speech With Saved Audio.

Update2 I've sorted the sequences by how often the 4 digit PINs appear in the 2009 Rock You data breach. There are two new files that cover the all the PINs. One of them has the sequences sorted by the 4 digit PIN popularity but still groups the sequences lengths, while the other sorts by PIN popularity only. An Excel file with all this data is also included so you can sort the data however you want. The associated files are in this Google drive folder. It could be argued that Rock You users and the users of lock boxes are a completely different demographic, and it's true. However, their users will exhibit similar behaviours like using birthdays and years that make the effort worthwhile. Although the instructions for the lock suggest selecting a PIN between 4 and 7 digits long, 4 digit PINs have been targeted as they are what people tend to think of when you mention PINs, even though 5 digit PINS are more secure.

So what about those locks you see in banks and other offices that have 14 buttons.  I've done the maths so you don't have to, but they can be broken with 50316 button presses. So 4 extra buttons buys you an increase in security by a factor of about 17. Not that it matters, whenever I've seen these used, people aren't too discreet about entering the PIN.

Door Lock
PIN Door Lock

Friday, May 5, 2017

Generating Seed Points For Voronoi Stippling

I've been working on a way to generate an initial distribution of seed points for a Voronoi stippling program that I played with a while ago.  The problem is that for the final image to look good you really need a lot of points, and when you have a lot of points they are very slow to converge to a solution.  The process I used in the past is to generate a few points and allow them to be processed, split them in two, process them and so on and so on.  Each time the number of points is doubled and the solution is eventually found but it's still not great.  An improvement I thought I could try is to linearize the image via a Pseudo-Hilbert (PH) curve and apply a rolling threshold to the data and place points at locations where the counter rolls over.  You may see a major optimisation that I've know all along but haven't implemented.  I'll give you a clue, it a PH curve really necessary?  Find out at the end of the post.😉

The implementation of the threshold is a little abstract so I'll just explain the general concept. First of all we always work on an inverse grey scale version of the image as we are distributing black dots and the background is white. Imagine that the sum of all the pixels in the image is equal to 1,000,000 and you want 2000 points.  This means that each point represents 500.  So as you travel along the PH curve you add the pixel values and every time you accumulate another 500 place a point.

The points generated from this process are then iterated on via the Voronoi stippling process to produce a result. Another small feature I've added is a dithering step. For a certain number of iterations a random displacement is added to each point.  This can be seen in the GIF below as a jiggling of the points. What is the purpose of this? Imagine a jar of jelly beans that won't quite close. What do you do?  You pick the jar up and shake it to settle the contents.  That's what we're doing here.  What you end up with is a number of points that when randomly perturbed, return to their original position.  i.e. a stable arrangement.

The results in the GIF aren't that great.  The main reason for this is that I've kept the features large so you can see what's happening.  If you really wanted to process this image the best way to do it is to increase the size of the image and then process it.  When the calculated Voronoi regions are small and they only contain a few pixels, due to quantization issues the points won't move as the centroid calculations will always yield the same result.  So if you notice herding of points in the final image try enlarging the image and reprocessing it.
Distribution of Seed Stippling Points on a Pseudo-Hilbert Curve
To test the algorithm I've used an image of a plant from the paper I'm working from.  Weighted Voronoi Stippling, by Adrian Secord.  If this link becomes dead, contact me, I have a copy of the paper.
Test Image Used in Adrian Secord's Paper
In the image below I've started modestly with 4000 points to test that everything is working properly, and as you can see, everything is working.  On closer inspection though, you can see that the dynamic range, for want of a better term, of the image has been reduced.  In other words, the areas that are meant to be dark aren't as dark as they're meant to be, and in comparison, the areas that are meant to be light aren't as light.
4000 points - power 1
The dynamic range problem can be solved by pre-compensating the image.  Specifically raising each pixel to a power or exponent.  In the image below the value of each pixel has been squared before processing.  I've referred to this as a power parameter.  I've re-scaled the resulting data back to a range of 0-255 just for consistency but it's really not required.
4000 points - power 2
In the next image the number of points has been doubled and the power parameter has been pulled back slightly to 1.9 and things are starting to look even better.
8000 points - power 1.9
Let's step things up once again and go for 20000 thousand points to match the example in the paper. I've also pulled back the power parameter to 1.6. Why? It's subjective. I think it looks better.
20000 points - power 1.6

I think that looks pretty spectacular.  Of course I've had to resize the image and scale it down a bit so downloading this page isn't a 100 MB exercise, so there's a little bit of quality degradation, but the original vector files are even better.

The point of this post however is to discuss the PH curve approximation for the seed points and how close it gets us to the final image.  What does the initial distribution of seed points look like before any processing?
20000 points - power 1.6 - no processing
The structure is immediately visible, but there is a fuzziness to the image. This is more apparent when looking at an enlarged section. The points are distributed on a grid so you will see more lines and square corners in the arrangement.

Close Up Section of Seed Point Distribution
So what happens after processing?  The points are allowed to find their own space and curved features start to appear.  I can't explain the exact reasoning but in sections the points form curved lines running in multiple directions.  If you can't see what I'm talking about, go and have a look at something called a Fermat Spiral (also a way to distribute points to evenly cover an area) and you'll see what I mean.  In all things just look more pleasing to the eye.
Close Up Section of Points After Processing
I'm sorry but I'm not releasing the code on this one just yet. Why? Because it's terrible. There are very few comments and a lot of sections where code has been commented out. I know the theory is to develop in public and let the code be scrutinised, but I try to make a program so that it's not only functional but it also explains the concept I'm trying to describe to the user.  At this point I think it would just confuse people.

It's also hard to develop things cleanly when you don't exactly know where they are going to go. Unlike a lot of programming where you have a clearly defined goal and a general way to get there. I start not usually knowing either.  Think of this type of coding as a science experiment.  I'll try different things and see what the results are, change things until I get what I want and then go back and clean up a section and lock it down.

I will release this soon though.  There are couple things to fix but it's mostly there.  If for some reason I don't post the code, let me know and I'll help you out.

Here is the optimization I didn't implement.  You don't need a PH curve.  Just add a white border to the source image to make it a square so that the length of a side is a power of 2 and then use an actual Hilbert curve and crop the result.  They are way more efficient to generate than a PH curve.  I did it this way because I was interested in the problem of covering arbitrary rectangular regions with a curve.  As this blog is a labour of love I reserve the right to go off on the occasional tangent to satisfy my curiosity.

Friday, April 21, 2017

Efficient Centroid Calculation for Discrete Areas

A project I'm working on requires the repeated calculation of weighted centroids for arbitrary regions in an image.  Calculation of centroids is fairly straight forward but computationally expensive.  To calculate the x and y coordinates you first need to calculate three other values, the denominator, x-numerator and y-numerator.  The denominator is simply equal to the sum of all the values in the region, the x-numerator is the sum of all the values in the region multiplied by their x coordinate, and the y-numerator is the sum of all the values in the region multiplied by their y coordinate.  From these values the x coordinate of the centroid is equal to the the x-numerator divided by the denominator likewise for the y coordinate of the centroid.  

You can see from this description that there are a lot of time consuming multiplication operations to perform for each area that you want to calculate the centroid for.  As it turns out though, you can perform two cumulative summations along the rows of the input data as a first step, store this and then perform the centroid calculation by only accessing values at the boundary of the region. For reasons I'll explain later a column of zeros with x coordinate -1 is added to the start of the data.  A cumulative summation is performed along each row to obtain $$$P$$$. A cumulative summation along each row of $$$P$$$ is performed to generate $$$Q$$$ The following equations describe the process.  Before we get started, the derivation of these  equations can be found in this small article I wrote.
Centroid Calculations
It may not be immediately obvious what these mean so I'll give a quick explanation.  First of all each of the above calculations are done row by row and added together. This is what the summation symbol over $$$y$$$ represents.  For the denominator the value to be summed is the difference between two values on a row of $$$P$$$. For the y-numerator the calculation is the same as the denominator but each row is multiplied by the y coordinate.  The calculation of the x-numerator is a little different. It's similar in that values of $$$P$$$ at the boundary are subtracted and multiplied by $$$x+1$$$ but the additional step of subtracting the difference of the values of $$$Q$$$ at the boundaries is now added.  Maybe an example will help.

Example Calculation

In the first data set on the top left an area is marked green.  The centroid of this region is to be found. A column of zeros is added to the start of the data.  This is seen in the data set on the top right.  These are added because when performing the operation in software you end up accessing array outside of their bounds.

Typically when calculating centroids each green cell would have to be accessed for each calculation. However when using the new method described above only the cells in blue are accessed.  You may also notice that the width of these doesn't change the number of operations.  The complexity of the calculation is only dependant on the height of the region.  I've included this spreadsheet for you to play around with and get a better understanding of the process.

One last this to note is that the addition of the column of zeros while maintaining the original $$$x$$$ indices is silly as it creates a column with an index of negative one.  This is where the adjusted x index comes in.  Using this and the following adjusted equations allow the calculation to be performed easily on computers.

Let's for example calculate the centroid of the green section on only row 5.  Its $$$x$$$ bounds are $$$x_1=4$$$ and $$$x_2=7$$$, but in the above equations $$$P$$$ and $$$Q$$$ are accessed via the adjusted $$$x$$$ index at 4 and 8 (8 because of +1 on the upper index).  This means the denominator is equal to $$$(356-170) = 186$$$, the y-numerator is equal to $$$5(356-170)=930$$$, and the x-numerator is equal to $$$(8 \times 356-4 \times 170)-(1560-420)=1028$$$.  This leads to centroid coordinates of (5.526, 5). This is what you would expect as it's a single row the y value is obviously 5 and the x value is a little to the right of centre due to the increasing trend in the numbers. The centroid coordinates calculated are given in the original x index to allow accessing the original data.