Python: Help optimize this algorithm

Childs · Aug 25, 2014

As a side project I was trying to understand closed caption encoding, and wrote a simple function to encode strings into SCC compliant data. It works, but I've rewritten it twice trying to reduce it and it comes out about the same every time. Granted, I suck at developing algorithms, so no need to state it. Any suggestions are appreciated.

Requirements:

# each line must be less than 28 chars (32 - 4)
# words should not be split on two lines. If a word exceeds the number of chars, start a new line
# must be grouped in two byte hex words, separated by a space
# if only one char left in two byte hex word, use '0x80' to complete the word

So, "This is sample closed caption text." would be converted to

5468 6973 2069 7320 7361 6d70 6c65 2063 6c6f 7365 6480 13f2 13f2 6361 7074 696f 6e20 7465 7874 2e80

13f2 13f2 is the code for a new line. Here's the code:

Code:

import re, binascii

class ClosedCaption():
    
    def __init__(self):
        self.next_line = "13f2 13f2 " # Position cursor at row 13, column 04, with plain white text.
        
    def strToSCC(self, s):

        # each line must be less than 28 chars (32 - 4)
        # words should not be split on two lines
        # must be grouped in two byte hex words, separated by a space
        # if only one char left in two byte hex word, use '0x80' to complete the word
        new_str = "" # final converted string
        two_byte_count = 0 #
        print "*** two_byte_count = %d" % two_byte_count
        chars = 1
        word_index = 0
        for word in s.split():
            print 20 * "#"
            print "(+w) new word \"%s\"" % word
            word_str = ""
            
            if word_index == len(s.split()) -1:
                last_word = True
            else:
                last_word = False
            if not last_word: # increase char count to account for addition of space to each word
                word_length = len(word) + 1
                chars += 1
            else:
                word_length = len(word)
            if word_length > 27 - chars:
                print "(+l) new line"
                word_str += self.next_line
                chars = 1
                two_byte_count = 0
                print "*** two_byte_count = %d" % two_byte_count
            else:
                print "(l) word chars = %d, chars remaining = %d" % (word_length, 28 - chars)
            char_index = 0
            for c in word:
                if char_index == word_length:
                    last_char_in_word = True
                else:
                    last_char_in_word = False
                word_str += binascii.hexlify(c) # add character
                two_byte_count += 1
                if two_byte_count > 1:
                    print "(w)(c) add space \" \""
                    word_str += " "
                    two_byte_count = 0
                    print "*** two_byte_count = %d" % two_byte_count
                chars += 1
                char_index += 1
            if not last_word:
                print "(w)(e) add \"0x20\""
                word_str += binascii.hexlify(" ")
                two_byte_count += 1
            if two_byte_count > 1 and not last_word:
                    print "(w) add space \" \""
                    word_str += " "
                    two_byte_count = 0
                    print "*** two_byte_count = %d" % two_byte_count
            new_str += word_str
            print "*** increase word_index to %d" % s.split().index(word)
            print "(w) word = %s" % word_str
            word_index += 1
        if last_word: # end of word add 0x20, unless its the last word in the list
            if two_byte_count == 1:
                print "(w) add (0x80)"# add filler to unpaired character
                new_str += "80"
                two_byte_count = 0
                print "*** two_byte_count = %d" % two_byte_count
            
        print "Contents of new_str = \n%s" % new_str#re.sub("0x", "", new_str)
        for word in new_str.split(): # debug
            try:
                print binascii.unhexlify(word)
            except:
                print "%s skipped" % word
        return new_str

# start

test = ClosedCaption()

sample_str = "This is sample closed caption text."

test.strToSCC(sample_str)

Aluvus · Aug 26, 2014

Childs said:
So, "This is sample closed caption text." would be converted to

5468 6973 6973 7361 6d70 6c65 636c 6f73 6564 13f2 13f2 6361 7074 696f 6e74 6578 742e

13f2 13f2 is the code for a new line.

Your example output appears to have dropped all space characters (0x20).

Also: does the requirement that a line must be less than or equal to 28 characters still apply if that line is the last or only line (i.e. no need to reserve 4 characters for the newline)?

Aluvus · Aug 26, 2014

Here is an example of an alternate approach. Because it's late, and Python's obsession with whitespace is irritating, and because it was too annoying trying to sort out Python 2.x vs. 3.x differences, this example is in Perl.

Code:

#!/usr/bin/perl
use warnings;

sub strToSCC
{
	my $input = shift;
	# divide the input up into chunks of between 1 and 28 characters,
	# followed by either whitespace or end of input
	my @chunks = split /(.{1,28})(?:\s|$)/, $input;

	foreach my $chunk (@chunks)
	{
		# skip chunks that contain only whitespace or are empty
		next unless ($chunk =~ m/\S/);
		# convert the entire chunk to hex
		$chunk = unpack "H*", $chunk;
		# append 0x80 if we have an odd number of characters
		$chunk .= '80' if (length($chunk) % 4);
		# add this hex chunk to our array of hex chunks
		push(@hexchunks, $chunk);
	}
	
	# join chunks together, place 0x13f213f2 between them
	my $hexstring = join('13f213f2', @hexchunks);
	# pretty-print output by inserting spaces
	$hexstring = join(' ', split(/(.{4})/, $hexstring));
	# return output
	return $hexstring;
}

my $hexoutput = strToSCC('This is sample closed caption text.');

# print our hex string
print "$hexoutput\n";
# as a diagnostic, print the hex string converted to ASCII characters
my $temp = $hexoutput;
$temp =~ s/\s//g;
print pack "H*", $temp;
print "\n";

Output:

5468 6973 2069 7320 7361 6d70 6c65 2063 6c6f 7365 6480 13f2 13f2 6361 7074 696f 6e20 7465 7874 2e80
This is sample closedÇ‼≥‼≥caption text.Ç

Much of this will not be directly applicable to Python. But mostly I wanted to demonstrate that there are solutions that don't require you to grovel through character-by-character. Regular expressions can be very powerful.

Note that the output of this script will have an extra leading space at the beginning of the pretty-printed hex, which I am too tired to fix at the moment.
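
For anyone who would rather stay in Python, here is a rough sketch of the same line-first idea using the standard library's textwrap module instead of a regex. It assumes Python 3 and plain ASCII input; str_to_scc and NEW_LINE are names made up for this example, not anything from the original code.

```python
import textwrap

NEW_LINE = "13f213f2"  # new-line code from the thread, written without the space


def str_to_scc(text, width=28):
    """Hypothetical helper: wrap first, then hex-encode each line."""
    # wrap() breaks only at whitespace; break_long_words=False keeps an
    # over-long word whole instead of cutting it at the width
    lines = textwrap.wrap(text, width=width, break_long_words=False)
    hex_lines = []
    for line in lines:
        hex_line = line.encode("ascii").hex()
        if len(hex_line) % 4:  # odd number of characters on this line
            hex_line += "80"   # filler byte to complete the two-byte word
        hex_lines.append(hex_line)
    joined = NEW_LINE.join(hex_lines)
    # regroup into two-byte words separated by single spaces
    return " ".join(joined[i:i + 4] for i in range(0, len(joined), 4))


print(str_to_scc("This is sample closed caption text."))
```

Because every line is padded to a whole number of two-byte words before joining, the final regrouping never splits the new-line code across words, and there is no leading space to clean up.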

Childs · Aug 26, 2014

Aluvus said:
Your example output appears to have dropped all space characters (0x20).

Oops, I edited the code several times this morning, and forgot to update the sample output. I saw there was no reason to encode a hex string when I would just convert it to a hex int later.

Also: does the requirement that a line must be less than or equal to 28 characters still apply if that line is the last or only line (i.e. no need to reserve 4 characters for the newline)?

The newlines don't count as characters since they are not written to the buffer. It's just 28 actual characters being written to the screen.
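
That rule can be checked against an encoded string by counting only the displayed characters on each line. The sketch below assumes Python 3, that "13f2 13f2" is the only control code present, and that 0x80 only ever appears as the filler byte; displayed_line_lengths is a hypothetical name.

```python
def displayed_line_lengths(scc):
    """Hypothetical checker: count the on-screen characters of each line."""
    lengths = []
    # the new-line code separates lines and is not itself displayed
    for hex_line in scc.split("13f2 13f2"):
        data = bytes.fromhex(hex_line.replace(" ", ""))
        # assumption: 0x80 appears only as the filler for an odd character
        lengths.append(len(data.replace(b"\x80", b"")))
    return lengths


sample = ("5468 6973 2069 7320 7361 6d70 6c65 2063 6c6f 7365 6480 "
          "13f2 13f2 6361 7074 696f 6e20 7465 7874 2e80")
lengths = displayed_line_lengths(sample)
print(lengths)  # [21, 13]
print(all(n <= 28 for n in lengths))  # True
```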

Childs · Aug 26, 2014

Aluvus said:
Here is an example of an alternate approach. Because it's late, and Python's obsession with whitespace is irritating, and because it was too annoying trying to sort out Python 2.x vs. 3.x differences, this example is in Perl.

Output:

5468 6973 2069 7320 7361 6d70 6c65 2063 6c6f 7365 6480 13f2 13f2 6361 7074 696f 6e20 7465 7874 2e80
This is sample closedÇ‼≥‼≥caption text.Ç

Much of this will not be directly applicable to Python. But mostly I wanted to demonstrate that there are solutions that don't require you to grovel through character-by-character. Regular expressions can be very powerful.

Note that the output of this script will have an extra leading space at the beginning of the pretty-printed hex, which I am too tired to fix at the moment.

Bold is why I don't do perl! :biggrin: But yeah, I see how I could replicate your code. I don't know why I didn't think to just divide the string into lines from the beginning. I was splitting the string into words, which removed the spaces, so I needed to add each space back when I evaluated the word, but also add a space to the string for formatting that's separate from the space I had just added back. And then that friction padding character! D: I ran into so many edge cases depending on the string I tested with that my brain exploded a couple of times. I was actually working on this during the earthquake and thought I was having a stroke or something.

Anyways, thanks a lot! I'll update the post with the new code tomorrow morning.

Sequences · Aug 26, 2014

Childs said:
Bold is why I don't do perl! :biggrin:

Regex is everywhere, not just in perl. I'd recommend using it more.

Childs · Aug 26, 2014

Sequences said:
Regex is everywhere, not just in perl. I'd recommend using it more.

Yeah, I know. I spend a lot of time trying to work around it. I should just bite the bullet and do a bunch of exercises.

Childs · Aug 26, 2014

Aluvus said:
Much of this will not be directly applicable to Python. But mostly I wanted to demonstrate that there are solutions that don't require you to grovel through character-by-character. Regular expressions can be very powerful.

OK, converted to Python:

Code:

import re, binascii

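def strToSCC(s):
    # split into chunks of 1-28 chars that end at whitespace or end of input;
    # re.split also returns empty strings between matches, so drop those
    lines = [item for item in re.split(r"(.{1,28})(?:\s|$)", s) if item != ""]
    hex_list = []
    for item in lines:
        hex_str = binascii.hexlify(item)
        # pad with 0x80 when the line has an odd number of characters
        if len(hex_str) % 4:
            hex_str += '80'
        hex_list.append(hex_str)
    # 13f2 13f2 is the new-line code placed between lines
    new_str = '13f213f2'.join(hex_list)
    # group into two-byte words separated by single spaces (no leading space)
    return ' '.join(re.findall(r".{1,4}", new_str))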
    
string = 'This is sample closed caption text.'

print "encoded string:\n%s" % strToSCC(string)

print "utf-8 string:\n%s" % re.sub(' ', '', strToSCC(string)).decode('utf-8').decode('hex')

Not quite as good as yours, but good enough. Thanks again! I don't know what the regex pattern means, but I'll figure it out next.
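
Since the Python 2.x vs. 3.x differences came up earlier in the thread, here is a sketch of how the same function might look under Python 3. The changes are the usual ones: hexlify wants bytes, print is a function, and .decode('hex') is gone. The name str_to_scc and the ASCII-only assumption are mine.

```python
import binascii
import re


def str_to_scc(text):
    # same split as the Python 2 version: chunks of 1-28 characters that are
    # followed by whitespace or the end of the input; drop the empty strings
    # that re.split leaves between matches
    lines = [item for item in re.split(r"(.{1,28})(?:\s|$)", text) if item]
    hex_list = []
    for item in lines:
        # Python 3: hexlify takes bytes and returns bytes
        hex_str = binascii.hexlify(item.encode("ascii")).decode("ascii")
        if len(hex_str) % 4:
            hex_str += "80"
        hex_list.append(hex_str)
    new_str = "13f213f2".join(hex_list)
    # two-byte words separated by single spaces
    return " ".join(re.findall(r".{1,4}", new_str))


encoded = str_to_scc("This is sample closed caption text.")
print("encoded string:\n%s" % encoded)
# bytes.fromhex replaces .decode('hex'); latin-1 is used only so the 0x80
# and 0xf2 bytes can be shown, it is not part of the SCC format
raw = bytes.fromhex(encoded.replace(" ", ""))
print("decoded bytes:\n%r" % raw.decode("latin-1"))
```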

Childs · Aug 26, 2014

So to be sure I have this right:

Code:

"(.{1,28})(?:\s|$)"

basically means match any character from 1-28, then any character to any whitespace char or EOF? What I don't get is what

Code:

(?:\s|$)

is for. The logical group

Code:

(.{1,28})

will just split the string into as many 28 character strings as available, but what's the need for the second logical group? The code seems to work without it.

EDIT: I see there actually is a difference. Without it letters can be chopped off from words during the split. It must be taking the first group as an option to the second group to ensure there is a space at the end of the string of characters.
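
A quick way to see that difference is to run both patterns on the sample string. This is only an illustration in Python 3; it uses re.findall rather than re.split so that just the chunks are printed.

```python
import re

text = "This is sample closed caption text."

# without the trailing group: fixed 28-character slices, so words get cut
sliced = re.findall(r".{1,28}", text)
print(sliced)   # ['This is sample closed captio', 'n text.']

# with it: each chunk must be followed by whitespace or the end of input,
# so the greedy match backs off to the last word boundary that fits
wrapped = re.findall(r"(.{1,28})(?:\s|$)", text)
print(wrapped)  # ['This is sample closed', 'caption text.']
```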

EDIT2:

Code:

?:

is called a non capturing group. So I guess it's looking at the first 28 characters, then within that group finding the last space or EOL char. If it doesn't find it, it moves to the right a character at a time until it finds it, then splits it into one 28 char chunk, with the remainder in the first cell of the array. Then repeat for the rest of the string. Although the results seem strange if it goes too far without getting the delimiter, as the first item in the array can have way more than 28 chars.
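
The strange first item can be reproduced with a single word longer than 28 characters. The sketch below (Python 3, made-up input) shows that re.split hands back whatever text the search had to skip before it found a match, and adds a hypothetical pre-check that flags such words before encoding.

```python
import re

pattern = r"(.{1,28})(?:\s|$)"
text = "x" * 30 + " ok"  # one 30-character word, longer than a line

parts = re.split(pattern, text)
print(parts)
# the search cannot match at the first two x's, so split() returns them
# as unmatched text; the remaining 28 x's become the first captured chunk

# hypothetical guard: find words the pattern cannot place on one line
too_long = [w for w in text.split() if len(w) > 28]
print(too_long)
```

With a longer word the skipped prefix grows with it, which is why the first item can end up well past 28 characters.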
