Manipulating Text

What will we cover?

Handling text is one of the most common things that programmers do. As a result there are lots of specific tools in most programming languages to make this easier. In this section we will look at some of these and how we might use them in performing typical programming tasks.

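Some of the most common tasks that we can do when working with text are splitting a string into its parts, counting words, searching for text, replacing text and changing the case of characters.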

We will look at how to do each of these tasks using Python and then briefly consider how VBScript and JavaScript handle text processing.

In Python we use string methods to manipulate text strings. You might recall, from the Raw Materials topic, that methods are like functions attached to data. We can access the methods using the same dot notation we use to access functions in a module, but instead of using a module name we use the data itself. Let's see how that works.

Splitting strings

The first task we consider is how to split a string into its constituent parts. This is often necessary when processing files since we tend to read a file line by line, but the data may well be contained within segments of the line. An example of this is our Address Book example, where we might want to access the individual fields of the entries rather than just print the whole entry.

The Python method we use for this is called split() and it is used like this:

>>> aString = "Here is a (short) String"
>>> print( aString.split() )
['Here', 'is', 'a', '(short)', 'String']

Notice we get a list back containing the words within aString with all the spaces removed. The default separator for ''.split() is whitespace (i.e. tabs, newlines and spaces). Let's try using it again but with an opening parenthesis as the separator:

>>> print( aString.split('(') )
['Here is a ', 'short) String']

Notice the difference? There are only two elements in the list this time and the opening parenthesis has been removed from the front of 'short)'. That's an important point to note about ''.split(), that it removes the separator characters. Usually that's what we want, but just occasionally we'll wish it hadn't!
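As a minimal sketch of the Address Book case mentioned above, suppose each entry is stored as one comma-separated line. The field layout and the sample entry are invented for illustration; they are not the tutorial's actual file format.

```python
# Hypothetical address book entry: name, street, town, phone
entry = "Fred Bloggs, 12 High Street, Anytown, 555-1234\n"

fields = []
for field in entry.split(','):
    # split() removed the commas; strip() removes the leftover
    # spaces and the trailing newline from each field
    fields.append(field.strip())

name, street, town, phone = fields
print( name )
print( phone )
```

Because ''.split() discards the separator, the commas never appear in the fields, which is exactly what we want here.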

There is also a ''.join() method which can take a list (or indeed any other kind of sequence) of strings and join them together. One confusing feature of ''.join() is that it uses the string on which we call the method as the joining characters. You'll see what I mean from this example:

>>> lst = ['here','is','a','list','of','words']
>>> print( '-+-'.join(lst) )
here-+-is-+-a-+-list-+-of-+-words
>>> print( ' '.join(lst) )
here is a list of words

It sort of makes sense when you think about it, but it does look weird when you first see it. (It's also, rather confusingly, the opposite of JavaScript, which has join as a method of an array and the joining string as a parameter!)
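One thing to watch for is that ''.join() only accepts strings, so other values have to be converted first. The short sketch below (the variable names are purely illustrative) shows a split and join round trip followed by the conversion step.

```python
words = "here is a list of words".split()
csv_line = ','.join(words)
print( csv_line )
print( csv_line.split(',') == words )  # split() undoes join()

numbers = [1, 2, 3]
# ','.join(numbers) would raise a TypeError because the items are not strings,
# so convert each item with str() first
number_line = ','.join(str(n) for n in numbers)
print( number_line )
```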

Counting words

Let's revisit that word counting program I mentioned in the functions topic. Recall the pseudocode looked like:

def numwords(aString):
    list = split(aString) # list with each element a word
    return len(list) # return number of elements in list

for line in file:
    total = total + numwords(line) # accumulate totals for each line
print( "File had %d words" % total )

Now we know how to get the lines from the file let's consider the body of the numwords() function. First we want to create a list of words in a line. That's nothing more than applying the default ''.split() method. Referring to the Python documentation we find that the builtin function len() returns the number of elements in a list, which in our case should be the number of words in the string - exactly what we want.

So the final code looks like:

def numwords(aString):
    lst = aString.split() # split() is a method of the string object aString
    return len(lst)       # return number of elements in the list

with open("menu.txt","r") as inp:
   total = 0  # initialize to zero; also creates variable
   for line in inp:
      total += numwords(line)  # accumulate totals for each line
print( "File had %d words" % total )

That's not quite right of course because it counts things like an ampersand character as a word (although maybe you think it should...). Also, it can only be used on a single file (menu.txt). But it's not too hard to convert it to read the filename from the command line (sys.argv[1]) or via input() as we saw in the Talking to the user section. I'll leave that as an exercise for the reader.
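One possible way to deal with the ampersand problem, sketched here under the assumption that a "word" must contain at least one letter or digit, is to test each chunk before counting it. The function name numrealwords is made up for this example, and isalnum() is a string test method of the same family as those covered under Changing the case of characters below.

```python
def numrealwords(aString):
    count = 0
    for chunk in aString.split():
        # any() is True if at least one character is a letter or digit
        if any(ch.isalnum() for ch in chunk):
            count += 1
    return count

print( numrealwords("Fish & chips, please") )
```

Here the lone ampersand is skipped, while 'chips,' still counts as a word despite its trailing comma.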

Searching Text

The next common operation we will look at is searching for a sub-string within a longer string. This is again supported by a Python string method, this time called ''.find(). Its basic use is quite simple: you provide a search string and, if Python finds it within the main string, it returns the index of the first character of the substring. If it doesn't find it, it returns -1:

>>> aString = "here is a long string with a substring inside it"
>>> print( aString.find('long') )
10
>>> print( aString.find('oxen') )
-1
>>> print( aString.find('string') )
15

The first two examples are straightforward, the first returns the index of the start of 'long' and the second returns -1 because 'oxen' does not occur inside aString. The third example throws up an interesting point, namely that find only locates the first occurrence of the search string, but what do we do if the search string occurs more than once in the original string?

One option is to use the index of the first occurrence to chop the original string into two pieces and search again. We keep doing this until we get a -1 result. Like this:

aString = "Bow wow says the dog, how many ow's are in this string?"
temp = aString[:] # use slice to make a copy
count = 0
index = temp.find('ow')
while index != -1:
    count += 1
    temp = temp[index + 1:]  # use slicing
    index = temp.find('ow')
print( "We found %d occurrences of 'ow' in %s" % (count, aString) )

Here we just counted occurrences, but we could just as well have collected the index results into a list for later processing.

The find() method can speed this process up a little by using one of its extra optional parameters, namely a start location within the original string:

aString = "Bow wow says the dog, how many ow's are in this string?"
count = 0
index = aString.find('ow')  # use default start
while index != -1:
    count += 1
    index = aString.find('ow', index+1)  # set new start
print( "We found %d occurrences of 'ow' in %s" % (count, aString) )

This solution removes the need to create a new string each time, which can be a slow process if the string is long. Also, if we know that the substring will definitely only be within the first so many characters (or we aren't interested in later occurrences) we can specify both a start and stop value, like this:

>>> # limit search to the first 20 chars
>>> aString = "Bow wow says the dog, how many ow's are in the string?"
>>> print( aString.find('the',0,20) )
13
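Returning to the idea of collecting the index results into a list rather than just counting them, here is one way it might look using the start parameter. The function name find_all is an invented helper, not a built-in string method.

```python
def find_all(aString, sub):
    """Return a list of every index at which sub starts within aString."""
    positions = []
    index = aString.find(sub)
    while index != -1:
        positions.append(index)
        index = aString.find(sub, index + 1)  # resume just past the last hit
    return positions

aString = "Bow wow says the dog, how many ow's are in this string?"
print( find_all(aString, 'ow') )
```

Because the search resumes at index + 1, overlapping matches are found too: searching 'aaa' for 'aa' gives two hits.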

To complete our discussion of searching there are a couple of nice extra methods that Python provides to cater for common search situations, namely ''.startswith() and ''.endswith(). From the names alone you probably can guess what these do. They return True or False depending on whether the original string starts with or ends with the given search string, like this:

>>> print( "Python rocks!".startswith("Perl") )
False
>>> print( "Python rocks!".startswith('Python') )
True
>>> print( "Python rocks!".endswith('sucks!') )
False
>>> print( "Python rocks!".endswith('cks!') )
True

Notice the boolean result. After all, you already know where to look if the answer is True! Also notice that the search string doesn't need to be a complete word; a substring is fine. You can also provide a start and stop position within the string, just like ''.find(), to effectively test for a string at any given location within a string. (This latter feature is not one that is used much in practice.)
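For completeness, here is a small sketch of the start and stop positions in action, followed by a more typical use: picking out file names by their ending. The list of file names is invented for the example.

```python
line = "Python rocks!"
at_seven = line.startswith("rocks", 7)     # test begins at index 7
at_zero = line.startswith("rocks")         # test begins at index 0
in_slice = line.endswith("Python", 0, 6)   # only line[0:6] is tested
print( at_seven, at_zero, in_slice )

# Hypothetical file names, filtered by extension
filenames = ["menu.txt", "notes.doc", "readme.txt"]
text_files = []
for name in filenames:
    if name.endswith(".txt"):
        text_files.append(name)
print( text_files )
```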

And finally, for a simple test of whether a substring exists anywhere within another string you can use the Python in operator, like this:

>>> if 'foo' in 'foobar': print( 'True' )
True
>>> if 'baz' in 'foobar': print( 'True' )
>>> if 'bar' in 'foobar': print( 'True' )
True

That's all I'll say about searching for now. Let's look at how to replace text next.

Replacing text

Having found our text we often want to change it to something else. Again the Python string methods provide a solution with the ''.replace() method. It takes two arguments: a search string and a replacement string. The return value is the new string as a result of the replacement.

>>> aString = "Mary had a little lamb, its fleece was dirty!"
>>> print( aString.replace('dirty','white') )
Mary had a little lamb, its fleece was white!

One interesting difference between ''.find() and ''.replace() is that replace, by default, replaces all occurrences of the search string, not just the first. An optional count argument can limit the number of replacements:

>>> aString = "Bow wow wow said the little dog"
>>> print( aString.replace('ow','ark') )
Bark wark wark said the little dog
>>> print( aString.replace('ow','ark',1) ) # only one
Bark wow wow said the little dog
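Since ''.replace() returns a new string, the original is left untouched and the calls can be chained. A brief sketch of both points (the replacement words are chosen arbitrarily):

```python
aString = "Bow wow wow said the little dog"
newString = aString.replace('ow', 'ark')
print( newString )
print( aString )    # the original string is unchanged

# each replace() returns a string, so another replace() can follow directly
chained = aString.replace('Bow', 'Woof').replace('little', 'big')
print( chained )
```

To keep the change, assign the result to a variable, which may be the same one: aString = aString.replace('ow', 'ark').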

It is possible to do much more sophisticated search and replace operations using something called a regular expression, but they are much more complex and get a whole topic to themselves in the "Advanced" section of the tutorial.

Changing the case of characters

One final thing to consider is converting case from lower to upper and vice-versa. This isn't such a common operation but Python does provide some helper methods to do it for us:

>>> print( "MIXed Case".lower() )
mixed case
>>> print( "MIXed Case".upper() )
MIXED CASE
>>> print( "MIXed Case".swapcase() )
mixED cASE
>>> print( "MIXed Case".capitalize() )
Mixed case
>>> print( 'MIXed Case'.title() )
Mixed Case
>>> print( "TEST".isupper() )
True
>>> print( "TEST".islower() )
False

Note that ''.capitalize() capitalizes only the first letter of the string and lower-cases the rest; it does not capitalize each word within it. That's ''.title()'s job! Also note the two test functions (or predicates) ''.isupper() and ''.islower(). Python provides a whole bunch of these predicate functions for testing strings; other useful tests include ''.isdigit(), ''.isalpha() and ''.isspace(). The last checks for all kinds of whitespace, not just literal space characters!
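As a rough illustration of how these predicates might be used together, the sketch below labels a few sample strings. The function name describe and the sample values are invented for this example.

```python
def describe(token):
    # return a short label for the kind of characters in token
    if token.isdigit():
        return "digits"
    elif token.isalpha():
        return "letters"
    elif token.isspace():
        return "whitespace"
    else:
        return "mixed or empty"

for token in ["42", "lamb", " \t\n", "lamb42", ""]:
    print( repr(token), "->", describe(token) )
```

Note that all three predicates return False for an empty string, which is why it falls through to the final branch.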

We will be using many of these string methods as we progress through the tutorial, and in particular the Grammar Counter case study uses several of them.

Text handling in VBScript

Because VBScript descends from BASIC it has a wealth of builtin string handling functions. In fact in the reference documentation I counted at least 20 functions or methods, not counting those that are simply there to handle Unicode characters.

What this means is that we can pretty much do all the things we did in Python using VBScript too. I'll quickly run through the options below:

Splitting text

We start with the Split function:

<script type="text/vbscript">
Dim s
Dim lst
s = "Here is a string of words"
lst = Split(s) ' returns an array
MsgBox lst(1)
</script>

As with Python you can add a separator value if the default separator (a single space character) isn't what you need.

Also as with Python there is a Join function for reversing the process.

Searching for and replacing text

Searching is done with InStr, short for "In String", obviously.

<script type="text/vbscript">
Dim s,n
s = "Here is a long string of text"
n = InStr(s, "long")
MsgBox "long is found at position: " & CStr(n)
</script>

The return value is normally the position within the original string at which the substring starts. If the substring is not found then zero is returned (this isn't a problem because VBScript starts its string positions at 1, so zero is not a valid position). If either string is Null, Null is returned, which makes testing error conditions slightly more tricky, with a combined test required.

As with Python we can specify a sub range of the original string to search, using a start value, like this:

<script type="text/vbscript">
Dim s,n
s = "Here is a long string of text"
n = InStr(6, s, "long") ' start at position 6
If IsNull(n) Or n = 0 Then ' use IsNull: the test n = Null is never True
   MsgBox "long was not found"
Else
   MsgBox "long is found at position: " & CStr(n)
End If
</script>

Unlike Python, we can also specify whether the search should be case-sensitive or not; the default is case-sensitive.
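Python's ''.find() has no such option, but a similar effect can be approximated by converting both strings to the same case before searching. The sketch below assumes plain English text, where lower() does not change the length of the string; the function name find_nocase is invented here.

```python
def find_nocase(aString, sub, start=0):
    # search lower-case copies of both strings; for plain English text
    # the index found also applies to the original string
    return aString.lower().find(sub.lower(), start)

s = "Here is a long string of text"
print( find_nocase(s, "LONG") )
print( s.find("LONG") )   # the ordinary search is case-sensitive
```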

Replacing text is done with the Replace function. Like this:

<script type="text/vbscript">
Dim s
s = "The quick yellow fox jumped over the log"
MsgBox Replace(s, "yellow", "brown")
</script>

We can provide an optional count argument specifying how many occurrences of the search string should be replaced; the default is all of them. We can also specify a start position, as for InStr above.

Changing case

Changing case in VBScript is done with UCase and LCase; there is no equivalent of Python's capitalize or title methods.

<script type="text/vbscript">
Dim s
s = "MIXed Case"
MsgBox LCase(s)
MsgBox UCase(s)
</script>

And that's all I'm going to cover in this tutorial. If you want to find out more, check the VBScript help file for the list of functions.

Text handling in JavaScript

JavaScript is the least well equipped for text handling of our three languages. Even so, the basic operations are catered for to some degree; it is only in the number of "bells & whistles" that JavaScript suffers in comparison to VBScript and Python. JavaScript compensates somewhat for its limitations with strong support for regular expressions (which we cover in a later topic) and these extend the apparently primitive functions quite significantly, but at the expense of some added complexity.

Like Python, JavaScript takes an object oriented approach to string manipulation, with all the work being done by methods of the String class.

Splitting Text

Splitting text is done using the split method:

<script type="text/javascript">
var aList, aString = "Here is a short string";
aList = aString.split(" ");
</script>

Notice that we have to provide the separator ourselves; there is no whitespace default, and calling split() with no argument simply returns the whole string as a single array element. The separator can also be a regular expression, so quite sophisticated split operations are possible.

As mentioned above, joining text is done with the join method of an array. So to reverse the split above we would do this:

  aList.join(" ")  // join array items separated by space

Searching Text

Searching for text in JavaScript is done via the search() method:

<script type="text/javascript">
var aString = "Round and Round the ragged rock ran a rascal";
document.write( "ragged is at position: " + aString.search("ragged"));
</script>

Once again the search string argument is actually a regular expression so the searches can be very sophisticated indeed. Notice, however, that there is no way to restrict the range of the original string that is searched by passing a start position (although this can also be simulated using regular expression tricks).

JavaScript provides another search operation with slightly different behaviour called match(). I don't cover the use of match here.

Replacing Text

To do a replace operation we use the replace() method.

<script type="text/javascript">
var aString = "Humpty Dumpty sat on a cat";
document.write(aString.replace("cat", "wall"));
</script>

And once again the search string can be a regular expression; you can begin to see the pattern, I suspect! Note that when the search argument is a plain string, replace() changes only the first occurrence. To replace every occurrence, use a regular expression with the g (global) flag, such as /cat/g.

Changing case

Changing case is performed by two methods: toLowerCase() and toUpperCase().

<script type="text/javascript">
var aString = "This string has Mixed Case";
document.write(aString.toLowerCase()+ "<BR>");
document.write(aString.toUpperCase()+ "<BR>");
</script>

There is very little to say about this pair; they do a simple job simply. JavaScript, unlike the other languages we consider, provides a wealth of special text functions for processing HTML, revealing its roots as a web programming language. We don't consider these here but they are all described in the standard documentation.

That concludes our look at text handling. Hopefully it has given you the tools you need to process any text you encounter in your own projects. One final word of advice: always check the documentation for your language when processing text, because there are often powerful tools included for this most fundamental of programming tasks.

Things to remember
