Address Search OS OpenNames with PostGIS, SQLAlchemy and Python – PART 2

Part 1 of this post outlined how to configure a PostGIS database to allow us to run Full Text searches against the OS OpenNames dataset.

In Part 2 we look at writing a simple Python 3 CLI app that will show you how easy it is to integrate this powerful functionality into your apps and APIs. Other than Python, the only dependency we need is the SQLAlchemy ORM, which lets our app communicate with Postgres.


Installing SQLAlchemy

SQLAlchemy can be installed using pip. It depends on psycopg2, which can be frustrating to install on a Mac without Postgres already present (solutions can be found on Stack Overflow).
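If you're starting from scratch, something along these lines should do it (both packages are published on PyPI; psycopg2 needs the Postgres client libraries available in order to build):

pip install sqlalchemy psycopg2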

A simple address search CLI


import argparse

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.dialects.postgresql import TSVECTOR
from sqlalchemy.orm import sessionmaker

# Create DB session
engine = create_engine('postgresql://iain:password@localhost:5432/Real-World')
Session = sessionmaker(bind=engine)
session = Session()

Base = declarative_base()


class OpenNames(Base):
    __tablename__ = 'open_names'

    # Map the DB columns we're interested in
    ogc_fid = Column(Integer, primary_key=True)
    text = Column(String)
    textsearchable = Column(TSVECTOR)

    def search_address(self, search_for: str):
        print(search_for)
        # Put the OR operator between every word searched
        or_search = search_for.replace(' ', ' | ')
        results = session.query(OpenNames.text).filter(
            OpenNames.textsearchable.match(or_search, postgresql_regconfig='english'))
        for result in results:
            print(result.text)


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('address', help='Address you want to search for')
    args = parser.parse_args()

    open_names = OpenNames()
    open_names.search_address(args.address)

Let me draw your attention to…

Hopefully this script is fairly easy to follow, but there are a few things worth drawing your attention to:

  • Note we have to import TSVECTOR from sqlalchemy.dialects.postgresql, i.e. tell SQLAlchemy we're using the Postgres dialect, so it understands the TSVECTOR column type.
  • The engine, Session and Base lines are simply SQLAlchemy boilerplate that sets up our connection and session for the app. You'll need to swap out the connection details for your own.
  • In the OpenNames model I've chosen to map only 3 columns; you'll probably want to map more.
  • The or_search line is very important: here we join every word the user has supplied with the OR operator, meaning we return addresses that contain any of the search terms. You could extend this to let the user ask for an exact match and change this to an AND (&) search, as the sketch after this list shows.
  • Finally, note we ask SQLAlchemy to match() our search, and importantly we must supply the postgresql_regconfig parameter to say we're searching in English. This is vital or you won't get the matches you expect.
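As a rough sketch of that exact-match extension (the exact_match flag is my own addition, not part of the original script; it reuses the session and OpenNames model above):

    def search_address(self, search_for: str, exact_match: bool = False):
        # '&' requires every word to be present; '|' matches any of them
        operator = ' & ' if exact_match else ' | '
        ts_query = search_for.strip().replace(' ', operator)
        return session.query(OpenNames.text).filter(
            OpenNames.textsearchable.match(ts_query, postgresql_regconfig='english'))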

Running our app

We can run our app from the command line by entering the following:

python address_search.py 'forth street'

And we see our app print out all matching addresses that contain either Forth or Street 🙂

Ends

Hopefully you can see how easy it would be to take the above code and integrate it into your apps and APIs. I hope you've found these tutorials useful. Happy text searching.


Address Search OS OpenNames with PostGIS, SQLAlchemy and Python – PART 1

In this two part post we'll look at implementing an address search using the Ordnance Survey Open Names dataset. We'll use the power of Postgres with the PostGIS extension, leveraging its built-in Full Text Search, and use Python and the SQLAlchemy ORM to create a simple CLI.


Part 1 – Data Load and DB Config

Address Data

The UK is very badly served for free address data. The best we have is the Ordnance Survey OpenNames dataset. It will work as a postcode lookup or a street finder (at a push), but the dataset would require a lot of additional processing to be a useful address search. OS really want you to purchase AddressBase.

That said, OpenNames will suffice for this example and it should be easy to extend the example to a fuller dataset if you’re lucky enough to have one.

Loading Data to PostGIS

You can download OpenNames as either CSV or GML. I'd recommend GML, as it's simpler to load into PostGIS using OGR2OGR.

Once you unzip the archive you’ll see that the files are referenced according to the British National Grid, so you can load as much or as little as you want.

We’ll load NS68 which contains addresses in my home town of Stirling, as follows (swap out the values for your db):

ogr2ogr -f PostgreSQL PG:"host=localhost dbname=Real-World port=5432 user=iain password=password" NS68.gml -progress -nln open_names --config PG_USE_COPY YES 

You should now have a new table called open_names containing the addressing info.

Note that if you want to load more GML files, just use the -append flag:

ogr2ogr -f PostgreSQL PG:"host=localhost dbname=Real-World port=5432 user=iain password=password" NS88.gml -append -progress -nln open_names --config PG_USE_COPY YES 

Setting up Full Text Search

We now have our open_names table, but no text search column.  So we can add a textsearchable column which must be of type TSVECTOR as follows:

ALTER TABLE open_names ADD COLUMN textsearchable TSVECTOR;

We can populate the column using the built-in TO_TSVECTOR function, which tokenises the words based on the supplied config ('english' in our case; multiple configs are supported).

UPDATE open_names SET textsearchable = TO_TSVECTOR('english', text || ' ' || localid);

If you look at the data in your new column you’ll see that it now contains text tokens representing the address data.
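If you want to eyeball a few rows, a quick query along these lines (the LIMIT is arbitrary) will show the generated lexemes alongside the original text:

SELECT text, textsearchable FROM open_names LIMIT 10;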

Increase accuracy by concatenating multiple columns

Note that we’re concatenating 2 columns together in this update statement – text and localid.  In our case the reason for doing this is that the postcode in the localid column is stored without a space, meaning our search will return a result if the user enters a postcode without a space.

However, it should be clear that with better address data we could concatenate multiple columns, meaning that if a user searched for “1 Main St, Stirling, FK3 4GG” we would be able to return an accurate match.
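As a sketch of what that might look like (street, town and postcode are hypothetical column names, not columns in OpenNames; COALESCE stops a NULL column wiping out the whole concatenation):

UPDATE open_names
SET textsearchable = TO_TSVECTOR('english',
    COALESCE(street, '') || ' ' || COALESCE(town, '') || ' ' || COALESCE(postcode, ''));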

Add an Index for faster searching

Now that we have data set up we can add an index to our new column which will ensure searches are fast:

CREATE INDEX textsearch_idx ON open_names USING GIN (textsearchable);
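If you want to check that the index is actually being used, you can inspect the query plan (note that on a very small sample Postgres may still choose a sequential scan):

EXPLAIN ANALYZE SELECT * FROM open_names WHERE textsearchable @@ TO_TSQUERY('english', 'avenue');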

Let’s do some searches

Now let's query our new column to see if we can find some matches using the TO_TSQUERY function:

SELECT COUNT(1) FROM open_names WHERE textsearchable @@ TO_TSQUERY('english', 'avenue');

Here we find we have 41 streets in the Stirling area containing the word avenue. You'll note that I don't need to worry about lowercase, uppercase or where the word might appear in the string. Full text search takes care of that for me 🙂

The @@ operator is the match operator: it returns true where the tsvector column matches the supplied tsquery.

Using AND and OR for better matches

A very powerful feature of Postgres' Full Text Search is the ability to find matches containing all or some of the words in the query, using the AND (&) operator or the OR (|) operator, as these examples show:

select * from open_names where textsearchable @@ to_tsquery('english', 'forth & view');

Here we return only one result, Forth View, which contains both Forth and View. If we change this to an OR search:

select * from open_names where textsearchable @@ to_tsquery('english', 'forth | view');

We get 7 results, including Forth View, Bruce View and Forth Place.

Again it should be easy to see how powerful text searches could be built for complex text documents.

A final note on Triggers

While our address data is fairly static, if you had a table where users were regularly editing address data, or any other columns you wanted to run a full text search on, you should consider adding a trigger to keep the TSVECTOR column up to date, as outlined here.

So for our example the trigger would look like:

CREATE TRIGGER tsvectorupdate BEFORE INSERT OR UPDATE
ON open_names FOR EACH ROW EXECUTE PROCEDURE
tsvector_update_trigger(textsearchable, 'pg_catalog.english', localid, text);

Up Next

Hopefully Part 1 has demonstrated how it is very easy to set up powerful text searching in Postgres.  In Part 2 we’ll look at how we can use Python and SQLAlchemy to allow you to integrate this functionality into your apps and APIs.

 

HOWTO Install Python3, pip3 & Tornado on Mac

I recently needed to install Python3 on my Mac.  While the bearded Linux masses just seem to know this stuff, or it’s already part of their distro, in Mac-Land by default we’re stuck on Python 2.7.2 and guidance is lacking.

So to save people doing the digging I had to do, here's a quick HOWTO on installing Python3 on your Mac. I like understanding how things work, so this post also details where things are installed.

Install Latest Python 3

Download the latest Python3 installer (v3.3.0 at time of writing), taking care to get the version appropriate for your OS X version:

Download Python3 here

Follow the on-screen instructions and Python3 should be installed into /Library/Frameworks/Python.framework/Versions/3.3. The installer will also create symlinks for python3 in /usr/local/bin, which makes python3 available from the command line.

Check it works

From the command line type:


python3

You should see the following:


Python 3.3.0 (v3.3.0:bd8afb90ebf2, Sep 29 2012, 01:25:11)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

Install pip3

Like all good languages Python has a package manager.  Python’s is called pip.  Pip can be used to install packages into the Python framework so they can be used in your programs.

Something that is not at all obvious to the uninitiated is that to use pip with Python3, you need to run the various install scripts against Python3, otherwise everything just installs into the Python 2.7 directory (this is the voice of bitter experience speaking).

To work with pip3 we first need to install distribute_setup.py. As far as I understand it, distribute_setup.py parses the setup.py script in the Python package and ensures everything is compatible with Python3 (correct me if I'm wrong, Python community 🙂).

Download and run the script as follows. You'll need to use sudo on a Mac, as the script needs privileges to write into the /Library directory:


curl -O http://python-distribute.org/distribute_setup.py
sudo python3 distribute_setup.py

Now you should be able to install pip. Again you'll need to run the script with python3 (I'm not sure sudo is definitely required this time):


curl -O https://raw.github.com/pypa/pip/master/contrib/get-pip.py
sudo python3 get-pip.py

pip should now be successfully installed. You might be thinking you're home and dry; however, if you type pip on the command line you'll probably get Command not found. You're going to have to alter your PATH variable to make pip3 available, so add the following line to your .bash_profile:


export PATH=/Library/Frameworks/Python.framework/Versions/3.3/bin:$PATH


Restart your terminal and check it's working by running the following, checking that (Python 3.3) is appended at the end of the output:


pip --version

If you have an older version of pip installed you should still be able to use it by entering pip-2.7.

Install Tornado webserver using pip3

So now we can give our new pip a test drive by installing the Tornado web server. Again on Mac we appear to need to use sudo, otherwise strange errors occur:


sudo pip install tornado

You should see Tornado being installed successfully into Python3's site-packages dir: /Library/Frameworks/Python.framework/Versions/3.3/lib/python3.3/site-packages

So now we can use Python3 to run our Tornado hello world app and see it running in the browser:


import tornado.ioloop
import tornado.web


class MainHandler(tornado.web.RequestHandler):
    # Respond to GET requests on the mapped route
    def get(self):
        self.write("Hello, Python3")


# Map the root URL to our handler
application = tornado.web.Application([
    (r"/", MainHandler),
])

if __name__ == "__main__":
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()

So we can run our hello-world.py using Python3, as shown below, and we should see it running successfully in the browser at http://localhost:8888.
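Assuming the script above is saved as hello-world.py:

python3 hello-world.py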

Good luck