Starting with Spatial Data – Enriching your dataset

The biggest issue when you start working with spatial data, is asking yourself the question: “What do I want to visualize?”. Do you want to visualize the areas your clients come from? Or do you want to calculate in which areas your new potential clients live? The next hurdle to take, is thinking about how to enrich the data you have now.

The easy way out
Almost every company has a accessible database with client information. The easy way out is to download a free spatial data set (like described in an earlier post), and visualize your data by joining your data set on the downloaded data set, based on country-, province-, or city name. This way, you don’t alter your data, but you’re still able to visualize your data.

Enriching your data
The way to go (in my opinion), is to permanently enrich your data. In that case, you’re more flexible in visualizing your data. Another advantage is that you can always group your data in the same way as mentioned above. If you want to group it by country, province or city, you can still do it. But remember: you can always group your data, but mining your data is difficult without accurate coordinates. And there is a free option for enriching your data, if you don’t want to spend a lot of money and time on it.

Last year, I spend a lot of time working with spatial data. One of the biggest brainteasers was how to enrich our data set, without actually disrupting the normal running processes. You don’t want to change your underlying data set, and then find out you need to rewrite all your software.

No seriously, it’s free…
One of the options I thought of, was converting a postal code, or zip code for my American friends (Seriously USA?! Do you have a different word for EVERYTHING?! :)). The conversion can’t be done by number crunching, but needs to be done by looking it up. One of the options is to look up every postal code by hand. But if you have a few million addresses in your database, that’s not really an option.

One of the options I came across was the free Google Geocoding API. This allows you to translate a postal code into a Latitude and Longitude. And these coordinates can be plotted on a map.

And if you have a Latitude and Longitude, you can use Reverse Geocoding to translate your coordinates into a postal code.

A quick example of this, is a tool I’ve written. It translates a postal code into Lat/Long:

Or translates a Lat/Long into postal code, using the Reverse GeoCoding:

I know this isn’t a pretty tool yet, but I’m still working on it. Until now, it was only a trial to see if the Google API returns the data I need. This tool is also still based on the 2.0 API, which is deprecated since March 8, 2010. It will still work until March 8, 2013, but I need to rebuild it soon.

The next step of course, is to expand the tool to allow the user to translate a whole list of postal codes, all at once. You don’t want to copy-paste the millions of postal codes by hand into the tool, and copy-paste the result back into an Excel workbook or SQL Server table.

But since I’m not a .NET developer, this could take a while. If you’re a developer, and interested in expanding and developing the postal code finder, contact me. Then we can make it a publicly available tool.

Conclusion
Looking at all of this, you’ll see that it’s possible to enrich your data for free, and without a lot of work. Okay, you still need to figure out if your postal code information is correct. Or let the API respond back with the message “G_GEO_UNKNOWN_ADDRESS”, which tells you that the address might be incorrect. Another limit is the number of API calls: 15.000 calls per day. But then again, it’s free, and you can’t have it all…

Advertisements

Geometry vs Geography

In the last few months I often received the question: “What datatype do I need to use? Geometry or Geography?”. The answer I gave them, is the most used answer in the IT world: “It depends…”. Because the answer isn’t as easy as you might think…

 
Storage
The storage of both Geometry and Geography is the same. Even though they’re functionality is different, the system datatypes are the same:

SELECT *
FROM sys.types
WHERE system_type_id = 240

The content you will find in a Geometry or Geography column is binary. But by using a number of different methods, you can easily convert this to readable text.

 
Performance
Looking at performance, it could be an easy choice: Geometry will win this. The basic polygons aren’t that different. The extended methods on the other hand, do make a big difference! This is a whole different world, so I’ll write an article about this in the upcoming weeks.

Bob Beauchemin (Blog | @bobbeauch) also gave a session about this, and he had some great examples to illustrate this. You might want to watch his Spatial Data sessions!

 
Functionality
If you’re looking for the biggest difference between the two datatypes, you’ll see it’s the functionality. A Geometry object is just a 2D, flat polygon. This means, it doesn’t matter if you take a country on top of the earth (for example Canada, which is “curved”), or a country close to the Equator (for example Brazil, which is “flat”).

A Geography object on the other hand, is a 3D (or even 4D) polygon, that has the same curve as the shape of the earth. That means that the difference between 2 point isn’t calculated in a straight line, but it’s need to consider the curvature of the earth.

A great example of this concept, is shown by Morten Nielsen (Blog | @dotMorten) on his blog. He made a tool (Shape2Sql, downloadable from his his blog), that can be used to import shape files into SQL Server.

He explains the difference between Geometry and Geography with 2 pictures, where he draws a straight line between Europe and the USA:


Geometry


Geography

As you can see, the straight line is actually a straight line if you use Geometry (planar system). But if you use Geography (spherical system), the line isn’t straight but curved. This could become an issue if you want to know the distance between Berlin and Los Angeles, and you use Geometry data to calculate this. If you use that to calculate the amount of fuel for your plane, you might end up swimming the last part of your journey!

Popping The big question
The big question you need to ask yourself is: “want do I want to do with this?”. If you want to use it just for visualization purposed, you might want to stick with Geometry data. It’s accurate, fast, and doesn’t require difficult calculations. But if you need to calculate distances across the globe, or data that represents the earth, then Geography is the way to go.

Advanced spatial data querying

In this post (part 2 of this series), I’ll try to explain the power of spatial data with real-life polygons.

 
Data
First of all, we need some data. There are several sources of spatial data, and I’ve described importing this data in a previous post. For this post, I’ll be using the Netherlands as an example dataset. You can also use the queries below on your own set of spatial data.

 
Layers
If you start working with spatial data, there are several things to keep in mind. One of these things is that spatial data consists of layers. Every set of objects you select (or draw yourself) is turned into a layer. Layers are stackable and can be joined together (with a UNION operator).

 
First selection
After importing the shape file, we start with a simple selection. In this case the city I love the most in the Netherlands: Rotterdam. This beautiful port-city in the province of South-Holland has a very recognizable shape, and is excellent for the upcoming examples.

Selecting the polygon of Rotterdam can be done with this query:

SELECT GEOG
FROM NLD_adm2
WHERE NAME_2 = 'Rotterdam';

 
Buffer
Every spatial object (Geography and Geometry) has extended methods you can use. These methods vary from very easy (draw a buffer around my object), to calculate the closest distance to another object.

Starting with a very easy one: adding a buffer to a spatial object. You can do this by using the STBuffer() method. The distance you can add as an argument, is based on the reference system of the shape file you imported. In this case we’re working with the EPSG system (SRID 4326), and the unit it used is meter:

SELECT *
FROM sys.spatial_reference_systems
WHERE spatial_reference_id = 4326

So, if you like to draw a buffer of 1 kilometer (1000 meters, or 0.62 miles/ 1091.2 yards for my American friends) around Rotterdam, we use the following query:

SELECT GEOG
FROM NLD_adm2
WHERE NAME_2 = 'Rotterdam'

UNION ALL

SELECT GEOG.STBuffer(1000)
FROM NLD_adm2
WHERE NAME_2 = 'Rotterdam';

As you can see, I selected the city of Rotterdam as a separate layer, with the “buffered layer” on top of it. I did this for visual purposes, so you can actually see what is happening to your object.

 
Get surrounding cities
One of the advantages of spatial data, is that polygons are actually connected to each other. So looking for cities around Rotterdam is pretty easy. What I did is draw a buffer of 1 meter around my city, and ask what intersects my buffer. This query actually works without the buffer, but again, I added it for visual reasons.

DECLARE @City geography;

SET @City = (SELECT GEOG FROM NLD_adm2 WHERE NAME_2 = 'Rotterdam');

SELECT GEOG
FROM NLD_adm2
WHERE GEOG.STIntersects(@City.STBuffer(1)) = 1;

 
Aggregating polygons
Sometimes you want to create your own polygon by aggregating some existing polygons. This can be accomplished by using the method UnionAggregate(). This method aggregates all polygons you select in your query. In this case, all surrounding cities and villages of Rotterdam.

DECLARE @CityBuffer geography;

SET @CityBuffer = (SELECT GEOG FROM NLD_adm2 WHERE NAME_2 = 'Rotterdam');

SELECT Geography::UnionAggregate(GEOG)
FROM NLD_adm2
WHERE GEOG.STIntersects(@CityBuffer.STBuffer(1)) = 1;

 
Conclusion
As you can see in the post above, Spatial Data is a very powerful part of SQL Server. My guess is that it will only become more and more important for businesses.

I’m still looking for a subject for part 3 of this series. I’m thinking about storage, indexing or performance ofw spatial data, but I’d like to hear it from you. What do you want to see in part 3?

Querying spatial data – the basics

A few months ago I started working with spatial data in SQL Server. From that moment on, and I’ve said this before, I’m hooked!

And since my daily work includes more and more work regarding spatial data, I’m trying to write some posts about querying spatial data.

 
Yet another source of information???
If you want to start with spatial data, and you try to find information about this subject, you’ll probably encounter the same problem I had: there’s too much information. Almost all examples are based on real-life polygons and spatial data, and in my opinion way over-engineered to start with or to answer basic questions.

That’s why I’ll try to cover the basics in this post, and maybe create a series about spatial data, without trying to reinvent the wheel.

 
What is spatial data?

Spatial data represents information about the physical location and shape of geometric objects

If we analyse this quote taken from TechNet, it tells us that every object can be visualized with spatial data? As a matter of fact, yes you can!

Every object around you, ranging from a tree to a city or country region, can be represented by one of 3 basic types. The image below (from MSDN) shows these 3 basic types:

And from these 3 types (or collections of types) you can create every spatial object you want.

 
Geometry vs Geography
In SQL Server you have the option to use 2 spatial types. The main difference between these 2, is that Geometry only stores 2D objects, and Geography can store 3D (and even 4D) objects. Also, Geometry and Geography don’t support the same exact methods. For example, Geography can’t calculate a center-point of a shape.

Another big difference is that Geometry calculates straight lines, and Geography actually compensates for the curvature of the earth.

In order to keep things simple, I created the examples in this post all in Geometry data.

Drawing your first object
To start off basic, what is easier then drawing a simple square? A square consists of 4 coordinates, and is one of the most basic forms you can draw. An example of a square looks like this:

DECLARE @Square geometry;
SET @Square = geometry::STGeomFromText('LINESTRING (0 0, 0 100, 100 100, 100 0, 0 0)', 4326);
SELECT @Square

But what if you want a solid square, instead of an outline? In that case, you need to change the type you’re drawing into a polygon. Where the 4 lines in the example above just draw the outline of the object, a polygon (like the example below) will also contain everything within the lines you draw:

DECLARE @SquareFilled geometry;
SET @SquareFilled = geometry::STGeomFromText('POLYGON((0 0, 0 100, 100 100, 100 0, 0 0))', 4326);
SELECT @SquareFilled

 
Layers
Okay, let’s take this one step further. You can also draw multiple objects in one context. These objects can be drawn next to each other, or on top of each other. Every object you draw will be drawn in a “separate layer”. Objects that don’t overlap are just 2 shapes (polygons). But if you draw 2 shapes on top of each other, it’s a whole different story. Both objects can actually aggregate into 1 big shape, or exclude each other. First, an example with 2 separate shapes:

DECLARE @Square geometry,
		@Triangle geometry;

SET @Square = geometry::STGeomFromText('POLYGON((0 0, 0 100, 100 100, 100 0, 0 0))', 4326);
SET @Triangle = geometry::STGeomFromText('POLYGON((50 50,100 150,150 50, 50 50))', 4326);

SELECT @Square
UNION ALL
SELECT @Triangle

If you run the query above, you’ll see 2 objects appear: a square and a triangle. Both object overlap at a certain point, but they’re still 2 independent shapes.

 
Layer aggregation
Until now it’s just child’s play. Now we’re getting to the exiting stuff! How about combining the 2 previous objects into one big shape?

DECLARE @Square geometry,
		@Triangle geometry;

SET @Square = geometry::STGeomFromText('POLYGON((0 0, 0 100, 100 100, 100 0, 0 0))', 4326);
SET @Triangle = geometry::STGeomFromText('POLYGON((50 50,100 150,150 50, 50 50))', 4326);

SELECT @Square.STUnion(@Triangle)

Now you’ll see that both objects merged into one single object. This is a result of “joining” 2 objects or layers. By using the extended method STUnion on one of your shapes, you can add another shape to it. So in the case, the triangle is added to the square.

 
Layer intersection
But what if you want to know the part of the polygon that intersects? So which part of object 1 overlaps object 2? You can do this by using the STIntersection method:

DECLARE @Square geometry,
		@Triangle geometry;

SET @Square = geometry::STGeomFromText('POLYGON((0 0, 0 100, 100 100, 100 0, 0 0))', 4326);
SET @Triangle = geometry::STGeomFromText('POLYGON((50 50,100 150,150 50, 50 50))', 4326);

SELECT @Square.STIntersection(@Triangle)

Or maybe you want to know which part doesn’t overlap. Then you can query the difference of both objects:

DECLARE @Square geometry,
		@Triangle geometry;

SET @Square = geometry::STGeomFromText('POLYGON((0 0, 0 100, 100 100, 100 0, 0 0))', 4326);
SET @Triangle = geometry::STGeomFromText('POLYGON((50 50,100 150,150 50, 50 50))', 4326);

SELECT @Square.STSymDifference(@Triangle)

 
Center
As you see, there are many really cool things you can do with spatial data. One other I want to show you is how to determine the center of your object. Instead of calculating it yourself, you can use a method called STCentroid:

DECLARE @Square geometry;

SET @Square = geometry::STGeomFromText('POLYGON((0 0, 0 100, 100 100, 100 0, 0 0))', 4326);

SELECT @Square
UNION ALL
SELECT @Square.STCentroid().STBuffer(10)

Just to keep it visual, I’ve added a buffer to the center point. What STBuffer does, is adding a radial to the selected object. So in this case, it created a radial around the center point.

If you didn’t draw that extra radial, it would literally just be a pixel on your screen. So by adding a buffer around the center, it’s still visible. But it’s only for visual purposes, and isn’t required to make this query work.

 
Other possibilities
Another thing I wanted to show, is a really awesome polygon made by one of my colleagues: Johannes Tedjaatmadja (@JTedjaatmadja). You have to see it for yourself, because posting it would spoil the surprise. You can download it from here. And I must say, this’ll be one to make Mladen Prajdic (Blog | @MladenPrajdic) proud! 😉

Strange behavior of spatial data

As of today, I’m kind of forced to admit I have a problem… I’m in love with spatial data. And once you’re hooked, there’s no turning back. It could be worse of course! And in these circumstances, you come across the most interesting cases…

After trying to draw a geometry polygon in SQL Server 2008, I wondered what the difference is between a polygon and a multipolygon. Polygon also accepts more then 1 polygon just like a multipolygon. So what’s the difference then? I think I found the difference, with the help of Andreas Wolter (Blog | @AndreasWolter).

Just look at the query below:

DECLARE @Obj GEOMETRY
SET @Obj = GEOMETRY::STGeomFromText('POLYGON((10 0, 10 10, 0 10, 0 0, 10 0),
											 (10 15, 10 25, 0 25, 0 15, 10 15, 10 15))'
									,4326)

SELECT @Obj.ToString(), @Obj

This query produces a valid set of polygons, even though it’s two objects in a single polygon. It draws the polygons, even though a polygon should only consist of 1 object, and not two as the example above.

This isn’t the weirdest I’ve seen. Just look at the example below:

DECLARE @Obj GEOMETRY
SET @Obj = GEOMETRY::STGeomFromText('POLYGON((10 0, 10 10, 0 10, 0 0, 10 0),
											 (10 15, 10 25, 0 25, 0 15, 10 15, 10 15))'
									,4326)

--==================================================

SELECT
	'Geo Object'					AS Description,
	@Obj							AS GeoObject,
	@Obj.ToString()					AS GeoObject_ToString,
	@Obj.STIsValid()				AS GeoObject_IsValid

UNION ALL

SELECT
	'Geo Object + MakeValid()'		AS Description,
	@Obj.MakeValid()				AS GeoObject,
	@Obj.MakeValid().ToString()		AS GeoObject_ToString,
	@Obj.MakeValid().STIsValid()	AS GeoObject_IsValid

If you run the example, you’ll see a description of the objects, the spatial object itself, the object ToString(), and a bit (boolean) for STValid(). The first record in the resultset is just the same as in the previous example. The second row contains the method .MakeValid().

As you see the first record (2 objects in 1 polygon) is not a valid geometry object. The second records shows that .MakeValid() converts your polygon into a multipolygon. And if you check if the multipolygon is valid, it returns true.

This example was getting weird, but now run the example on a SQL 2012 instance. You will see a difference in the multipolygon coordinates. I’ve ran them on both versions, and you can see the differences in the screenshot below:

SQL Server 2008:

SQL Server 2012:

The conversion to multipolygon causes a different rounding of coordinates. It seems like the converison in SQL 2008 is changed quite a bit for SQL 2012.

So what does this mean for the usage of geometry and geography data? I don’t know… This might mean we always need to use .MakeValid() before storing a polygon. Part of the issue is, that spatial data is pretty new stuff. This means that there are no best practices yet. So time will tell what the best practices will become…

Spatial Index – No catalog entry found for partition

Last week, I came across a problem with spatial indexes. Apparently there’s still a bug in SQL Server 2012 (the SQL Server team thought they’d fixed it in 2012), when building an index and using the DMVs after that.

I tried to build an index on a world map. This map contained all the country, province/state and city information of the whole world. We needed the data to feed a new company portal for our customers. To speed up the queries we ran on the data, I tried to add a spatial index. To see which index-setting worked best for us, I used the sp_help_spatial_geography_index DMV (Dynamic Management View) to look at performance statistics. This worked fine, untill I tried a number of different index setups. Then I got this error:

 

Msg 608, Level 16, State 1, Line 1
No catalog entry found for partition ID 72057594041073664 in database 9. The metadata is inconsistent. Run DBCC CHECKDB to check for a metadata corruption.

 

The first thing I did was search for the error number and specific issue. Apparently there were few problems documented. Eventually I ended up with a Microsoft Connect issue, started by my colleague and friend Bob Beauchemin (Blog | @bobbeauch).

After contacting and consulting Bob, I tried to install SQL Server 2012 CU1 (Consecutive Update) and CU2, and after both updates the error still occurred.

The conclusion is that the SQL Server team is trying to fix this in the next big update. Looking at the status of the Connect item, I personally don’t see this happening in a hotfix.

In the Connect issue I also posted a workaround. Because the building of the index is working, and only the DMV isn’t doing what it’s supposed to do, you can just restart the SQL Service and the DMV will work again. This isn’t a fix for production servers, but I assume you will only create an index once, and then the DMV will work!

 

UPDATE

With the release of SQL Server 2012 SP1 (announced at the PASS Summit 2012), this bug has been solved. You can download SP1 here. To see what’s new in SP1, read this MSDN article.

Importing and using GIS Spatial Data

For a project I’m working on right now, we (the team I’m currently a part of) decided to research the use of GIS data. The GIS data is available for free at DIVA-GIS. If you want to download the GIS data, choose the country and in the Subject drop-down, choose “Administrative areas”.

To import the data, I’ve used an easy to use .NET Tool: Shape2SQL. This tool is created by Morten Nielsen (Blog | @dotMorten), and allows the user to import Shapefiles (.SHP) into SQL Server without problems. If you want to download this tool-set, I advise you to download the “SqlSpatialTools”, which also contains “SQLSpatial.exe”, which allows you to query and visualize the data.

After downloading the GIS data and tools, run the “Sharp2Sql.exe”. You will see the following screen pop up (at first run only):

Fill in the server and database information. In my case, I imported the data on a local SQL Server:

After that, you will see a start screen like this:

Now you need to select a SHP file. If you press the button, the following window shows:

As you can see, I picked the GIS data of The Netherlands as an example. After selecting a source file, you need to chance the settings of the import:

I also changed the “Geometry Name” on the right from “geom” to “geog”, just to remind myself that the content of the column is Geography- and not Geometry-data.

Once you decided about the options and naming conventions, press “Upload to Database”, and wait for the file to be processed:

Once the processing is completed, you can start using the GIS data. You can do this straight from SQL Server, but you could also use the “SqlSpatial.exe” that you downloaded as part of the “SqlSpatialTools”. If you choose to use this tool, it would look something like this:

You can run the same query in SQL Server Management Studio (SSMS), and you will get an extra tab in the resultset:

There’s only one more thing to remember: In SQL Server 2008 and 2008 R2 you can only select 1 hemisphere at a time. SQL Server 2012 has a new version of the Geography assembly, and supports querying multiple hemispheres at the same time. For more information about this, read the MSDN article about Spatial Data Types.