Even though talking about ‘big data’ is the hipster thing to do, let’s talk about ‘large data’ (spatial)
First things first: ‘large data’ is a term I made up solely for the sake of discussing spatial data where the individual shapes in every record are by themselves huge 🙂 . Data of this sort pose their own unique problems, though these issues are solvable with a little ingenuity. The ‘big data’ problem is much harder to solve and seems to be getting a lot of attention lately.
Some examples of this kind of data are US state boundaries, county boundaries, etc. The individual state shapes in a US state boundary dataset that has not been generalized (or simplified) can by themselves be formidable. The state shapes are generally much bigger for the coastal states than for the landlocked ones. In one of the state boundary datasets I am working with, Alaska’s state shape serialized to JSON was by itself close to 5 MB. Pushing such large data as-is to the client will just slow down the loading and rendering of web maps. There are also other datasets, like eco-regions, where the individual region shapes are the union of multiple state shapes.
As you can guess, the eco-region shapes are also rather large. Such datasets create problems on multiple fronts. Because of their sheer size, these shapes need to be generalized before they can be sent to the client (browser) for display. Generalizing such shapes eats up a chunk of CPU time, so the generalized shapes really need to be cached if your app is to scale. Caching a static dataset like this shouldn’t be too much of an issue, but it is one more thing that has to be done to handle the largeness of the shapes 🙂
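To make the generalize-then-cache idea concrete, here is a minimal sketch using Python’s shapely library (my choice of tool here, not something the post prescribes). The many-vertex polygon stands in for a detailed state boundary:

```python
# Sketch: generalize a large polygon before sending it to the client.
# shapely is assumed here; the post doesn't name a specific library.
import json
from math import cos, sin, pi
from shapely.geometry import Polygon, mapping

# A crude stand-in for a large, detailed boundary: a polygon
# approximating a circle with 10,000 vertices.
detailed = Polygon([(cos(2 * pi * i / 10000), sin(2 * pi * i / 10000))
                    for i in range(10000)])

# Douglas-Peucker simplification; the tolerance is in map units and
# controls how much detail is dropped (pick it per zoom level).
generalized = detailed.simplify(tolerance=0.01, preserve_topology=True)

full_size = len(json.dumps(mapping(detailed)))
slim_size = len(json.dumps(mapping(generalized)))
print(full_size, slim_size)  # the generalized GeoJSON is far smaller
```

In a real app the `generalized` GeoJSON is what you would put in the cache, keyed by dataset and zoom level, so the simplification cost is paid once rather than per request.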
Another issue posed by large datasets concerns the spatial operations that need to be performed with them. Spatial indexes in databases make simple spatial operations like point-in-polygon really fast, even against a large and complex polygon. But calculating the intersection of two or more large, complex polygons can still be a CPU-intensive operation that takes a while. For example, given a couple of large datasets, say eco-regions and watersheds, finding all forest stand areas that lie within eco-region ‘A’ and watershed area ‘2’ can be time-consuming, because determining whether an area lies completely within another large shape is more expensive than checking whether the two merely intersect. And for more complex spatial operations that span multiple large datasets, the number of CPU cycles needed increases drastically.
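The intersects-versus-within distinction can be sketched with shapely (again my choice of tool; the geometries below are made-up stand-ins). Prepared geometries are one way to cheapen repeated predicate checks against the same large shape:

```python
# Sketch: predicate checks of a small area against a large region.
# shapely is assumed; eco_region and stand are made-up stand-ins.
from math import cos, sin, pi
from shapely.geometry import Point, Polygon
from shapely.prepared import prep

# A large eco-region: a 5,000-vertex polygon approximating a big circle.
eco_region = Polygon([(100 * cos(2 * pi * i / 5000),
                       100 * sin(2 * pi * i / 5000))
                      for i in range(5000)])

# A small forest-stand area near the center of the region.
stand = Point(1, 1).buffer(5)

# A prepared geometry builds internal indexes once, so repeated
# predicate tests against the same large shape are much cheaper.
prepared_region = prep(eco_region)

print(prepared_region.intersects(stand))  # True: shares any point at all
print(prepared_region.contains(stand))    # True only if fully inside
```

`intersects` can bail out as soon as any overlap is found, while `contains` has to verify every part of the candidate shape, which is one reason the “lies completely within” queries above cost more.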
Most databases these days (MS SQL Server, Oracle, PostgreSQL with PostGIS) come packaged with really great spatial data types and spatial operations that make the lives of application developers much easier. Saying that these databases have greatly simplified things would actually be an understatement. For transaction processing applications, the database should be highly available, fast, scalable and able to support many concurrent users. But here is the dilemma: if we perform such taxing spatial operations in the database, they will consume all the CPU cycles and keep CPU utilization at 100% for the duration of the operation. Other requests coming into the database will then either be processed more slowly, since they have to wait for the spatial operation to free up CPU cycles, or simply time out because the database has been unavailable for too long. That is not something you want happening to an online transaction processing database. One other huge culprit is the serialization and deserialization of large geometries to and from WKT/JSON/WKB to get shapes into and out of the database. Some databases, like MS SQL Server, provide CLR data types that eliminate the need for this serialization; with other databases, you might be out of luck depending on the technologies you have to use. So yes, functions like ST_GeomFromText, ST_GeomFromGeoJSON and ST_AsGeoJSON are amazingly useful, but it is good to be aware of the side effects.
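To get a feel for the serialization cost, here is a sketch that round-trips a large geometry through WKT on the client side, which is roughly the work ST_AsText / ST_GeomFromText do on the server (shapely assumed again; the 50,000-vertex polygon is a made-up stand-in for a coastal state):

```python
# Sketch of serialization overhead: round-tripping a large geometry
# through WKT copies and parses a very large string, every time the
# shape crosses the database boundary as text.
from math import cos, sin, pi
from shapely import wkt
from shapely.geometry import Polygon

# Stand-in for a big, detailed boundary: 50,000 vertices.
big = Polygon([(cos(2 * pi * i / 50000), sin(2 * pi * i / 50000))
               for i in range(50000)])

text = big.wkt             # serialize (roughly what ST_AsText does)
parsed = wkt.loads(text)   # deserialize (roughly ST_GeomFromText)

print(len(text))  # well over a megabyte of text for a single shape
```

Binary formats like WKB shrink and speed this up considerably, but the parse/copy cost never goes to zero unless the database lets you keep the geometry in its native type end to end.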
With good caching solutions, a lot of the problems described here can be alleviated, but they are problems nonetheless and don’t make your life easier 🙂 Let me know your thoughts…
If you are planning on using one of the great services available these days that let you store, retrieve, update and delete (perform CRUD operations on) your relational data (geographic or not) using an HTTP API, here is something you might need to consider. There are multiple services that let you do this, among them Google FusionTables, ArcGIS Online, CartoDB, and ArcGIS Server 10 and up. Even though the backend storage for Google FusionTables is not relational, the service presents a relational facade to the user by allowing tables to be merged, etc.
But one thing these services take away from us is the ability to perform multiple CRUD operations as one atomic unit. Assume we have a polygon layer in our ArcGIS Server map service, along with a standalone table holding attribute data related (one-to-many) to the polygon layer. If one of our requirements is that an update to a single polygon layer row AND updates to one or more related rows in the standalone table must succeed or fail together as a unit, this becomes very hard to accomplish (if it is possible at all). We could end up with cases where the polygon layer table does get updated but the update on the related table fails for one of many possible reasons, leaving bad data in the database and costing us data integrity.
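One pattern clients sometimes fall back on is a best-effort compensating update: if the second call fails, try to undo the first. The sketch below uses an in-memory `FeatureService` class that I made up purely for illustration; real HTTP APIs (ArcGIS, CartoDB, etc.) have different calls and failure modes:

```python
# Hypothetical sketch of client-side "compensation" when the service
# offers no multi-row transactions. FeatureService and update() are
# invented stand-ins, not a real API.
class FeatureService:
    """In-memory stand-in for a remote feature service."""
    def __init__(self):
        self.rows = {"polygon:1": {"status": "old"},
                     "related:7": {"status": "old"}}
        self.fail_on = None  # key whose update should simulate an HTTP error

    def update(self, key, attrs):
        if key == self.fail_on:
            raise RuntimeError("simulated HTTP 500")
        previous = dict(self.rows[key])
        self.rows[key] = dict(attrs)
        return previous  # old values, kept so the caller can compensate

def update_both(svc, poly_attrs, related_attrs):
    previous = svc.update("polygon:1", poly_attrs)
    try:
        svc.update("related:7", related_attrs)
    except Exception:
        # Best-effort rollback: restore the first row. This compensating
        # call can itself fail, which is exactly why this is weaker than
        # a real database transaction.
        svc.update("polygon:1", previous)
        raise

svc = FeatureService()
svc.fail_on = "related:7"          # make the second update fail
try:
    update_both(svc, {"status": "new"}, {"status": "new"})
except RuntimeError:
    pass
print(svc.rows)  # both rows still hold the old values
```

The compensation itself can fail (network drop, concurrent edits), so this only narrows the window for bad data; it does not give you the atomicity a database transaction would.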
Non-relational storage services are more resistant to this drawback, since the data structure design for those databases expects us to handle this up front, based on how the app uses the data being stored.
If you have any thoughts or suggestions on this issue, please leave a comment.