Location-based Search for Property Information

This is the third in a series of posts on a side project of mine, NJ Parcels. Read the first and second parts.

In the summer of 2015, we decided to get takeout from a local Greek place. We were apparently not the only folks with the same idea. When I arrived, there was a line outside and all the seats in this small place were occupied. They apologized and told me my order would take about 20 more minutes. So I hopped back in the car and took a short drive around the neighborhoods south of Glassboro.

After passing several houses either for sale or just plain interesting from the curb, I realized that I needed an easy way to search my site based on my phone’s current location. The county roads in South Jersey all have names that vary, often changing from mile to mile. For example, Woodbury Road becomes “Woodbury-Glassboro Road” somewhere between Glassboro and Woodbury, and then eventually changes to Glassboro Road in Woodbury. The numbering resets at one point as well, which makes searching for a specific house difficult at times.

After dinner, I wrote down my thoughts on what I would need to put a location-based search together. That weekend in June, I implemented it as “Properties Around Me” and included a cookie to conduct some basic tracking of how the search was used. The cookie expires two weeks after the last request and is only set if the user agrees to share his or her location. It’s a simple unique identifier: an MD5 hash of several user attributes. The two-week expiration lets me link users who access the page at least once every two weeks and see their usage patterns. If you use the search less frequently (or use different devices or delete/block cookies), you’ll register as a new user with no history. Continue reading
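
The post doesn’t spell out the server-side stack or the exact attributes that go into the hash, so the following is only a minimal, Flask-style Python sketch of the mechanism described above; the cookie name and the hashed attributes are placeholders of my own.

import hashlib
from datetime import datetime, timedelta

COOKIE_NAME = "visitor_id"            # hypothetical cookie name
COOKIE_LIFETIME = timedelta(weeks=2)  # expires two weeks after the last request

def visitor_id(user_agent, accept_language, ip):
    """A simple unique identifier: an MD5 hash of several user attributes."""
    raw = "|".join([user_agent, accept_language, ip])
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

def set_tracking_cookie(response, request, location_shared):
    """Only set the cookie if the user agreed to share their location.
    Re-setting it on every request slides the expiration forward two weeks."""
    if not location_shared:
        return response
    uid = visitor_id(request.headers.get("User-Agent", ""),
                     request.headers.get("Accept-Language", ""),
                     request.remote_addr or "")
    response.set_cookie(COOKIE_NAME, uid,
                        expires=datetime.utcnow() + COOKIE_LIFETIME)
    return response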

Posted in NJ Parcels, Privacy, Technology | Comments Off on Location-based Search for Property Information

Open Data and Personal Privacy

This is the second in a series of articles regarding a side project of mine, NJParcels.com. Read the first article here. Read part three here.

Once GlassboroMap and NJ Parcels started to get considerable traffic, I began receiving emails asking for the owner information to be removed from the site. While this was public data and, I imagine, has always been available, the ease with which a search by name could bring up this information worried some people. Having known about assessment and parcel data for a while, I was initially surprised at the level of concern the site caused.

Of the emails received, most were courteous.

Screenshot of a redacted email.

Others were angry.

Continue reading

Posted in Data, NJ Parcels, OPRA, Privacy, Technology | 3 Comments

Update on a side project

This is the first in a series of posts about a side project I’ve been developing and maintaining for the past few years. This introduction will be accompanied by subsequent posts dealing with the technical, commitment, and emotional issues associated with running a project on the side. Read part two here.

Back in 2009, I moved down to Glassboro, New Jersey to be close to work. I had spent that summer contributing to OpenStreetMap and wanted to spend more time working with local data. In searching for a new home, I wanted to use property information and other points-of-interest data to get (re)acclimated to the area.

At the time, I recall web GIS falling into two categories: the Google Maps mashup or the ArcIMS/ArcGIS Server desktop-on-the-web GIS. Typically, the map mashups were tailored to a specific region or data set. The clunky web viewers set up by county and regional governments had valuable data, but an absolutely abysmal user experience.

So, in the summer of 2009, I set out to make a web site with some local GIS data for Gloucester County. I originally planned to pull in information from OSM and other sources to paint a picture of the community, highlighting what would be most useful to someone considering moving to or visiting the area. I bought GlassboroMap.com that fall, set it up on DreamHost shared hosting, and began building a web map, starting with property information for the county.

Screenshot of GlassboroMap.com

GlassboroMap.com – perpetually in beta

Continue reading

Posted in NJ Parcels, Technology | 6 Comments

Optimize Your Indexes and Selections

I have been working on a project that is driven by a series of functions written in PL/pgSQL. The functions look at New Jersey’s land use data over several different time periods and select out certain areas if they have the right mix of land use types and meet an overall size threshold. The project requires the same selection process with approximately 60 different selection criteria, operating over 5 different time periods of land use data. An efficient, programmatic approach to the problem is absolutely necessary.

Last night, I finished writing a function to perform another set of steps in this selection process. Because land use changes over time, the function creates 5 different sets of polygons, one for each time period (1986, 1995, 2002, 2007, 2012), representing the size thresholds the land use data must meet. After selecting the non-compliant areas, the function marks the underlying land use as “not selected”, using a bitmask that records selection/rejection for each time period. For example, I only want to include land use polygons whose contiguous area is greater than 50 acres. Individual polygons will be smaller than 50 acres, but they should be included if they are part of a contiguous fabric of polygons that exceeds that size.
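
To make the bitmask idea concrete, here is a small illustration that can be run directly in psql. The bit ordering (rightmost bit for 1986) matches the '11110' constant used in the update shown further down, but the literal values are purely for demonstration.

-- One bit per period; here the rightmost bit stands for 1986.
-- Start fully selected, then clear the 1986 bit for polygons that
-- failed the 50-acre contiguity test:
SELECT '11111'::bit(5) & '11110'::bit(5) AS after_1986_rejection;  -- 11110

-- A polygon is still selected for 1986 only if that bit survived:
SELECT ('11110'::bit(5) & '00001'::bit(5)) <> '00000'::bit(5) AS selected_1986;  -- false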

Before leaving work, I ran the function against some test data and when I arrived this morning I found the query was still running. It took 7 hours to complete the one step!

NOTICE: Generating patch polygons...
NOTICE: Applying constraint to patch polygons...
NOTICE: Applying bitmask to polygons that fail to meet size threshold...
NOTICE: Patch Size Requirement Constraint Complete.
NOTICE: Query execution time: 07:00:54.204277
query result with 1 row discarded.

The step causing the delay was the final update to the base data:

EXECUTE $$
 UPDATE $$||tblbase||$$ b 
    SET transition_mask = transition_mask & '11110'::bit(5)
   FROM $$||tblpatch||$$ patch 
  WHERE patch.period = 1986 AND patch.size_threshold = 0 
    AND ST_Intersects(patch.shape, b.shape)
;$$;

Keep in mind this is one of five such functions, one for each time period. What was causing it to run so slowly? Well, the land use data is a conflation of all five time periods, weighing in at about 2.5 million polygons. I had created a spatial index on both tables, but clearly that was not enough. Luckily, this data also has a region identifier. The land use data was split into about 8,000 regions, each with its own unique region identifier. As the “patch” data was generated from the same land use, we could include the region identifier in it as well to help optimize the query. It was safe to use the region id, as no two polygons with different region ids would touch and be contiguous.

I modified the “patch” creation portion of the function to include the region identifier and then changed the update, like so:

EXECUTE $$
 UPDATE $$||tblbase||$$ b 
    SET transition_mask = transition_mask & '11110'::bit(5)
   FROM $$||tblpatch||$$ patch 
  WHERE patch.period = 1986 AND patch.size_threshold = 0 
    AND patch.newregionid = b.newregionid
    AND patch.shape && b.shape AND ST_Intersects(patch.shape, b.shape)
;$$;

I also realized I was missing a step I had used elsewhere in this project. The double-ampersand (&&) operator performs a minimum-bounding-rectangle comparison between the two geometries, a check that can be answered by the spatial index alone. So I added it as well, in the hopes that it would improve performance. Running the function again, the process now takes about 15 minutes, roughly 28 times faster than before.

Here’s the EXPLAIN on the first version of the function:

Update on baseland b  (cost=16.47..20972348.39 rows=4192474 width=526)
  ->  Nested Loop  (cost=16.47..20972348.39 rows=4192474 width=526)
    ->  Seq Scan on curr_patch patch  (cost=0.00..48093.89 rows=49087 width=1546)
         Filter: ((period = 1986) AND (size_threshold = 0))
    ->  Bitmap Heap Scan on baseland b  (cost=16.47..425.72 rows=34 width=520)
         Recheck Cond: (patch.shape && shape)
         Filter: _st_intersects(patch.shape, shape)
         ->  Bitmap Index Scan on sidx_baseland  (cost=0.00..16.46 rows=101 width=0)
              Index Cond: (patch.shape && shape)

And with the changes to the where clause:

Update on baseland b  (cost=0.41..458738.64 rows=1 width=526)
 ->  Nested Loop  (cost=0.41..458738.64 rows=1 width=526)
   ->  Seq Scan on curr_patch patch  (cost=0.00..48093.89 rows=49087 width=1553)
        Filter: ((period = 1986) AND (size_threshold = 0))
   ->  Index Scan using sidx_baseland on baseland b  (cost=0.41..8.36 rows=1 width=520)
        Index Cond: ((patch.shape && shape) AND (patch.shape && shape))
        Filter: ((patch.newregionid = (newregionid)::text) AND _st_intersects(patch.shape, shape))

No longer are we relying on a heap scan; instead, we use the spatial index, with both the text comparison and the more thorough ST_Intersects() validating the rows returned from the index scan. I’m still amazed that the query planner (this is on PostgreSQL 9.3.9 with PostGIS 2.1.7) doesn’t use the MBR/index scan when comparing features using ST_Intersects. It’s always good to run EXPLAIN against your queries and test them in isolation. Just because you created an index doesn’t mean the database is actually using it. And it might seem redundant to type a.shape && b.shape AND ST_Intersects(a.shape, b.shape), but I’m happy to do it if it saves hours of waiting for tasks to complete.

Posted in PostgreSQL, Technology, Tools and Scripts | Comments Off on Optimize Your Indexes and Selections

Using Tableau to visualize land use change

Map of Land Use Categories by County. From the Tableau dashboard.

I was home sick on Wednesday, so while I was on the couch I decided to dive into Tableau Public, a free desktop visualization tool. At IERP, we use Tableau for some of our public dashboards, but other than some minor playing around, I had not really used the tool much prior to this. It’s really great for working with complex data and seeing results quickly. Since the 2012 Land Use data was released a few weeks ago, I wanted to see if I could bring it into PostgreSQL and produce some graphics of how land use has changed over time.

View the Tableau Dashboard, the code to reproduce the data, and read on.

Continue reading

Posted in Data, Planning, PostgreSQL, Technology, Tools and Scripts, Visualization | 1 Comment

Unix Philosophy

Often, when I go to tackle a problem, I look to use the simplest tools available first. Unix (and its derivatives/descendants, like Mac OS X and Linux) was designed with the following principles in mind:

Write programs that do one thing and do it well. Write programs to work together. Write programs to handle text streams, because that is a universal interface. —Doug McIlroy

One of my side projects is an index of property assessment records for New Jersey. The pages containing the property records all have the same URL structure:
/property/muncode/block/lot
Where the “muncode” is the four-digit identifier for New Jersey municipalities published by the Division of Taxation, and the “block” and “lot” are the local-level parcel identifiers. So each directory essentially reflects the spatial distribution of properties – lots will be near other lots within the same block or municipal folder.

I recently wanted to see how each town was performing in terms of page views. Now, I can see such a report using the Content Drilldown in Google Analytics, but I really did not want to write something substantial to get at the information programmatically for further analysis.

Content Drilldown for NJParcels.com

Content Drilldown in Google Analytics


Content Drilldown is available through the API, so I could write something that authenticates against the API, performs the queries and stores the results, but that would be overkill, considering I also have the access log from Apache at my disposal.

I was able to make a report similar to what is on Google Analytics by using several Unix tools. Here’s what I did in one line to get the same type of information.

$ grep -oP '/property/\d+' access_log | sort | uniq -c | sort -rn
12289 /property/1508
11226 /property/0714
9630 /property/1506
8272 /property/0906
8130 /property/1507
...

First, grep (globally search with regular expressions and print) takes in a file – in this case, “access_log” – and returns only the portion of each line that matches the regular expression “/property/\d+”. The significant portion of that is the \d+, which matches one or more digits. Normally, grep returns the entire line when there is a match; the -oP options are two flags that say only return the match (-o) and use Perl-compatible regular expressions (-P).

If you’re not familiar with Unix, you may have once looked at your keyboard and thought, “what’s that vertical line and when would I ever need it?” The vertical line is often referred to as the “pipe” character, as it signifies that the output of one Unix tool should be passed (or piped) into the next tool written on the command line. With this, we can channel the results of a tool into the next tool for further processing.

The pipe character persists, even on an iPad keyboard.

We pipe the output of grep (a list of all second-level directories accessed by visitors) into sort. sort does what it sounds like, sorting the lines returned alphabetically. We sort the output of grep so that the next tool, uniq, can give us our count.

uniq collapses duplicate lines into one and can also provide a count of lines collapsed (using -c). In order for uniq to work properly, we need to sort the file first.

Finally, I sort the piped output one more time, now using the -r and -n flags, which reverse the order and sort numerically instead of alphabetically, allowing me to see the most-visited municipalities at the top of the output.

I can further process these results with other tools like awk and sed, for example reformatting them to CSV for loading into a database, as in the sketch below.
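
A quick sketch of that last step: awk reformats the counts into muncode,hits pairs (the output filename is just a placeholder).

$ grep -oP '/property/\d+' access_log | sort | uniq -c | sort -rn \
    | awk '{gsub("/property/", "", $2); print $2 "," $1}' > hits_by_muncode.csv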

The Unix philosophy has always been in the back of my mind when working with Desktop GIS. The UIs are always so busy and complex that you often struggle to know which tool to use, or you know the tool but can’t find it. Or you’re stuck in a point-and-click mentality, where the tools expect human intervention in order to work. While concepts like ModelBuilder in ArcGIS Desktop are a step in the right direction, it still leaves much to be desired. Don’t get me wrong, I’m not suggesting we go back to AML, but we should put more thought into the tools that are available to us and use the ones that are best for the job. While I’m most at home programming in Python and could easily have written something in that language to parse the file and tally the results, it was ultimately much quicker to briefly experiment with existing tools and come up with a solution.

The right tools aren’t always the familiar ones, but you might become more familiar with them with some experimentation.

Posted in Data, Technology, Tools and Scripts | Comments Off on Unix Philosophy

Generating row and column IDs for a hexagon grid

Hexagon grid output.

Recently, when working on a project that uses a hexagon grid, I used the awesome mmqgis plugin for QGIS to produce the grid. The one small shortcoming with the tool is that I would prefer a cell identifier that relates back to the position of the cell in the grid. I wanted an ID similar to a Cartesian grid, specifying cells by row, then column. An integer “object id” wasn’t good enough.

The mmqgis plugin produces the grid according to your specifications as a shapefile in the current map projection. The plugin, for some reason, writes out an empty .prj file. I moved the data into PostgreSQL using ogr2ogr and explicitly defined the projection (in this case, Web Mercator – EPSG:3857) while doing so.

$ ogr2ogr --config PG_USE_COPY YES \
    -f PGDump hex.sql hex.shp \
    -lco GEOMETRY_NAME=shape \
    -lco FID=objectid \
    -s_srs "EPSG:3857" \
    -t_srs "EPSG:3857"

I then began to work on creating my ID column. First, I would need to calculate a sequential number for both the row and the column.

ALTER TABLE hex ADD COLUMN rowid INT;
ALTER TABLE hex ADD COLUMN colid INT;

I would then need a sequence to generate the series of numbers used to count off the rows and columns.

CREATE TEMPORARY SEQUENCE idxid START WITH 1;

Now, I can use an update and a self-join to come up with a list of numbers for each column of hexagons. Hexagons are identified based on the integer value of the X coordinate of the cell’s center point. Joining the table to itself allows me to use the center points as the join criteria on a select distinct of centroid X values.

UPDATE hex SET colid = a.columnid
FROM (SELECT nextval('idxid') as columnid, b.centroidint
  FROM (SELECT DISTINCT ST_X(ST_CENTROID(shape))::int as centroidint 
    FROM hex 
    ORDER BY 1 ASC) b
  ) a
WHERE a.centroidint = ST_X(ST_CENTROID(shape))::int;

After calculating all of the column values, I restart the sequence and calculate the row values.

ALTER SEQUENCE idxid RESTART WITH 1;

UPDATE hex SET rowid = a.rowid
FROM (SELECT nextval('idxid') as rowid, b.centroidint
  FROM (SELECT DISTINCT ST_Y(ST_CENTROID(shape))::int as centroidint 
    FROM hex
    ORDER BY 1 ASC) b
  ) a
WHERE a.centroidint = ST_Y(ST_CENTROID(shape))::int;

Once I have a row and a column number for every cell, calculating the field is a straightforward update using concatenation.

ALTER TABLE hex ADD COLUMN hexid CHARACTER VARYING (10);
UPDATE hex SET hexid = 'r'||rowid::text||'c'||colid::text;

While writing this, I started to think about how I would go about doing this using ArcGIS. It was relatively quick to do – about 10 minutes from “I could use a different ID column” to the column being calculated. I am still not sure how I would do it using Arc. If you know a way to do this in ArcGIS, I’d love to hear about it – please leave a comment below.
You don’t have to do all of your GIS work in a database, but know that it can make life easier in many cases.

Posted in Data, PostgreSQL, Tools and Scripts | Comments Off on Generating row and column IDs for a hexagon grid

“Doing more with SQL” talk at MAC URISA

Last week, I gave a 30 minute talk on “Doing More with SQL” at MAC URISA 2014. The slides from the talk are available below:

Continue reading

Posted in Conferences, PostgreSQL, Technology | 3 Comments

Insignificant Spaces

Yesterday, I mentioned that I discovered ArcGIS’s inconsistent handling of spaces within text fields. Today, I tested to see how ArcGIS handles NULLs, Empty Strings and Spaces within its native file formats.

Selecting 3 spaces selects all the strings.
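
For what it’s worth, if the same attribute table is loaded into PostgreSQL, a quick way to tally how many values fall into each of the three cases looks something like the query below; the table and column names are placeholders, not anything from the actual test data.

SELECT CASE
         WHEN owner_name IS NULL THEN 'NULL'
         WHEN owner_name = '' THEN 'empty string'
         WHEN btrim(owner_name) = '' THEN 'spaces only'
         ELSE 'has text'
       END AS value_kind,
       count(*)
  FROM parcels          -- placeholder table/column names
 GROUP BY 1
 ORDER BY 2 DESC;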

Continue reading

Posted in Data, PostgreSQL, Technology, Tools and Scripts | Comments Off on Insignificant Spaces

Key issue with New Jersey’s parcel data

I’ve been working on a project to visualize differences between assessed value and levied taxes in New Jersey and I discovered a bug in some of New Jersey’s parcel data.

screenshot of ArcGIS with some incorrect parcels selected

256 parcels with the same PAMS_PIN key.

New Jersey uses “PAMS_PIN” as a key between the GIS data and the tax assessors’ rolls. The key is a simple concatenation: Municipal Code (a four-digit value for each municipality), Block, and Lot, joined by underscores. If the Qualifier Code (“QCODE”) field is populated, then that is also appended, again with an underscore. In Somerset County, many of these PAMS_PIN keys are incorrect, resulting in duplicates.
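
As a rough sketch of that construction, assuming the layer has been loaded into PostgreSQL, the expected key can be rebuilt and the duplicates tallied with queries like the ones below. Only PAMS_PIN and QCODE are named above; the table name and the mun_code/block/lot columns are placeholders and are assumed to be text.

-- Rebuild the expected key and flag rows where the published PAMS_PIN disagrees.
SELECT pams_pin,
       mun_code || '_' || block || '_' || lot
         || CASE WHEN qcode IS NOT NULL AND qcode <> ''
                 THEN '_' || qcode ELSE '' END AS expected_pin
  FROM parcels
 WHERE pams_pin IS DISTINCT FROM
       mun_code || '_' || block || '_' || lot
         || CASE WHEN qcode IS NOT NULL AND qcode <> ''
                 THEN '_' || qcode ELSE '' END;

-- Count how many records share the same key (duplicates like the 256 above).
SELECT pams_pin, count(*)
  FROM parcels
 GROUP BY pams_pin
HAVING count(*) > 1
 ORDER BY count(*) DESC;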

Continue reading

Posted in Data, Tools and Scripts | Comments Off on Key issue with New Jersey’s parcel data