May. 20, 2009 - Search Engines and the MLS Data Scraping Question
"Is Google a scraper?" That was the question at the center of news stories surrounding MIBOR's decision to tell a broker not to let Google index their site. The quick answer is "No" - there was no restrictive terms of service or limiting robots.txt file on the site, so technically Google did absolutely nothing wrong. But the question being asked ... that was the wrong question.
Finally, after the hype died down, the 'real' question started to emerge: "Should or could MLSs require that brokers not allow individual listing pages to be indexed by search engines?" Listings are given to brokers for advertisement, unless the seller opts out of online advertisement, and most consumers search for property online. Search engines are therefore an important part of online marketing, an important component of giving listings the proper exposure, and should be leveraged as much as possible. Also (and obviously) the MLS could probably make rules pertaining to an IDX feed but realistically not regarding the broker's own listings. But whether search engines should be allowed to index the sites is again the wrong question.
What's the real concern here? We've had IDX for some time - was it really just okay when it was invisible to search engines? Of course not. The real concern about 'data scraping' arises only when the data is misused - that is, used for a purpose other than that intended by the homeowner when they provided the information to the real estate professional and by that professional when they added their own creative descriptions to the data to create the often copyrighted listing content.
What kind of misuse has there traditionally been? When a site is easy to scrape, someone can come along and grab the listings in an automated way for display in an unauthorized location. Data can also be recompiled to create derivative products or to market back to the consumer. If the scraper adds an automated reverse telephone lookup to scraped data, someone who gives a real estate professional information to market their property one fine morning may find themselves called by moving companies and other service providers that very evening - and it reflects poorly on the real estate professional when that happens. So, the real question we need to ask ourselves is, "How do we stop the misuse of data while not compromising the ability of the broker to market properties and promote the web sites on which the properties are located?"
Let's look at the type of requests consumers put into search engines. I believe that there has been a lot of hype about needing the whole address in the web page title and that individual addresses need their own website. Do consumers really expect to type in "100 Test Street in Testville, TN" and come back with a website? I don't think so - not at this point. We all know how the traffic comes in via web site search terms: "houses in Testville, TN" ... "Testville Tennessee real estate" ... "homes in Testville" ... "Subdivision Name in Testville". So, city, state and neighborhood/subdivision are obvious candidates to allow a search engine to index. Key attributes might also be searched on - "lake view" etc. But the full address? Price? Bedrooms? Bathrooms? Square feet? Lot size? I say, "ridiculous!" Are they needed for search engine optimization (SEO)? I believe the answer is an emphatic, "No". Those bits of data don't help search engines index the listing for marketing the property online, BUT they are prone to misuse when programmatically gathered (scraped). So there is no reason why MLSs should not require that websites put anti-scraping mechanisms in place on those key items, while allowing search engines to programmatically gather other information for the purpose of providing free links back to the web site.
But anti-scraping begins at home. Fewer than 5% of MLS public sites have any anti-scraping in place to speak of - and good measures are rarer still. But I digress - before we launch into a tangent of anti-scraping tactics, we need to agree on a strategy for the level of protection required for the data to balance marketing with information security and privacy, and we must set policy that is reflected in contract terms pertaining not only to industry sites but to syndication endpoints as well.
Note - I've been traveling for more than a week and am writing this at o-dark-thirty in an airport parking lot - it's not my finest piece of writing - sorry! Hopefully I'm getting the ideas across anyway...
May. 20, 2009 - RE: Search Engines and the MLS Data Scraping Question |
| Posted by Judith Lindenau |
Good summary of the issues. I particularly like your thoughts about agreeing on a protection strategy to balance information and security--I think guidelines need to be proposed as a prototype at a national level, and made available to local MLSs if, in fact, it is to become a truly national strategy.
May. 20, 2009 - RE: Search Engines and the MLS Data Scraping Question |
| Posted by Michael Wurzer |
Two comments:
1. The issue might not be whether consumers search for a specific address but rather how search engines rank sites that provide content for parts of addresses. For example, search Google for nearly any zip code plus the phrase real estate. Trulia is in the top 5 spots nearly every time for those searches. Is that because of other content on their site or because all of their detail property pages include the zip code in the URL or all of the above? Because we don't know what really works inside Google, everyone keeps trying different things. To have NAR inserting themselves inside that process against their members when others are not subject to those rules seems very wrong. If that's going to be the requirement, shift the debate to whether or not to eliminate IDX policies altogether, because that's nearly the same effect.
2. One of the reasons single property sites have become so popular is the limited content and anti-branding rules of the MLS. To address this, the listing agent creates a web site for the seller's listing so they can load videos and all sorts of other content the MLS doesn't allow. Because the URL often is not memorable, the agent just says "Google" it. This obviously doesn't involve IDX directly, but what is happening is that more and more agents and buyers are Googling addresses of specific properties. You're right that buyers don't Google a specific address when they are first starting their search but they often Google specific addresses during their search, just like we all "Google" for every other question we have about anything. Now, answer this: why shouldn't the URL that comes to the top in response to those searches be a fact-filled and marketing-filled page from the MLS, the supposed best source of information on real estate? |
May. 20, 2009 - RE: Search Engines and the MLS Data Scraping Question |
| Posted by Matt Cohen |
Michael, great points as always. As to your first point, sure - add zip code to the list - but won't there be a point where the risk of adding additional information is greater than the benefit? That's the discussion that I want to take place - not whether Google is evil or if we should allow IDX sites to be indexed by search engines - I think the answers to those questions are self-evident. Of course, I don't want to see NAR members disadvantaged in any way - and this is part of why we must explore the issue not just in terms of members, but syndication endpoints as previously noted.
I think the industry provides a wide variety of means to get to listings - and again, what a broker does on their own web site (or single property site) with their own listings is their own business - and they can make the URL memorable in that way if they wish outside the IDX context. I do take your point regarding why some brokers want to put up a single property site - I do see the value, though I think it is somewhat overhyped.
As to your last question, I agree with you entirely. How that is best accomplished is a subject for protracted conversations. |
May. 22, 2009 - RE: Search Engines and the MLS Data Scraping Question |
| Posted by victor lund |
First, I think that MIBOR should be commended for their careful review of the complaint and proper ruling. Google is scraping data and repurposing it from an IDX website. Furthermore, Google is profiting from scraping (they are selling ads).
How is this any different from an agent advertising another broker's listing in the newspaper without permission? Just because it is happening online, and it is Google doing the scraping does not change the rules.
Personally, if I had an agent or broker website, it would also be my goal to have the listings indexed. There is a clear benefit to the marketing of a website if you can expose all of that data. The website becomes a rich authority of deep content.
If the rule changes, everyone will start scraping data. In fact, there are 2600 domain names going up for sale June 11th that would be perfect for launching data scraping sites. They are being auctioned by JP King.
Maybe it is time for a rule change. It would not be friendly to big brokers who dominate markets with lots of listings, but it would be friendly to the consumer and any aggressive advertiser. Let any member or subscriber of any MLS promote any listing for sale in any way - Internet or otherwise - as long as the listing broker is prominently displayed in the ad. If Google or anyone else scrapes the data, it's OK as long as the listing broker is displayed. |
May. 22, 2009 - RE: Search Engines and the MLS Data Scraping Question |
| Posted by Michael Wurzer |
Victor, the case law in the US is pretty clear that Google's indexing of the web is fair use, i.e., it's not misappropriation, which is central to the IDX policy at issue. Though the policy language leaves much to be desired, interpreting the policy as preventing indexing defeats the fundamental purpose of the policy, which is to allow cooperating participants to display the listing data on the web.
I agree that the language should be revised and that time should be taken to study and craft appropriate language, but I do not agree with the original MIBOR interpretation and think what's most important here is that NAR has withdrawn the earlier position on which MIBOR's interpretation rested. http://speakingofrealestate.blogs.realtor.org/2009/05/16/nars-idx-rule-changes-need-more-study/#comment-66
To keep Paula's or others' web sites in the dark for the six months that this issue will be pending is wrong. It's based on an interpretation of the policy that goes against copyright law in the US, against the controlling term of misappropriation in the policy, and, most importantly, against the very purpose of the policy itself -- to allow participants to display the information on the web.
May. 22, 2009 - RE: Search Engines and the MLS Data Scraping Question |
| Posted by Michael Wurzer |
Also, the reason I'm glad the proposed language did not become final is that I don't think the last sentence affirmatively allowing "indexing" and "search engines" is necessary and it also opens another interpretation issue as demonstrated by this post:
http://www.bloodhoundrealty.com/BloodhoundBlog/?p=8560
What's demonstrated well by this entire controversy is that the web moves way too fast to be regulated by specific language like "scraping" or "search engines". Instead, my recommendation would be to simply say that the participant must take reasonable steps to protect the copyright in the data. In other words, invoke the current copyright laws that are controlling anyway and not get into crafting new rules and laws that will only result in controversies like this where the members are being hampered and the true scrapers keep on keeping on. |
May. 22, 2009 - RE: Search Engines and the MLS Data Scraping Question |
| Posted by Michael Wurzer |
Sorry for this stream of comments, but I have one more.
Matt, you earlier said, "before we launch into a tangent of anti-scraping tactics". I'm quite interested in hearing about the anti-scraping tactics. This is core to the MIBOR interpretation, I believe. The argument was that what was done with the information (indexing or misappropriation) was irrelevant because it was after the fact -- the horses would have already left the barn. However, I think the use made of the data is the only thing that matters, not only because that's what the policy says by prefacing everything by the term "misappropriation" but also because once the information is on the open web the only anti-scraping method I'm aware of is seeding to try to monitor what others do with the data. The only way I'm aware of to prevent scraping is not to put the data on the open web. |
May. 22, 2009 - RE: Search Engines and the MLS Data Scraping Question |
| Posted by Matt Cohen |
There are a wide variety of tactics to prevent scraping - not all of which are practical for every application, but which in proper combinations can be considered reasonable steps to deter scraping. First come the steps that don't stop the bad guys but are critical nonetheless: appropriate terms of service and a robots.txt file should be put in place. Then come the steps that make it less convenient to scrape: not using incrementable IDs that allow easy looping through results, requiring that at least some search fields be filled out, not allowing an unrealistic number of search returns or detail views within a time period, robot detection (pattern/speed detectors), and rendering key data elements as graphics (or using Java, Flash, etc.). Obviously, there are also basic secure coding practices to prevent basic application hacks. Each tactic has advantages and some have significant disadvantages. Requirements for display to prevent misuse are just one element of a data distribution policy - which realistically should address secure authentication, transmission, storage, retention AND display. |
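Two of the tactics listed above, non-incrementable IDs and a cap on detail views within a time period, can be sketched in Python. This is only an illustration under stated assumptions, not a description of any MLS or vendor system: the names `new_listing_token` and `DetailViewLimiter`, the in-memory storage, and the limits used are all hypothetical.

```python
import secrets
import time
from collections import defaultdict, deque
from typing import Optional


def new_listing_token() -> str:
    """Return a random, URL-safe public ID (hypothetical helper).

    Unlike 1, 2, 3, ... these IDs cannot be enumerated by simply
    incrementing a number in the detail-page URL.
    """
    return secrets.token_urlsafe(16)


class DetailViewLimiter:
    """Sliding-window cap on listing-detail views per client (hypothetical)."""

    def __init__(self, max_views: int, window_seconds: float) -> None:
        self.max_views = max_views
        self.window_seconds = window_seconds
        # client key -> timestamps of that client's recent detail views
        self._views = defaultdict(deque)

    def allow(self, client_key: str, now: Optional[float] = None) -> bool:
        """Record a view and return True, or return False if over the cap."""
        if now is None:
            now = time.monotonic()
        recent = self._views[client_key]
        # Forget views that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_views:
            return False
        recent.append(now)
        return True
```

For instance, `DetailViewLimiter(max_views=30, window_seconds=60)` would refuse a client's 31st detail view inside any one-minute window. How to identify a client, and how to treat known search engine crawlers, are policy decisions this sketch leaves open.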
Jun. 17, 2009 - RE: Search Engines and the MLS Data Scraping Question |
| Posted by Matt Cohen |
Brian - I look forward to the continuing discussion.
I still hold to my belief that Google did nothing wrong - however, I think our industry has a long way to go in discussing how it relates to the Internet - and topics such as how data is used by search engines, syndicators and others are ripe, perhaps over-ripe, for discussion.
Once that discussion occurs, policy is formulated and accepted, and applications reflect that policy, including but not limited to implementation of appropriate Terms of Service and robots.txt files to reflect policy, then we can yell at Google as a 'scraper' if they continue to index listings sites. Until then, I'll hold to my position regarding Google.
Again, I look forward to the continuing discussion building on all of the discourse the subject has already generated. |
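As a small illustration of a robots.txt file that reflects policy, the sketch below uses Python's standard `urllib.robotparser` module to check what a compliant crawler would be permitted to fetch. The file contents, the `/listing/` and `/communities/` paths, and the `may_fetch` helper are assumptions made for the example, not rules from any MLS. Note that robots.txt works per URL path rather than per data field, and it only guides crawlers that choose to honor it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: keep individual listing detail pages out of
# search indexes, leave everything else (e.g. community pages) crawlable.
ROBOTS_TXT = """\
User-agent: *
Disallow: /listing/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Under this example file, a compliant crawler would skip `https://example.com/listing/abc123` but could still index `https://example.com/communities/testville-tn`.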
Matt Cohen has consulted to MLSs, Associations, franchises, brokerages, and many real estate industry software companies for over 12 years. Matt is a well-regarded real estate industry expert on industry trends, software design, product management, project management, and information security. Matt speaks at conferences, workshops and leadership retreats around the country on a wide variety of MLS-related topics.
