Big-data powerhouse – citypath.com: a self-sustaining content directory

The advent of cloud-computing, big-data processing technology, and noSQL ecosystems has enabled the design and development of automated, self-sustaining systems.

One such example is citypath.com – a self-sustaining showcase local directory designed and developed by a team of Israeli experts. This product combines cloud-computing infrastructure, big-data processing tech, a noSQL ecosystem, machine learning, NLP and an admittedly ugly design (a bit of self-irony is always good) to enable an automated system to continually generate and serve quality, up-to-the-moment content without any human intervention:

  • 100% of the listings (over 60k), including description texts, are continually being generated and published automatically by the system
  • The system is self-sustaining, requiring no human intervention. The showcase site on which the platform is deployed has reached a PR5 rating
  • Due to the high quality and usefulness of the content, Google is picking up and indexing the listings as original content, and the site remains unaffected by Panda or Penguin
  • The system is platform-agnostic and can be used to serve all 4 screens (desktops, smartphones, tablets, smart TVs)

As a showcase product, the system is currently configured to produce local content (listings for places and events). However, with some minor development, it can be configured to handle other domains (e.g. sports, e-commerce, news, etc.).

Technically, the platform comprises three distinct sub-systems:

  • Content aggregation and mastering
  • Content analysis and generation (e.g. quality scores, qualitative indicators, auto descriptions)
  • Content publishing and delivery – using a noSQL ecosystem

While these are designed to work as part of a single seamless product, each can also function as a stand-alone platform complete with separate management systems.

Below is a high-level description of how the content-side tech works. As you will see, the process is almost fully automatic, requiring only minimal human intervention:

Content-side preliminary processes:

  • New content sources for scraping are chosen
  • Regular expressions (extraction rules) are written for each new source
  • Optional – once tags are extracted, they can be mapped to a canonical ontology via a simple interface

Note: the last two processes (regular expressions and mappings) can also be automated, and we already have the basic algorithmic design for such automation.
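
To make the preliminary steps more concrete, here is a minimal Python sketch of per-source extraction rules and tag mapping. The rule patterns, the field names, the TAG_MAP entries and the sample page are all hypothetical; the post does not describe the actual rule format used by the system.

```python
import re

# Hypothetical extraction rules for one source; real rules are written per source.
RULES = {
    "name": re.compile(r'<h1 class="title">(.*?)</h1>', re.S),
    "category": re.compile(r'<span class="cat">(.*?)</span>', re.S),
    "phone": re.compile(r"Tel:\s*([+\d][\d\s-]{6,})"),
}

# Hypothetical mapping from source-specific tags to a canonical ontology.
TAG_MAP = {"eatery": "Restaurant", "bistro": "Restaurant", "pub": "Bar"}


def extract(html: str) -> dict:
    """Apply every rule to the page and keep the first match for each field."""
    record = {}
    for field, pattern in RULES.items():
        match = pattern.search(html)
        if match:
            record[field] = match.group(1).strip()
    return record


def canonicalize(tag: str) -> str:
    """Map a raw tag to its canonical category; unmapped tags pass through."""
    return TAG_MAP.get(tag.lower(), tag)


page = '<h1 class="title">Cafe Luna</h1><span class="cat">Bistro</span> Tel: +972 3-555-0100'
record = extract(page)
record["category"] = canonicalize(record["category"])
print(record)
```

Keeping the rules and the mapping as plain data is one way such a setup could let new sources be added without touching the crawler code.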

Content crawling and extraction:

  • Based on the rules, the system automatically crawls the sources, performs boilerplate removal, and extracts relevant entities (e.g. categories, metadata, geo-data, descriptions, reviews, etc.)
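
As an illustration of the boilerplate-removal step, the sketch below uses Python's standard html.parser to drop the contents of tags that usually hold navigation and page chrome. The SKIP_TAGS set is an assumption made for this example; the post does not say which boilerplate-removal technique the system actually uses.

```python
from html.parser import HTMLParser

# Tags whose contents are treated as boilerplate in this sketch (an assumption).
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}


class TextExtractor(HTMLParser):
    """Collects visible text that is not nested inside a boilerplate tag."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside any boilerplate tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)


def strip_boilerplate(html: str) -> str:
    """Return the page text with boilerplate sections removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = "<nav>Home | About</nav><p>Cozy bistro near the port.</p><footer>Contact us</footer>"
print(strip_boilerplate(sample))
```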

Content processing and updating:

  • Deduplication algorithms – automatic flagging and merging of duplicate listings (multiple items that describe the same place or event) to create one unique master record
  • Auto tagging algorithms – each master listing is automatically mapped to a main category, related categories, tags, and geo-coding data (including coordinates)
  • Auto text algorithms – a unique description text is automatically generated for each qualifying master listing (yes, this thing can write texts, and quite good ones actually)
  • Content synthesis algorithms – unique synthetic qualifiers are automatically generated for each qualifying master listing (e.g. quality scores)
  • Content update and enrichment – a set of scheduled tasks automatically extracts data to enrich existing listings and to locate, extract and master new listings
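
The deduplication and mastering step can be sketched roughly as follows. This is not the system's actual algorithm: the name-similarity threshold, the 100-metre radius, the greedy clustering and the merge policy are all assumptions chosen to keep the example short.

```python
import math
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase the name and drop punctuation before comparing."""
    return re.sub(r"[^a-z0-9 ]", "", name.lower()).strip()


def distance_m(a, b) -> float:
    """Approximate distance in metres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    x = (lon2 - lon1) * math.cos((lat1 + lat2) / 2)
    y = lat2 - lat1
    return 6371000 * math.hypot(x, y)


def is_duplicate(a, b, name_threshold=0.85, max_distance_m=100) -> bool:
    """Two listings match if their names are similar and they are close together."""
    similarity = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    return similarity >= name_threshold and distance_m(a["coords"], b["coords"]) <= max_distance_m


def merge(records) -> dict:
    """Build one master record: longest name, union of tags, list of sources."""
    return {
        "name": max((r["name"] for r in records), key=len),
        "coords": records[0]["coords"],
        "tags": sorted({t for r in records for t in r["tags"]}),
        "sources": [r["source"] for r in records],
    }


def master(listings) -> list:
    """Greedy clustering: each listing joins the first cluster whose seed it matches."""
    clusters = []
    for item in listings:
        for cluster in clusters:
            if is_duplicate(cluster[0], item):
                cluster.append(item)
                break
        else:
            clusters.append([item])
    return [merge(c) for c in clusters]


listings = [
    {"name": "Cafe Luna", "coords": (32.0800, 34.7800), "tags": ["cafe"], "source": "site-a"},
    {"name": "Cafe-Luna", "coords": (32.0801, 34.7801), "tags": ["coffee", "cafe"], "source": "site-b"},
    {"name": "Port Bar", "coords": (32.0970, 34.7740), "tags": ["bar"], "source": "site-a"},
]
masters = master(listings)
print(len(masters), masters[0]["sources"])
```

Comparing every listing against every cluster seed is quadratic, so at tens of thousands of listings a real pipeline would presumably first bucket candidates (for example by geographic grid cell) before running the pairwise check.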

Content QA:

The system is equipped with automated content QA protocols. These protocols are inserted at critical points along the content processing cycle to ensure that only the best quality content is mastered and published. They flag content items that fail certain checks. Examples of what the protocols flag include:

  • Insufficient sources
  • Low quality source data
  • Faulty meta-data
  • Data collisions
  • Suspicious names

The flagging protocols may prevent data from being mastered, or prevent low quality masters from being published. Flagged content items – whether mastered or unmastered – are stored in a DB and are accessible for moderation and manual QA via the management system. In addition, the system continually re-evaluates flagged items and tries to repair them so that they can be unflagged.
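
A flagging protocol of this kind could look something like the following sketch. The flag names, the record fields and the thresholds (for instance, requiring at least two sources) are assumptions for illustration only, not the system's actual checks.

```python
def qa_flags(master: dict, min_sources: int = 2) -> list:
    """Return the flags raised for a master record; an empty list means publishable."""
    flags = []

    # Insufficient sources: too few independent sources back this record.
    if len(master.get("sources", [])) < min_sources:
        flags.append("insufficient_sources")

    # Faulty meta-data: coordinates missing or outside the valid range.
    lat, lon = master.get("coords", (None, None))
    if lat is None or lon is None or not (-90 <= lat <= 90 and -180 <= lon <= 180):
        flags.append("faulty_metadata")

    # Data collision: sources disagree on a field that should be unique.
    if len(set(master.get("phones", []))) > 1:
        flags.append("data_collision")

    # Suspicious name: very short or purely numeric.
    name = master.get("name", "")
    if len(name) < 3 or name.isdigit():
        flags.append("suspicious_name")

    return flags


good = {"name": "Cafe Luna", "coords": (32.08, 34.78), "sources": ["a", "b"], "phones": ["+972 3-555-0100"]}
bad = {"name": "12", "coords": (132.0, 34.78), "sources": ["a"], "phones": ["1", "2"]}
print(qa_flags(good), qa_flags(bad))
```

Because the function is pure, it can be re-run on a flagged item after each enrichment pass, which is one way the automatic unflagging described above might be implemented.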

Content publishing:

Processed listings (places or events) can be disseminated in a number of ways:

  • Automatically published as web pages on a searchable website (e.g. citypath.com) – we use a combination of Cassandra and Solr with customized configuration to support search
  • Automatically served on top of a recommender engine or app
  • Delivered on-demand via an API
  • Delivered as a data-dump in any standard format
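
For the data-dump option, a minimal sketch using only the Python standard library might look like this. The field list and the choice of JSON and CSV as the "standard formats" are assumptions; the post does not specify which formats or schema the system exports.

```python
import csv
import io
import json

# Hypothetical export schema for a listing.
FIELDS = ["name", "category", "lat", "lon"]


def to_json(listings: list) -> str:
    """Serialize listings as a JSON array."""
    return json.dumps(listings, ensure_ascii=False, indent=2)


def to_csv(listings: list) -> str:
    """Serialize listings as CSV, ignoring any fields outside the export schema."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction="ignore", lineterminator="\n")
    writer.writeheader()
    writer.writerows(listings)
    return buffer.getvalue()


dump = [{"name": "Cafe Luna", "category": "Restaurant", "lat": 32.08, "lon": 34.78}]
print(to_csv(dump))
```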

Management (CRUD):

  • All the data collected and synthesized by the system is automatically synched and saved in a relational DB
  • The data can be viewed and managed (CRUD) via a back-office management system (DBMS). This management system allows non-technical personnel to generate complex SQL queries using a simple and intuitive interface
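
One way an interface like this could turn form selections into SQL is a small query builder that whitelists fields and operators and emits a parameterized statement. The table name, columns and operator names below are hypothetical, and sqlite3 stands in for whatever relational DB the system really uses.

```python
import sqlite3

# Hypothetical whitelist of filterable columns and supported operators.
ALLOWED_FIELDS = {"category", "city", "quality"}
OPERATORS = {"eq": "=", "gte": ">=", "lte": "<="}


def build_query(filters):
    """Turn [(field, op, value), ...] into a parameterized SELECT statement."""
    clauses, params = [], []
    for field, op, value in filters:
        if field not in ALLOWED_FIELDS or op not in OPERATORS:
            raise ValueError(f"unsupported filter: {field} {op}")
        clauses.append(f"{field} {OPERATORS[op]} ?")
        params.append(value)
    where = " AND ".join(clauses) or "1=1"
    return f"SELECT name FROM listings WHERE {where} ORDER BY name", params


# In-memory demo data so the sketch runs on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE listings (name TEXT, category TEXT, city TEXT, quality REAL)")
conn.executemany(
    "INSERT INTO listings VALUES (?, ?, ?, ?)",
    [
        ("Cafe Luna", "Restaurant", "Tel Aviv", 8.5),
        ("Port Bar", "Bar", "Tel Aviv", 6.0),
        ("Old Mill", "Restaurant", "Haifa", 9.1),
    ],
)

sql, params = build_query([("category", "eq", "Restaurant"), ("quality", "gte", 8.0)])
rows = [row[0] for row in conn.execute(sql, params)]
print(sql, rows)
```

Values are always passed as parameters and only whitelisted column names reach the SQL text, so choices made in the interface cannot inject arbitrary SQL.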

Personalized content delivery:

Citypath.com is integrated with Datapersona – an on-demand personalization engine developed by the same team. This engine can dynamically tailor the content and/or search results to match the interests of each individual user across all devices. It comes complete with its own management system and set of APIs (see: Datapersona – technology). You can refer to the videos posted here – Overview, Management System – to get a sense of how this engine operates against the backdrop of our showcase local directory.
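
To give a rough idea of what tailoring search results to a user's interests can mean, here is a generic re-ranking sketch. It is not Datapersona's API or algorithm, which this post does not describe; the interest profile, the tag-overlap scoring and the weight parameter are all assumptions.

```python
def personalize(results: list, interests: dict, weight: float = 0.5) -> list:
    """Re-rank results by base relevance plus a weighted interest overlap."""

    def score(item: dict) -> float:
        overlap = sum(interests.get(tag, 0.0) for tag in item["tags"])
        return item["score"] + weight * overlap

    return sorted(results, key=score, reverse=True)


results = [
    {"name": "Port Bar", "score": 0.9, "tags": ["bar"]},
    {"name": "Cafe Luna", "score": 0.8, "tags": ["cafe", "brunch"]},
]
# Hypothetical interest profile for one user.
interests = {"cafe": 0.6, "brunch": 0.4}

ranked = [item["name"] for item in personalize(results, interests)]
print(ranked)
```

With an empty interest profile the original relevance order is preserved, so the same code path can serve anonymous users.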

You can check out our autonomous and personalized local directory on citypath.com.
