Data, Data and More Data

By Nilesh Bansal - Thursday, April 15th, 2010 at 8:30 am  

This blog post is the first in the “Engineering” series by Sysomos’ co-founder and CTO Nilesh Bansal. As part of this series, Nilesh will share his experiences engineering Sysomos’ social media platform.

One question I am frequently asked is: what’s the biggest challenge I face? The simple answer is: data.

In the time it takes me to write this sentence, less than a minute, our crawlers will have collected tens of thousands of new conversations happening online. Within that same minute, each of these conversations is discovered, retrieved, cleaned, analyzed, and stored on our servers. Now, that is a lot of data.

We store billions of documents on our servers. Every hour, millions of them are read and analyzed by users. In the last few years, our team has experimented with a variety of options to get a better understanding of the black art of data management. I’ll share some of them in this post.

The storage layer has two main components: hardware and software. There is also a third option: outsourcing both to Amazon’s cloud infrastructure, S3 and EC2. Cloud storage is a convenient option and, if used properly, even an economical one. But convenience comes at the price of flexibility.

While Amazon has steadily added more customization options and features, there still isn’t enough flexibility to meet our needs. That lack of flexibility also limits our ability to innovate, such as our plans to start using solid-state drives. As a result, we have stayed with a conventional on-premise solution.

Storage Hardware

Disks and storage bays are the most expensive items on a purchase order. They are also the slowest and the least reliable parts of the system, so they have to be selected carefully.

There are three main architecture options. First, network mounts and NAS obviously will not work given our low-latency requirements. Second, a fiber-based SAN offers the flexibility to add new disk arrays or move them across hosts, but it is significantly more expensive, and with proper capacity planning that flexibility is not really needed.

The last option, and the one I prefer among the three, is internal and direct-attached storage. If I had to pick one configuration, I would go with a 2U server with 12 bays holding 1TB SATA disks, 60-80GB of RAM, and 16 processing cores. This provides a good balance of computing power and storage space, and more disks can be added easily through an external disk array connected over a SAS cable.

Reliability

Disks fail, and when they fail, the data on them is lost. RAID stores redundant data across multiple disks so that a single disk failure does not lose anything. RAID 5 is the most commonly used option. However, because disk sizes have grown exponentially while bit error rates have stayed at roughly the same level, there is a non-trivial chance of data loss in RAID 5. RAID 6 adds an extra parity disk to RAID 5 for higher reliability. Write speeds in both RAID 5 and RAID 6 are slow, though, which is not a good fit for what we do.

We use RAID 1+0, where all data is mirrored across two different disks. Storing everything twice means twice the cost, but it also provides the best reliability along with high performance.
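To make the trade-off concrete, here is a back-of-the-envelope comparison for the 12-bay, 1TB-disk configuration suggested above (a rough sketch; the unrecoverable-read-error rate used is a typical consumer-SATA spec sheet figure, not a measured value):

```python
# Rough numbers behind the RAID discussion, for a 12-bay server with
# 1TB SATA disks. URE_RATE is a typical consumer-SATA spec (one
# unrecoverable read error per 1e14 bits read), not a measured value.

DISKS, DISK_TB = 12, 1.0
URE_RATE = 1e-14

usable = {
    "RAID 5":   (DISKS - 1) * DISK_TB,   # one disk's worth of parity
    "RAID 6":   (DISKS - 2) * DISK_TB,   # two disks' worth of parity
    "RAID 1+0": DISKS * DISK_TB / 2,     # everything mirrored twice
}
for level, tb in sorted(usable.items()):
    print("%-8s usable capacity: %4.1f TB" % (level, tb))

# After one disk in a RAID 5 array dies, the rebuild must read every
# bit of the 11 surviving disks; a single URE during that read means
# data loss. With these numbers that works out to roughly 60%.
bits_read = (DISKS - 1) * DISK_TB * 1e12 * 8
p_failure = 1 - (1 - URE_RATE) ** bits_read
print("P(URE during RAID 5 rebuild): ~%.0f%%" % (p_failure * 100))
```

Half the raw capacity is a real cost, but a mirrored rebuild only has to re-read one disk, not eleven, which is why RAID 1+0 comes out ahead on both reliability and rebuild time.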

Storage Software

As our crawlers add more data every minute and our users analyze thousands of documents every second, the data storage layer is an important consideration. While we use a combination of solutions, from flat files and custom data structures to inverted indexes, key-value pairs, and relational databases, the bulk of our data is handled by MySQL.

For the most part, MySQL is used as a simple key-value store. Since a single MySQL instance can’t hold all of the billions of documents within Sysomos, we partition the data logically across several big, fat servers. Each server is maintained independently (as NDB Cluster does not really scale), using primarily the InnoDB storage engine. Inside each instance, we further partition the data logically into hundreds of different tables. This partitioning lets us keep adding new data without hitting a wall; a simplified sketch of the idea follows.
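The general shape looks roughly like this (an illustrative sketch only; the host names, date ranges, and table naming are made up, not our actual production layout):

```python
# Simplified sketch of logical (as opposed to hash-based) partitioning.
# Host names, date ranges, and table naming are illustrative only.

import datetime

# Each shard owns an explicit, human-readable slice of the data, so we
# always know exactly where a given month's documents live.
SHARD_MAP = [
    (datetime.date(2009, 1, 1), "db01.internal"),   # 2009 data
    (datetime.date(2010, 1, 1), "db02.internal"),   # 2010 data onward
]

def route(post_date):
    """Return (host, table) for a document posted on post_date."""
    host = SHARD_MAP[0][1]
    for start, candidate in SHARD_MAP:
        if post_date >= start:
            host = candidate
    # Hundreds of month-sized tables per instance keep each table small
    # enough that inserts stay fast as the total volume grows.
    table = "docs_%s" % post_date.strftime("%Y%m")
    return host, table

print(route(datetime.date(2010, 4, 15)))   # ('db02.internal', 'docs_201004')
```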

While MySQL is good enough for basic SELECT and INSERT operations, that is about all it can do. Even thinking about a JOIN or any other complicated operation can bring the server down. But as long as it is used as a key-value store, MySQL can handle a lot of data and provide for all our replication and backup needs.
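In practice that means schemas and queries no more complicated than this (a minimal sketch using the MySQLdb driver; the connection details, table, and column names are made up):

```python
# Minimal sketch of MySQL as a key-value store via the MySQLdb driver.
# Connection details, table, and column names are made up.
import MySQLdb

conn = MySQLdb.connect(host="db02.internal", user="app",
                       passwd="secret", db="docs")
cur = conn.cursor()

cur.execute("""CREATE TABLE IF NOT EXISTS docs_201004 (
                 doc_id VARCHAR(64) NOT NULL PRIMARY KEY,
                 body   MEDIUMBLOB
               ) ENGINE=InnoDB""")

# The only two operations we really ask of MySQL: put and get by key.
cur.execute("INSERT INTO docs_201004 (doc_id, body) VALUES (%s, %s)",
            ("blog-post-42", "serialized document bytes"))
conn.commit()

cur.execute("SELECT body FROM docs_201004 WHERE doc_id = %s",
            ("blog-post-42",))
print(cur.fetchone()[0])
```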

Key-Value Stores

A new generation of data stores is gaining popularity. The most notable are Apache’s HBase, Facebook’s Cassandra, LinkedIn’s Project Voldemort, and Baidu’s Hypertable. Each of them has a big-name backer and a lot of hype, and each is trying to do what Google does with BigTable.

But they will have to mature before they become useful for us. For example, when HBase crashes (and it does), it prints the most uninformative error logs. Hash-based partitioning is used for load balancing, which gives little visibility into where the data actually lives and is often worse than logical partitioning when it comes to latency. Each of these also has a very limited user base outside of its parent company (which also means poor documentation).
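The latency point is easy to see with a toy illustration (nothing here is specific to any of these systems; it is just the generic hash-placement scheme they share):

```python
# Toy illustration of why hash partitioning hurts locality: documents
# from the same day, which users usually query together, end up
# scattered across every node, so one query fans out to the whole
# cluster. Logical (date-range) partitioning would keep them together.
NODES = 4
day_docs = ["2010-04-15/doc%d" % i for i in range(8)]

placement = {}
for doc in day_docs:
    placement.setdefault(hash(doc) % NODES, []).append(doc)
print(placement)   # one day's documents spread over (almost) all 4 nodes
```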

Tokyo Tyrant is another option worth mentioning: it is simple, fast, and a good fit for specialized needs.
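Part of that simplicity is that ttserver speaks a memcached-compatible protocol, so an ordinary memcached client is enough to talk to it (a minimal sketch, assuming a ttserver instance running locally on its default port with the python-memcached library installed):

```python
# Minimal sketch of talking to Tokyo Tyrant through its
# memcached-compatible protocol. Assumes `ttserver` is running
# locally on its default port (1978).
import memcache

tt = memcache.Client(["127.0.0.1:1978"])
tt.set("doc:42", "serialized document bytes")
print(tt.get("doc:42"))   # -> 'serialized document bytes'
```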

In summary, it’s all about data. The more data we have, the more we can do with it (and the less we sleep). I will explore more topics, including real-time indexing, sentiment analysis, and load balancing, in my next posts, so stay tuned.
