‘Advice to Marketers: Size Does Matter When it Comes to Data’, by Robin Verlangen, Data Architect, FlxOne
In the 1990s, a typical database consisted of just a couple of megabytes. By 2000 this had significantly increased, and one decade later we’re now storing petabytes of data. What caused this rapid growth and which types of software can help us manage these mountains of data?
The rapid growth of disk space
When I first started developing websites, a typical hosting package had only 25 megabytes of storage. Of course that was intended for personal websites – not large enterprises – but things changed very quickly.
Today, you can store several gigabytes for less than what it used to cost for a few megabytes. This is thanks to advances in technology, stiff competition and most importantly, cheap disks with multiple terabytes of storage. The going rate is now only about 0.05 USD per gigabyte.
The trend of rapidly increasing space and decreasing prices caused people to store more and more data. People tend to be lazy – why bother deleting something if you can just leave it there? You can compare a hard disk to your attic or basement – as long as there is space available, you’ll just keep filling it up with stuff!
Even major players like Yahoo! and Google ran into problems dealing with their ever-growing data. Entire data sets could no longer be managed by a single server. Buying more, or larger, servers increased operating costs exponentially. So tech companies started exploring software and programming-based solutions to this problem.
The rise of NoSQL
One of these solutions was NoSQL. Today, many people throw around terms like Big Data or NoSQL without fully understanding what they mean. That’s the primary reason why some people consider this new breed of technology just hype, like the terms “Web 2.0” and “Cloud”, but NoSQL can offer a variety of concrete benefits.
Let’s start by clearing up a common misconception: NoSQL does not mean “no SQL”, but is more like “not only SQL”. In 1998, one of the first NoSQL developers coined a much more accurate term – “NoREL” – referring to non-relational. Though rarely used, it does a much better job of describing what modern databases like Cassandra, MongoDB and Redis do: store data in a non-relational way.
Storing data in a non-relational way allows developers to make significant gains in areas like performance, consistency and distributed systems. This is something to consider when picking a database for your application. However, every database has its own pros and cons and (My)SQL should certainly not be forgotten. For example, it could still be the most suitable solution for account information, even for the world’s largest websites. You should keep in mind that SQL has a very rich set of features that most NoSQL databases can’t compete with. Cloud providers, like Amazon, offer managed SQL solutions that take away the pain of scaling and offer very low entry costs.
NoSQL starts to make sense when you’re developing an application that might considerably grow in scale. Facebook and Twitter started out on MySQL and are still running large parts of their business on it. They continue to make it work by using smart, advanced ways of scaling MySQL. Techniques like sharding, replication and advanced caching layers provide the “solution”. Implementing modern NoSQL solutions would be a preferable alternative. In fact, Twitter and Facebook are currently migrating, where possible, but that’s more of an afterthought.
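To make the sharding technique mentioned above concrete, here is a minimal sketch of hash-based shard routing. The shard names and the `shard_for` helper are hypothetical and only illustrate the idea; this is not how Facebook or Twitter actually route their data.

```python
import hashlib

# Hypothetical shard names; in a real setup each would map to a MySQL host.
SHARDS = ["users_db_0", "users_db_1", "users_db_2", "users_db_3"]


def shard_for(user_id: str, shards=SHARDS) -> str:
    """Pick a shard by hashing the key, so a given user always maps to the same database."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


if __name__ == "__main__":
    for uid in ["alice", "bob", "carol"]:
        print(uid, "->", shard_for(uid))
```

One known drawback of this simple modulo scheme is that changing the number of shards remaps most keys, which is part of why scaling a sharded SQL setup takes real effort.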
When developing new products, companies should take scalability into account from day one. First, you need to determine whether the application requires massive scalability. There are many use cases that do, but even more that don’t. Does a local grocery require a highly scalable solution? Probably not. Does a simple app-based game (like WordFeud or Words with Friends)? Yes, it just might. These days, a simple application that starts with a handful of users has the ability to gain global publicity.
The reason you should not put everything exclusively into a SQL or NoSQL database is actually quite simple. The two serve different needs and require different skills to develop applications on top of them. Most solutions I work on, both at FlxOne and privately, use a balanced mix of SQL and one or more NoSQL solutions. This causes certain developers to become a “jack-of-all-trades”, who can function in a variety of languages.
One way of reducing the specific skills required is to implement an abstraction layer that offers a transparent and easy-to-understand interface, but uses the whole (No)SQL mix underneath. This allows you to save costs on “multilingual” developers, who are a very rare breed.
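Such an abstraction layer could look roughly like the sketch below. Everything in it is an assumption made for illustration: the `Backend` interface, the `DataStore` facade and its method names are hypothetical, an in-memory SQLite database stands in for a (My)SQL server, and a plain dictionary stands in for a NoSQL key-value store.

```python
import sqlite3
from typing import Dict, Optional, Protocol


class Backend(Protocol):
    """What every storage backend must offer, whatever engine sits behind it."""

    def put(self, key: str, value: str) -> None: ...
    def get(self, key: str) -> Optional[str]: ...


class SqlBackend:
    """Relational backend; in-memory SQLite stands in for a (My)SQL server."""

    def __init__(self) -> None:
        self._conn = sqlite3.connect(":memory:")
        self._conn.execute("CREATE TABLE kv (k TEXT PRIMARY KEY, v TEXT NOT NULL)")

    def put(self, key: str, value: str) -> None:
        self._conn.execute("INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)", (key, value))
        self._conn.commit()

    def get(self, key: str) -> Optional[str]:
        row = self._conn.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return row[0] if row else None


class DictBackend:
    """A plain dict stands in for a NoSQL key-value store."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class DataStore:
    """Facade: callers never see which engine holds which kind of data."""

    def __init__(self, accounts: Backend, activity: Backend) -> None:
        self._accounts = accounts  # account information stays relational
        self._activity = activity  # high-volume activity data goes to the key-value store

    def save_email(self, user_id: str, email: str) -> None:
        self._accounts.put(user_id, email)

    def load_email(self, user_id: str) -> Optional[str]:
        return self._accounts.get(user_id)

    def save_last_seen(self, user_id: str, timestamp: str) -> None:
        self._activity.put(user_id, timestamp)

    def load_last_seen(self, user_id: str) -> Optional[str]:
        return self._activity.get(user_id)
```

A developer using `DataStore(SqlBackend(), DictBackend())` only needs to know the four facade methods, not the query language of either engine.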
As data volumes continue to grow, even NoSQL databases may not provide the best solution. The term data warehouse is nothing new, but the storage capabilities are. Hadoop, which originated at Yahoo!, is one of the modern platforms that supports both distributed storage and processing.
Data stored in Hadoop is not used on-the-fly (like a MySQL query), but is processed in large batches. The theory is simple, yet effective: one massive amount of data is split into small chunks that are processed in parallel on a large number of commodity hardware servers.
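The split-and-process-in-parallel idea can be sketched in a few lines. This is a toy word count, not Hadoop's actual API: the function names are hypothetical, and a local thread pool stands in for a cluster of commodity servers.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, chunk_size):
    """Cut one large list of records into small, independent chunks."""
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def count_words(chunk):
    """Process a single chunk on its own: count the words it contains."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def merge_counts(partials):
    """Combine the partial results of all chunks into one final answer."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def word_count(records, chunk_size=2, workers=4):
    chunks = split_into_chunks(records, chunk_size)
    # Each worker thread plays the role of one server in the cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(count_words, chunks))
    return merge_counts(partials)


if __name__ == "__main__":
    print(word_count(["big data", "big servers", "small chunks", "big chunks"]))
```

Because each chunk is processed without knowledge of the others, adding more workers (or servers) is what shortens the total run time.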
This allows Hadoop to complete tasks much more quickly than a very expensive single server would ever be capable of. Not to mention the fact that these top-of-the-line machines are completely unaffordable for most! Note: commodity hardware does not mean “(old) desktop computers”. It sounds great, doesn’t it? But the problem is that Hadoop provides no SQL interface. This makes it difficult to find people who are capable of using it, and nearly impossible to find administrators.
Luckily, there are tools that make life easier. Companies like Cloudera and MapR provide solutions that let you administer Hadoop clusters with a couple of clicks from a web-based interface. These are easy to get up and running, yet still difficult to get right. There are even tools like Hive that offer an SQL-like interface on top of Hadoop. The great advantage here is the ODBC-compliant connectors, meaning you can easily hook them up to a variety of different software packages. A personal favorite is Tableau, used for visualising petabytes of data from a regular desktop computer.
One common complaint about Hadoop is that it’s slow, when in fact the delay comes from the overhead incurred by every job (roughly the equivalent of a query). You’re not able to fetch objects in milliseconds – it takes somewhere between 5 and 30 seconds before a job actually starts processing. This makes it useless for most tasks that require direct interaction.
Hadoop was not, however, designed for instant results. It was specifically developed for large-batch processing jobs that run for hours. This is where the previously discussed NoSQL databases come in. For example, a website like TripAdvisor that offers recommendations would probably want to leverage as much relevant data as possible. The amount of data involved can grow to petabyte-scale. Such a job could run for hours and hours, but the visible results are only a fraction of that original data set. The complete data set can be stored in a NoSQL database that scales out, but still offers perfect ad hoc query performance.
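The pattern described above, a slow batch job whose small output is loaded into a fast store for lookups, can be sketched as follows. The ratings, the hotel names and both function names are made up for illustration, and a plain dictionary stands in for the NoSQL database.

```python
from collections import defaultdict

# Hypothetical raw data: (user, hotel, rating). In practice this would be far larger.
RATINGS = [
    ("alice", "Hotel A", 5),
    ("alice", "Hotel B", 3),
    ("bob", "Hotel A", 4),
    ("bob", "Hotel C", 5),
    ("carol", "Hotel C", 2),
]


def run_batch_job(ratings, top_n=2):
    """Batch step: scan every rating once and keep only the best-rated hotels."""
    per_hotel = defaultdict(list)
    for _user, hotel, rating in ratings:
        per_hotel[hotel].append(rating)
    averages = {hotel: sum(values) / len(values) for hotel, values in per_hotel.items()}
    # Highest average first; ties are broken alphabetically.
    ranked = sorted(averages, key=lambda hotel: (-averages[hotel], hotel))
    return {"top_hotels": ranked[:top_n]}


def lookup(store, key):
    """Serving step: a direct key lookup, with no scan of the raw data."""
    return store.get(key)


if __name__ == "__main__":
    serving_store = run_batch_job(RATINGS)  # the dict stands in for a NoSQL database
    print(lookup(serving_store, "top_hotels"))
```

The expensive scan happens once per batch run, while every page view only pays for the cheap lookup.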
The process of running a large job, and storing the outputs in a NoSQL database, is common. However, there are a number of new developments in the works dedicated to processing large data sets in real time, often achieved by keeping frequently used data in memory. Google wrote a paper about Dremel back in 2010, which resulted in the development of Apache Drill, and more recently Impala. HBase is another more commonly used example. This is a distributed, column-oriented database on top of the Hadoop Distributed File System (HDFS) that relies heavily on in-memory caching.
All of the above options tend to tackle the same problem: running ad hoc queries on top of distributed data, commonly stored in HDFS. They also take a similar approach, leveraging memory, parallel processing and columnar storage.
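To show why columnar storage helps with ad hoc queries, here is a small sketch that converts row-oriented records into a column-oriented layout. The sample data and function names are hypothetical, and the sketch assumes every row has the same fields; it illustrates the layout only, not the internals of any of the tools named above.

```python
# Hypothetical sample data in a row-oriented layout: one record per user.
ROWS = [
    {"user": "alice", "country": "NL", "clicks": 12},
    {"user": "bob", "country": "US", "clicks": 7},
    {"user": "carol", "country": "NL", "clicks": 5},
]


def to_columns(rows):
    """Turn a list of records into one list of values per field."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def total_clicks(columns):
    """Touches only the 'clicks' column; the other fields are never read."""
    return sum(columns["clicks"])


def clicks_by_country(columns):
    """Touches two columns and aggregates them position by position."""
    totals = {}
    for country, clicks in zip(columns["country"], columns["clicks"]):
        totals[country] = totals.get(country, 0) + clicks
    return totals


if __name__ == "__main__":
    cols = to_columns(ROWS)
    print(total_clicks(cols))
    print(clicks_by_country(cols))
```

With a row layout, the same totals would require reading every field of every record, which is the cost columnar engines try to avoid on wide tables.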
This software is still quite young and not yet production-ready, in my opinion; but as time passes these options will become more and more relevant. You should keep in mind that in most cases real-time is relative. Carefully consider whether you need real-time processing, and if you do, what does real-time mean to you? Is it less than 10ms, 5 seconds, 10 minutes or even 12 hours? The most pragmatic approach is probably the best one.
Data to information
Now that we know what big data is, how to store it and how to process it, we should be able to do something with it, right? Unfortunately, most companies still struggle to do just that. It’s very challenging to turn the results into actionable, or even practical, information. To really get to the heart of all your information, and what it means, you need to go beyond just data sets and query strings. This is where the data visualisation engineer comes in. At FlxOne, we say they turn big data results into visual insights. Good analytics and reporting should show people what’s going wrong – and even more importantly – what causes it and how to solve it. Regardless of how you store and process your data, this visual layer is often the essential final step to translating petabytes of information into a few simple charts and truly actionable insight. This is the main reason why (online) marketers should look into this, and leverage every bit of data out there.