Your System Is Defined By Data. Engineer Accordingly.
Data used to be the realm of scientists, researchers, and mathematicians in government institutions, universities or massive corporations (think banks). Even in the latter case, and still today, data is often buried under layers of infrastructure and protocols.
The internet and increased bandwidth/accessibility over the past decade gave birth to the democratization of data. Okay, democratization is a big word, but you get the point. Data is being generated everywhere in various forms and is being collected and analysed by anyone who desires to do so. Even governments are opening up their troves of data to the public with the goal of greater transparency and efficacy.
In order to build 24/7/365 web-scale applications on top of this data, we need to think about data in a disciplined way. The system does not stand alone; data is part of the system. Testing and measuring the system's response to different types of data and/or increased loads of data requires a framework that informs the engineering team.
To that end, here are eight parameters that should be accounted for when building systems that deal with high volume data. These are things to think about up and down the full stack, not just the “back-end” or the “front-end”.
1. Total data size. This is the amount of data the system reads and writes while operating. The proportion of data read to data written varies among systems. A site like NYTimes.com presumably performs many more reads (serving articles to each visitor) than writes, while an email application may need to handle more writes once spam is accounted for. Some systems also generate massive amounts of secondary output, such as log files, and the growth of these outputs should be taken into account.
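A back-of-the-envelope sizing sketch can make that secondary output visible early. Every constant below is a made-up assumption, not a measurement; substitute your own numbers:

```python
# Hypothetical storage projection that counts logs alongside primary data.
PRIMARY_WRITE_MB_PER_DAY = 500   # assumed application data written per day
LOG_AMPLIFICATION = 3.0          # assumed bytes of logs per byte of primary data
RETENTION_DAYS = 365

def projected_storage_gb(days=RETENTION_DAYS):
    """Primary plus secondary (log) storage after `days` of operation."""
    daily_mb = PRIMARY_WRITE_MB_PER_DAY * (1 + LOG_AMPLIFICATION)
    return days * daily_mb / 1024.0
```

Even at these modest assumed rates, logs dominate the footprint within a year, which is exactly the kind of surprise this parameter is meant to catch.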
2. Growth and change. How fast is data added to the system? Can older data be edited? Can data be removed? If so, how often do these types of transactions occur? Knowing how fast your data grows and changes is critical to understanding system performance and tackling scalability. Set up an internal testing component that simulates increasing data loads and run your system against this to see how it responds to different operations.
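One minimal sketch of such a testing component follows. The insert/edit/delete ratios are assumptions; tune them to match the transaction mix you actually observe:

```python
# Hypothetical load simulator: drives any dict-like store with a mix of
# inserts, edits, and deletes so you can watch how the system responds.
import random

def simulate_growth(store, n_ops, edit_ratio=0.2, delete_ratio=0.05, seed=None):
    rng = random.Random(seed)  # seedable so runs are repeatable
    next_id = 0
    for _ in range(n_ops):
        r = rng.random()
        if store and r < delete_ratio:
            store.pop(rng.choice(list(store)))          # remove old data
        elif store and r < delete_ratio + edit_ratio:
            store[rng.choice(list(store))] = "edited"   # change existing data
        else:
            store[next_id] = "new"                      # add fresh data
            next_id += 1
    return store
```

Point this at your real storage layer with increasing `n_ops` and measure latency and memory at each step.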
3. Temporal axis. Almost all data has some temporal nature, whether it’s a publication date, location timestamp, or time of last access. The system often needs to handle data differently depending on where the data lies along the time axis. For applications such as email and to an even greater extent on Twitter, older data is less frequently accessed. Conversely, users of Facebook very often look at pictures taken months or years ago. This is even more true with Timeline. Data must be stored with these access patterns in mind.
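A simple way to encode such access patterns is a recency-based tier router. The one-week hot window below is an arbitrary assumption; derive the real cutoff from your measured access patterns:

```python
# Recency-based storage tiering sketch.
import time

HOT_WINDOW_SECONDS = 7 * 24 * 3600  # assumed policy: one week counts as "hot"

def tier_for(last_access_ts, now=None):
    """Route a record to fast ("hot") or cheap ("cold") storage by recency."""
    now = time.time() if now is None else now
    return "hot" if now - last_access_ts < HOT_WINDOW_SECONDS else "cold"
```

An email archive might run this on time of last access, while a photo service like Facebook's would need a much longer hot window, per the access patterns above.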
4. Network vs Disk vs RAM vs CPU. I've listed these in a specific order, and you should know why. To borrow and mutate a common phrase: know thy system. One of these four components will be your bottleneck as the amount of data grows. A bottleneck at the CPU indicates that your algorithms and processing are slow (perhaps you're doing a lot of compression, or maybe you've nested one too many for loops). Conversely, you may have an "I/O bottleneck," where the system spends most of its time fetching data from disk or over the network. In that case, maybe you've missed an index on a database table, or maybe you need to think about distributing your data to increase read/write throughput. Think about whether you should be bringing the computation ("CPU") to the data rather than trying to fetch large amounts of data to a processing component.
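One rough way to tell these cases apart is to compare wall-clock time to CPU time for a call: if they are close, you are busy computing; if wall time dwarfs CPU time, the process is mostly waiting on disk or the network. A sketch, where the 0.5 threshold is an arbitrary assumption:

```python
# Rough bottleneck classifier for a single call.
import time

def profile(fn, *args):
    w0, c0 = time.perf_counter(), time.process_time()
    result = fn(*args)
    wall = time.perf_counter() - w0
    cpu = time.process_time() - c0
    # CPU time near wall time => computing; far below it => waiting on I/O.
    kind = "cpu-bound" if wall > 0 and cpu / wall > 0.5 else "io-bound"
    return result, kind
```

A real profiler gives far more detail, but even this crude split tells you whether to reach for better algorithms or for indexes and data distribution.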
5. Throughput. Throughput is defined by two numbers: request size and request rate. Request size is the amount of data the application reads or writes in a given call or user transaction. Request rate is how many transactions occur each second or each minute, and it is driven by the number of concurrent users and the distribution of activity levels across those users (from power users to casual users).
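Those two numbers multiply into a bandwidth requirement. A minimal sketch; all inputs in the example are illustrative assumptions:

```python
# Throughput as the product of its two defining numbers.
def peak_request_rate(concurrent_users, ops_per_user_per_min):
    """Requests per second implied by concurrency and activity level."""
    return concurrent_users * ops_per_user_per_min / 60.0

def required_throughput_mb_s(request_size_kb, requests_per_sec):
    """Bandwidth the system must sustain, in MB/s."""
    return request_size_kb * requests_per_sec / 1024.0
```

For example, 6,000 concurrent users each averaging 2 operations per minute is 200 requests/second; at 64 KB per request, the system must sustain 12.5 MB/s.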
6. Consistency. This becomes more of an issue when dealing with distributed data. One machine or storage unit may hold the most up-to-date data after a write, but the other storage units need to be updated as well. Different storage systems tackle this problem in different ways. MySQL Cluster uses two-phase commit, which has long been in use but gives up some availability (MySQL Cluster assumes some level of redundancy). The recent NoSQL movement sacrifices consistency for availability. The strategy used depends heavily on the application and its use cases. At an ATM, you expect your checking account to update immediately after withdrawing cash. In other cases, such as updating your Twitter profile, the consistency requirement is not as strong (although it does make for a frustrating experience when you still see your old profile several minutes after editing it).
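To make the prepare/commit shape concrete, here is a toy two-phase-commit sketch. This is not MySQL Cluster's actual implementation, just the voting pattern: nothing commits anywhere until every replica has voted yes:

```python
# Toy two-phase commit over in-memory "replicas".
class Replica:
    def __init__(self):
        self.committed, self.staged = {}, {}

    def prepare(self, key, value):
        self.staged[key] = value
        return True  # a real node could vote "no" here (disk full, conflict...)

    def commit(self, key):
        self.committed[key] = self.staged.pop(key)

    def abort(self, key):
        self.staged.pop(key, None)

def two_phase_write(replicas, key, value):
    # Phase 1: every replica must stage the write and vote yes.
    if all(r.prepare(key, value) for r in replicas):
        # Phase 2: only then does everyone make it durable.
        for r in replicas:
            r.commit(key)
        return True
    for r in replicas:
        r.abort(key)
    return False
```

The availability cost is visible even in the toy: a single unreachable replica blocks the whole write, which is the trade the NoSQL systems mentioned above refuse to make.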
7. Latency. How fast does the application need to respond to a user's action? When performing a friend search on Facebook, I expect the friend to be returned within a second or two. When buying tickets to a show online, however, I would expect the transaction to take a few seconds longer; my tolerance for waiting is higher there. Your application should optimize for frequently accessed data, either by indexing it or caching it.
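For repeated lookups, even Python's built-in `functools.lru_cache` illustrates the caching half of that advice; the expensive query is faked here:

```python
# Caching sketch for frequently accessed data.
from functools import lru_cache

@lru_cache(maxsize=1024)
def friend_search(name):
    # Imagine a multi-millisecond database query here; repeat searches for
    # the same name are then served from memory, within the latency budget.
    return name.lower()
```

`friend_search.cache_info()` reports hits and misses, so you can verify that the hot queries really are being absorbed by the cache.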
8. Physical and logical proximity. Data that is requested together should be stored together. When I visit a friend’s Facebook page, all of her pictures need not be loaded as well. Information that appears on a friend’s profile page such as name, profile pic, age, location, statuses should be stored close to each other so that these bits can be retrieved more efficiently. Sharding is one strategy used to keep commonly accessed pieces of data close together while simultaneously using the advantages of a distributed system.
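A common sharding sketch hashes a stable key, such as a user ID, so that all of one profile's fields land on the same shard; the four-shard cluster size below is an assumption:

```python
# Hash-based shard routing sketch.
import hashlib

N_SHARDS = 4  # assumed cluster size

def shard_for(user_id):
    """Map a user to a shard. Storing name, profile pic, age, location, and
    statuses under the same key means data requested together lives together."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % N_SHARDS
```

Note that a plain modulo scheme reshuffles nearly every key when `N_SHARDS` changes; consistent hashing is the usual fix once the cluster needs to grow.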
“Big data” in and of itself is not useful. Bryce from OATV wrote a great post about “big data” and its shortcomings. However, it’s still up to us to build the systems that can deal with this type of data, and it’s up to us to then figure out what data is useful and what isn’t. To be hamstrung by your own system is not a place you want to be.
Engineer and test with discipline and you will be headed in the right direction.