Your Laptop Can Now Analyze Big Data
Computer scientists from Carnegie Mellon University have devised a framework for running large-scale computations for tasks such as social network or Web search analysis efficiently on a single personal computer.
The software could help developers working on many modern tasks: for example, designing a new recommendation engine using social network connections. In order to make effective recommendations—"your friends liked this movie, so here is another movie that you haven't seen yet, but you will probably like"—the software has to be able to analyze the connections between the members of a social network. This type of task is called graph computation, and it is increasingly common. But working with large-scale data sets (such as online social networks) usually requires the processing horsepower of many computers clustered together, such as those offered by Amazon's cloud-based EC2 service.
The new software, called GraphChi, exploits the capacious hard drives that are becoming ever more common in personal computers. A graph would normally be stored in temporary memory (RAM) for analysis. With GraphChi, the hard drive performs this task instead.
"PCs don't have enough RAM to hold an entire Web graph, but they do have hard drives, which can hold a lot of information," says Carlos Guestrin, codirector of Carnegie Mellon's Select Lab, where GraphChi was developed. But hard drives are slow compared to RAM for reading and writing data, which tends to slow down computation. So Guestrin's student Aapo Kyrola designed a faster, less random method of accessing the hard drive.
According to Guestrin, a Mac Mini running GraphChi can analyze Twitter's social graph from 2010—which contains 40 million users and 1.2 billion connections—in 59 minutes. "The previous published result on this problem took 400 minutes using a cluster of about 1,000 computers," Guestrin says.
As technology gets more networked, and data sets get larger, graph computation is becoming more and more relevant in many domains, says David A. Bader, a graph computation expert at Georgia Tech. "Trying to understand how the human brain works or trying to make sense of medical patient records involve graph computing," he says.
Bader says GraphChi can't yet handle "streaming graphs," which more accurately model large networks by showing how relationships change over time. Bader and others at Georgia Tech have created a graph computation framework, called Stinger, that's optimized for supercomputers working with massive streaming graphs.
"The scales of these problems will obviously keep growing," says Guestrin, whose team is already working on ways to make GraphChi ready for streaming graphs. But he says GraphChi is capable of effectively handling many large-scale graph-computing problems without resorting to cloud-based solutions or supercomputers.
"A researcher in computational biology could do large-scale computations on their PC; a developer working on a data-center algorithm can test it on their laptop before pushing it to the cloud," Guestrin says. "Big data is everywhere now, but some big data isn't as big as it once was, relatively speaking. Tools like GraphChi will let many companies and startups solve all their graph-computing needs on a single machine. It's cost effective, and it drives innovation, too."