Wattpad is a website that brings together authors and readers. First-time authors, or for that matter, many-time authors, can post their stories to the site and gather critical feedback that may help in the revision process. Readers can follow favorite authors and access more than 10 million stories at no charge using a web browser or mobile device.
Wattpad has been endorsed by noted Canadian author Margaret Atwood and has grown to 10 million visitors a month. Each second it fields 6,000 requests for content or requests to post comments. That leaves little room for kinks in the way the site operates. Wattpad's owners want visitors to find what they're looking for -- many have favorite authors -- and to get their requested stories on their devices quickly.
Wattpad has run on Amazon Web Services infrastructure from its inception, but its operators have found it hard to gain as much visibility as they'd like into their website applications. Amazon's CloudWatch provides basic feedback, which can be amped up by adding for-a-fee reporting metrics. Wattpad has used CloudWatch, but has supplemented it with three independent monitoring systems: New Relic, Datadog, and Boundary.
In the second half of 2013, Charles Chan, head of engineering at Wattpad, wanted to know what was wrong when the site's search engine, Elasticsearch, showed signs of slowing down. Wattpad visitors may be absent for a few days or weeks, and when they return they want a quick compilation of any new postings by their favorite authors. Elasticsearch is a young search engine based on Lucene open-source code, designed to index frequently updated documents. Its first stable release, version 1.0, came out on February 12, 2014, so Chan carefully monitored the earlier versions used through 2013.
In the latter half of 2013, Wattpad's monitoring system showed many user requests flowing into Elasticsearch, but significantly less information than expected coming out. The search engine slowdown would have been hard to detect with AWS's CloudWatch monitoring service, which lacks a metric that reflects Elasticsearch's operation. But independent monitoring service Boundary, which detects traffic levels between the nodes of a system running on Amazon, helped Chan identify the problem.
Boundary's dashboard alerted Chan that Elasticsearch output was slowing down: it showed that the amount of data coming out of the Elasticsearch node was disproportionately small compared with the traffic going in. Chan corrected the problem by reconfiguring Elasticsearch to better fit Wattpad's usage patterns.
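The kind of check described here, comparing what flows into a node with what flows out of it, can be sketched in a few lines of Python. This is only an illustration of the idea, not Boundary's API or Wattpad's tooling: the `NodeTraffic` record, the `find_imbalanced` function, the per-node expected ratios, and the 50 percent tolerance are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class NodeTraffic:
    """One traffic sample for a node (hypothetical record, not a Boundary type)."""
    node: str
    bytes_in: int
    bytes_out: int


def egress_ratio(sample: NodeTraffic) -> float:
    """Bytes sent per byte received; 0.0 if nothing was received."""
    if sample.bytes_in == 0:
        return 0.0
    return sample.bytes_out / sample.bytes_in


def find_imbalanced(samples, expected_ratio, tolerance=0.5):
    """Return the nodes whose out/in ratio strays from its expected value.

    expected_ratio maps a node name to the ratio seen during normal
    operation (assumed to be known from earlier measurements).
    A node is flagged when its ratio differs from the expected ratio
    by more than `tolerance` times that expected ratio.
    """
    flagged = []
    for sample in samples:
        expected = expected_ratio.get(sample.node)
        if expected is None:
            continue  # no baseline for this node, so nothing to compare
        if abs(egress_ratio(sample) - expected) > tolerance * expected:
            flagged.append(sample.node)
    return flagged


# Made-up numbers: the search node normally returns about 10 bytes per
# byte received, but in this sample it returns only 2.
samples = [
    NodeTraffic("elasticsearch", bytes_in=1000, bytes_out=2000),
    NodeTraffic("web", bytes_in=1000, bytes_out=5000),
]
baseline = {"elasticsearch": 10.0, "web": 5.0}
flagged_nodes = find_imbalanced(samples, baseline)
```

With these invented figures only the search node is flagged, since its output has fallen well below what its input would normally produce.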
Chan said Wattpad has used Amazon's EC2 from its start, but it added Boundary six months ago to its New Relic application performance monitoring and its Datadog system, which compiles monitoring data into a unified display. He doesn't rely on Amazon's CloudWatch much anymore. "It doesn't give the same level of insights" as Boundary and New Relic, he said.
New Relic APM is useful in spotting application slowdowns, such as a hung application waiting for results from a satellite database system, and other potential trouble points. Boundary brings something else to the party: an ability to see what network traffic is feeding into those applications and the traffic coming out. Unlike traditional systems management, which tells you whether your servers and network switches are operating normally, Boundary watches the network segments between the nodes.
"Not all issues will be manifested at the application level," said Chan in an interview. He uses Boundary's overview of network bandwidth use and network traffic to spot potential trouble points. In the lead-up to its busy holiday season, Wattpad tested its website by firing artificial demand at it and observing, through Boundary, where traffic flowed smoothly and where it began to back up.
One trouble spot was the open-source Memcached caching system, which supplies frequently used data to servers. When fielding a request from an application, Memcached was returning more data than the application could use, chewing up network bandwidth. "Unnecessary data was being sent" because the cache had been configured to return far more data than the incoming requests called for, said Chan. His staff was able to correct the problem before it led to any peak-traffic slowdowns.
Chan said Boundary was "easy to set up in a matter of hours." It places its own sensing agent on hardware devices, which automates the reporting of traffic to the central Boundary system. Any new agent reporting in prompts Boundary to add another device to its network topology map. The lines on the map illustrate which node is talking to which.
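As a rough illustration of how a topology map can grow out of agent reports, the Python sketch below folds a stream of flow records into a table of which node talks to which. The report format, a `(source, destination, byte_count)` tuple, and the `build_topology` function are assumptions made for the example; the article does not describe Boundary's actual data model.

```python
from collections import defaultdict


def build_topology(flow_reports):
    """Fold (source, destination, byte_count) reports into a traffic map.

    Returns a dict mapping each node to a dict of the peers it sent
    data to and the total bytes sent. A node that has only received
    traffic still appears in the map, with no outgoing peers.
    """
    topology = defaultdict(lambda: defaultdict(int))
    for source, destination, byte_count in flow_reports:
        topology[source][destination] += byte_count
        topology[destination]  # touching the key registers the node
    return {node: dict(peers) for node, peers in topology.items()}


# Made-up reports from three hypothetical nodes.
reports = [
    ("web-1", "memcache-1", 500),
    ("web-1", "memcache-1", 300),
    ("web-1", "search-1", 200),
]
topology_map = build_topology(reports)
```

Each key in the result corresponds to a node on the map and each nested entry to a line between two nodes, weighted by the traffic seen on it.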
Chan has no complaints about Amazon as a cloud service provider and said CloudWatch served Wattpad's purpose initially. But now that Wattpad has reached 10 million visitors a month, he needs a more complete view of what's going on in his cloud infrastructure.
Asked where he'd be without his monitoring combination of New Relic, Datadog, and Boundary, he said, "We'd have to do some network sniffing on our own" to try to determine network traffic. "We wouldn't have much visibility without the network traffic element. With those three, they'll be able to carry us quite far."