The Complete Noisy Data Arc

Note: The original Noisy Data Arc consisted of 12 posts. The complete arc is contained in this single post.

Standards and Noisy Data, Part 1

I’ve written before about my involvement in the Web Analytics Association (WAA) as Marketing Lead for the Research Committee, and as Marketing Lead and Industry Liaison for the Standards Committee [[(everybody relax. This was written back in Jan ’07, remember?)]]. An element of my role with the Standards Committee allows me to listen to discussions in much the same way I did as a child sitting around my grandmother’s table. There was lots of talk that was, and is, over my head, and because I sit quietly and listen, I learn.

Sometimes what I learn is ways to integrate NextStage’s research into what the WAA and its committees are doing. In this case, standards and the effect noisy data has on creating them. I brought up the topic of standards in The Long Tail, Part 1 and feel there’s enough research to get back to it now.

First, the type of noisy data I’m discussing isn’t the traditional “signal to noise” concept. Jim Humphrys, Chair of the WAA Research Committee, acknowledges that signal to noise problems will be a growing concern as more and more RIAs come online. He recently wrote me, “I have some data I put into control charts to separate the signal from the natural variability.” This is a strong indication that Jim and the WAA are aware the traditional problem exists.
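To make the control-chart idea concrete, here is a minimal sketch of the sort of check Jim describes: estimate the natural variability from a baseline period, then flag anything outside it as signal. The Python below and its visit counts are invented for illustration, not Jim's actual charts.

```python
from statistics import mean, stdev

# Baseline period used to estimate the site's natural variability (invented numbers).
baseline_visits = [1180, 1210, 1195, 1240, 1170, 1205, 1188, 1222, 1201, 1215]
center = mean(baseline_visits)
sigma = stdev(baseline_visits)
upper, lower = center + 3 * sigma, center - 3 * sigma   # classic 3-sigma control limits

# New observations are checked against the baseline limits.
new_days = {"Mon": 1198, "Tue": 1231, "Wed": 1950, "Thu": 1187}
for day, visits in new_days.items():
    status = "signal" if not (lower <= visits <= upper) else "natural variability"
    print(f"{day}: {visits} -> {status}")
```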

Noisy data, as I’m using the term here, is a concept used in many branches of science (neuroscience, climatology, astronomy, … you name it, it’s got some noisy data inside it): it is perfectly valid data, just not necessarily valid for the purpose you’re using it for.

An example from climatology is using tree rings to determine regional temperatures. Tree rings actually measure growth, and growth is related to temperature, so you can use tree rings to provide a rough, not exact, map of temperature variations in a given area. The challenge to using noisy data accurately (oxymoron warning, that) is correctly separating the wheat from the chaff, or in this case the noise from the data. Another example of noisy data is using light pulses inside deeply buried water tanks to detect neutrinos flying through the earth. There’s only a slightly greater chance that a flash of light was caused by a neutrino hitting a water molecule than by normal subatomic decay. Now that’s truly noisy data.
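For the curious, here is a toy version of the tree-ring idea (every number is invented): fit a rough relationship between ring width and temperature in years where both are known, then use it to estimate temperature in a year where only the ring survives. Noisy proxy, rough answer.

```python
# Years where both ring width (mm) and temperature (C) are known; all values invented.
ring_width_mm = [1.1, 1.4, 1.8, 2.0, 2.3]
known_temp_c  = [11.8, 12.6, 13.5, 13.9, 14.8]

# Ordinary least-squares fit of temp ~ a * width + b.
n = len(ring_width_mm)
mean_w = sum(ring_width_mm) / n
mean_t = sum(known_temp_c) / n
cov = sum((w - mean_w) * (t - mean_t) for w, t in zip(ring_width_mm, known_temp_c))
var = sum((w - mean_w) ** 2 for w in ring_width_mm)
a = cov / var
b = mean_t - a * mean_w

# Estimate the temperature for a year where only the ring survives.
unknown_ring = 1.6
print(f"estimated temperature: {a * unknown_ring + b:.1f} C (rough, not exact)")
```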

Standards and Noisy Data, Part 2

This question of noisy data is becoming more and more relevant as new web technologies emerge. Here I’m thinking of what I posted in What’s a PageView (Ajax)? and that post’s ending comment, “After all, isn’t a human what you really want viewing your page?”

The seminal question (to me) of noisy data occurs when going from one set of analytics tools to another (“tag-based” to or from “log-based”, for example).

Time on site, for example, was explained to me a while ago by Angie Brown, Strategic Services Consultant for Coremetrics, as follows:

“I’d point you to Eric Peterson’s Web Analytics Demystified (page 150) where he discusses “Average Time Spent on Site”. What he’s describing is what we call “minutes per visitor” in the SurfAid tool: the average number of minutes each visitor spends on the site over a certain time frame. It’s used as a rough measure of interaction with the site, although the numbers are not precise (we can’t measure how long was the last page view in any visit since the duration of one visit is simply the last timestamp minus the first one). It’s not a given that increasing this metric is good: for a customer support site or intranet we might actually prefer a decrease (get them the information they need in as little time as possible).”

Obviously Angie and others know that Ajax has changed this metric because now last page measurements are doable (NextStage does this all the time now. If we can do them, others can do them, I’m sure).
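As a sketch of the arithmetic involved (timestamps invented, with "beacon" standing in for whatever event-driven call a page might send), here is the classic last-minus-first calculation next to one that also counts an Ajax event fired from the final page:

```python
from datetime import datetime

# One visit's events; "beacon" stands in for any event-driven (Ajax) call
# the last page might send. All timestamps are invented.
visit = [
    ("pageview", datetime(2007, 3, 1, 10, 0, 0)),
    ("pageview", datetime(2007, 3, 1, 10, 2, 30)),
    ("pageview", datetime(2007, 3, 1, 10, 6, 10)),   # last pageview of the visit
    ("beacon",   datetime(2007, 3, 1, 10, 9, 45)),   # event fired from the last page
]

pageviews = [ts for kind, ts in visit if kind == "pageview"]
all_events = [ts for _, ts in visit]

# Classic tag-based duration: last pageview minus first pageview.
# Time spent on the final page is invisible to this calculation.
classic = (max(pageviews) - min(pageviews)).total_seconds()

# With event-driven data, the last known activity extends the visit.
with_events = (max(all_events) - min(all_events)).total_seconds()

print(f"classic time on site:   {classic:.0f} seconds")      # 370
print(f"with event-driven data: {with_events:.0f} seconds")  # 585
```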

Currently many discussions are going on about defining new metrics (or adding new elements to the old metric definitions) in light of 2.0, RIA, Rich Media, take your pick of buzzwords to enter here. It is at this point that the anthropologist in me kicks in: does something exist because we define it, or do we define it because it exists? This gets into the area of differentiating behaviors from actions or, to use a more concrete example, “I’m typing at my keyboard right now, but the fact that I’m typing right now is the act expressing the internal state (psychological behavior). When you type at your keyboard, are you expressing the same internal state that I am right now?”

Standards and Noisy Data, Part 3

I was talking with Frank Faubert, Vice President, Internet Marketing Solutions for Unica, and several other Unicans at this past Wednesday’s WAW in Boston (Waltham, really. Come on over and join us. Or join one more local to you) about the subject of noisy data, and they agreed it exists and may get worse in Web 2.0. One way Unica handles the issue is by creating metrics using both tag and log information. In any case, Ajax can be used to send event-driven information directly back to the log, and that picks up our 2.0 interactions, etc.
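Unica's actual implementation aside, the general shape of "use both sources" is simple enough to sketch. The session ids, field names and events below are invented; the point is only that tag-based pageviews and log-based Ajax events get stitched together per session before any metric is computed.

```python
from collections import defaultdict

# Tag-based records (from the page tag) and log records (which can include
# Ajax event calls) for the same session. All ids, pages and events invented.
tag_records = [
    {"session": "s1", "type": "pageview", "page": "/home"},
    {"session": "s1", "type": "pageview", "page": "/products"},
]
log_records = [
    {"session": "s1", "type": "ajax_event", "page": "/products", "event": "tab_change"},
    {"session": "s1", "type": "ajax_event", "page": "/products", "event": "add_to_cart"},
]

# Stitch both sources together by session before computing anything.
combined = defaultdict(list)
for record in tag_records + log_records:
    combined[record["session"]].append(record)

for session, events in combined.items():
    pageviews = sum(1 for e in events if e["type"] == "pageview")
    in_page = sum(1 for e in events if e["type"] == "ajax_event")
    print(f"{session}: {pageviews} pageviews, {in_page} in-page (Ajax) events")
```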

One of the challenges I’ve always had with analytics is that they deal with what’s happening at the machine, the computer, and not in the heart and mind of the person sitting at the computer. I’m not discrediting any web analytics provider or any web analytics package. In fact, NextStage and our technology suite are web analytics provider agnostic. We work with them all equally well, and I’ve repeatedly written and said that NextStage doesn’t do web analytics.

That offered, figuring out what’s going on with the person sitting at the computer is one of the reasons NextStage exists; just as there are more things in heaven and earth than are dreamt of in your philosophy, Horatio, there’s more that happens between the user and the computer than ever happens between the browser and the server. And when you take the user’s psyche into account? The numbers become truly astronomical.

This returns us to differentiating behaviors from actions or “the reason I’m typing is definitely different from the reason you’re typing” and how noisy data is going to be shaping things.

Standards and Noisy Data, Part 4

Note: This entry borrows heavily from a discussion I had with Angie Brown, Strategic Services Consultant for Coremetrics.

For the purposes of commercial analytics packages, the first choice is to throw out the noise. This isn’t as haphazard as it sounds. The more commercial an analytics solution is, the more scalable it must be, and one of the ways to scalability is to categorize the data into very specific buckets. You might think of throwing out the noise as categorizing it into a junk bucket, similar to the junk folder in your email client.
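Here is a minimal sketch of that bucketing idea. The bucket names and routing rules are invented, not any vendor's actual categories; the point is simply that anything no rule claims falls into the junk bucket instead of being carried through every report.

```python
# Invented bucket names and routing rules, purely for illustration.
BUCKET_RULES = {
    "search":   lambda hit: hit["page"].startswith("/search"),
    "checkout": lambda hit: hit["page"].startswith("/checkout"),
    "content":  lambda hit: hit["page"].startswith("/articles"),
}

def bucket_for(hit):
    for name, rule in BUCKET_RULES.items():
        if rule(hit):
            return name
    return "junk"   # the noise, set aside like a junk mail folder

hits = [
    {"page": "/search?q=widgets"},
    {"page": "/checkout/step2"},
    {"page": "/favicon.ico"},   # ends up in the junk bucket
]
for hit in hits:
    print(hit["page"], "->", bucket_for(hit))
```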

The reason for categorization is to determine what’s important and what isn’t. At the simplest reporting level all analytics packages use all the data. It’s when you get into very complex reports that each analytics vendor gets to demonstrate their unique strengths because, at this level, you’re winnowing out details to report on very specific items which don’t require that all details necessarily be present.

The change that noisy data brings is that a new question needs to be answered, “What is this page’s purpose?” The question used to be (and in many cases still is) “What happened on this page?” Interestingly enough, that latter question, when the psycho- and neuro-linguistic concept of chunking is applied, becomes “What event happened on this page?” which, in the world of RIA, Rich Media and Web 2.0, becomes “What events were triggered on this page?” Semphonics has published an interesting paper on page purpose and I’ve written about understanding purpose in several places. Where Semphonics and NextStage might disagree is in how the concept of purpose is best achieved.

The real question of any analysis is “What question do you want answered?” This is a question scalability hates, and it is why commercial analytics vendors have consultants and such on staff. There are going to be holes regardless of how complete a package is, and it’s the consultant’s job to fill those holes. These holes are most often not easily filled by commercial vendors because what makes the hole a hole is noisy data.

Looking in holes and listening to noise is often relegated to exploratory analytics, and exploratory analytics are usually performed once. After that first exploratory analysis both vendors and clients want results-oriented analysis. This once-and-never-again policy is good for commercial vendors because they want to sell you what they know how to analyze. And I’m not disparaging analytics vendors by writing that. It’s like Barbara Johnson writes in The Critical Difference: “When we read a text once, in other words, we can see in it only what we have already learned to see before.”

Noisy data is going to challenge a lot of what’s out there because noisy data has either been historically discarded as junk or not recognized as useful if not necessary information (like the DNA example I used in Not So Social Networks).

Standards and Noisy Data, Part 5

[[Evidently there was some time between parts 4 and 5 because I wrote:]] After a bit of time away we’re returning to the Noisy Data arc. I think the time away was well spent because the conversations which led to this arc and the conversations while I was writing the arc ended in the development of a tool which I’ll share in the final installment. As for the actual length of time away, my apologies. At least I’m finishing this arc on a weekend per my original and revised promises.

Noisy data is going to challenge a lot of what’s out there because noisy data has historically either been discarded as junk or not been recognized as useful, let alone necessary, information (like the DNA example I used in Not So Social Networks).

Keith Jarrett wrote

“The treasure has always been there
It is not hidden
But is only where certain people would look
At all
Thus it remains a secret to the rest…” (Treasure Island)

This need to look where others aren’t looking is how noisy data got its start in so many sciences and why multi-disciplinary approaches to problem solving are gaining favor in so many fields: training in a given discipline teaches you to look through the lens of that discipline, and that means you can only see what that discipline has trained you to see. It is a Maslow’s Hammer-ish trap; when all you have is a hammer, everything looks like a nail.

When all I have is a hammer, everything looks like my thumb but that’s for another arc.

Standards and Noisy Data, Part 6

For readers wondering where this is going, I offer a quote from a participant in one of our corporate trainings:

“…you’ve got to hang in there until the punch line. Some other things that Carrabis comes up with can seem absolutely dotty in the beginning. You may have the urge to throw up your hands, walk out and find somebody who makes sense. Some of the folks in the last class did that. They managed to miss some of the most mind blowing educational experiences they could have had. I suggest you give it time if it seems weird, pointless, confusing, or irrelevant in the beginning. I promise it will pay.”

Exploratory Analysis has been expensive for many reasons. You need to have an idea of which hole the noisy data you’re interested in is sitting in and what type of noisy data you’re looking for; it’s usually a hands-on job and not automated (scalability again); and once performed, the end result is “Okay, we performed due diligence so we know what we can’t do and what we can’t look at. Let’s get back to something we can metricize”, …

These last two statements are key to our discussion moving forward: scalability and metricization (accountability). When we apply metrics to something we can make A=B, and that means we have the ability to say “This is working, continue” or “That isn’t working, stop.” These “do this, don’t do that” statements are action items. Keep them in mind for what follows.

Here I want to reintroduce pieces of the discussion I was having with Angie Brown, Strategic Services Consultant for Coremetrics. Angie is a big believer in accountability. She explained to me that “At the simplest reporting level all analytics packages use all the data. It’s when you get into very complex reports that each analytics vendor gets to demonstrate their unique strengths because, at this level, you’re winnowing out details to report on very specific items which don’t require that all details necessarily be present.”

It was Angie who — thinking out loud and blue skying — offered that where analytics needed to be is in a place where it could offer “Simple tools backed by incredibly complex analysis.”

And this is where the NextStage staff begins to get nervous. Joseph (that’s me) thinks he hears a question and goes into a fugue state until he solves it (Angie, Trish, Debrianna, Cindy, Dan, Susan and several others are chuckling reading this, I know).

Standards and Noisy Data, Part 7

How will Noisy Data affect 2.0 applications? Are we measuring acts or the reason for the act?

At this point in the Noisy Data arc I need to introduce a discussion I was having with FindMeFaster‘s CEO, Matt Van Wagner. Matt has worked with NextStage and recommends us to clients he knows can benefit from our offerings.

Matt and I were talking about the anthropologic demonstration of tools and I mentioned that tools evolve over time. An example I used was the flint stone.

Flint stone

Most people today wouldn’t recognize the stone, shell, bone, etc., tools used by our prehistoric ancestors unless they saw them in a museum display or on some science show. Often knapped arrowheads look like oddly shaped stones to the casual observer. The point is, we wouldn’t know how to use, let alone make, the tools our ancestors used if we had to.

Few people appreciate that the reverse is also true. You could hand our ancestors any modern tool (think cellphone, radio, computer) and they would perhaps be impressed that we spent our time making such oddly shaped stones but they wouldn’t have a clue what to do with them even if we explained their use. This is often referred to as Clarke’s Third Law, “Any sufficiently advanced technology is indistinguishable from magic.”

What is also true is that tools evolve in step with those who use the tools, and I do mean “step”. Some tool is introduced and creates a technological plateau upon which the tool users spread far and wide. Depending on the tool and how much it is used, that plateau can be very far and very wide.

Standards and Noisy Data, Part 8

The business and economic concept of this plateau is a market. The automobile, for example, has done more to shape the civilized surface of our globe than much that came before it. We didn’t create oceans upon which ships could sail but we did create roads upon which our cars could drive (yes, I’m ignoring the Suez and Panama Canals, etc). There have been many improvements to the automobile since it was first introduced and all these improvements didn’t change what we recognize as “automobile”.

You could see the first automobile and you’d probably say, “Wow, look at the antique car.” Likewise, someone who purchased one of the first automobiles could look at a car manufactured in 2007 and say, “Wow, what kind of car is that?” This is because all improvements to the automobile have been sustaining technology or sustaining innovation, meaning they are refinements to what a car is rather than redefining what personal transportation is (I mourn the Dymaxion but not “It”. By the way, no one who contacted me about Enterprise 2.what? could remember what “It” was or is. So much for that company’s marketing dept. “It” is the Segway and if you didn’t know then the point is made).

Eventually the edge of the plateau is reached. Most people think of the edge as the point where the plateau falls off. Toolmakers think of the edge as where the rise to the next plateau begins and if you read Mr. Machine and Childhood Imagination you have an idea how NextStage thinks of plateaus.

Where Noisy Data Meets Standards (The Noisy Data arc, Part 9)

Defining plateaus is great for business, branding and consumer mindshare. When you want a soft drink and automatically ask for a Coke, you’re demonstrating a standard that has defined a plateau. This is true when you want a tissue and ask for a Kleenex, make a copy and say you’ve Xeroxed something, … In each case the plateau is defined.

The point where an existing plateau ends with a rise to a new plateau is — in terms of tools and tool users co-evolving — what business and economics recognize as a disruptive technology. Disruptive technologies are disruptive because they redefine the plateau, give rise to another plateau, create an intersecting plateau which forces the market to shift in response, …

I know you’ll be shocked (Shocked, I tell you!) to learn that when I first formed a company around the technology (“Evolution Technology” or “ET”) based on my research I was told it was a disruptive technology. Fortunately I knew nothing about economics and even less about business so I usually responded with “Okay.”

Why was ET disruptive? Because ET didn’t care about clickthroughs or Time-On-Site or EntryPage or ExitPage. ET cared about “The reason I’m typing isn’t the reason you’re typing”, i.e., “I’m typing at my keyboard right now but the fact that I’m typing right now is the act expressing the internal state (psychological behavior). When you type at your keyboard are you expressing the same internal state that I am right now?” The reason ET cares about these things is that ET’s origins aren’t in web analytics or the internet in general. It’s not on the plateau of the web at all, nor does it trace its family tree through the evolution of web analytics tools since the early 1990s. People reading Reading Virtual Minds know it grew out of a completely different set of paradigms.

Where ET meets the web is in its ability to deal with “One of the challenges I’ve always had with analytics is that they deal with what’s happening at the machine, the computer, and not in the heart and mind of the person sitting at the computer,” because the data ET routinely works with is what traditional web analytics defines as “noisy”. Understanding cognitive, motivational/effective and behavioral elements has always been, to me, much more interesting than “How long was somebody on a page?”, “What page did someone enter a site on?”, “What page did someone exit a site at?” and so on.

The downside of coming from a completely different paradigm and dealing with what most people consider noisy data is that (in NextStage’s case) the tools (ET) are either stone knives or cellphones to most people investigating them.

The Noisy Data arc, Part 10

I’m reminded of a line from Brian W. Aldiss’ short story, “Old Hundredth”:

“…useless to deny that it is well-nigh impossible to improve anything, however faulty, that has so much tradition behind it. And the origins of your bit of metricism are indeed embedded in such a fearful antiquity that we must needs — “

Anyway, on with the arc!

So in talking with FindMeFaster’s Matt Van Wagner and going over my conversations with Coremetrics’ Angie Brown, I mentioned that in the beginning, because ET was considered a disruptive technology, most of our first five years were spent waiting for our market to emerge or for our plateau to intersect with everybody else’s plateaus.

This intersection started when visualization packages began appearing in web analytics. NextStage benefitted from this because the concept of “behavioral” started becoming lingua franca. Even though NextStage’s definition of “behavioral” isn’t the industry standard (surprise!), at least the word is out there.

Angie and I both agree that such packages are fun to watch and don’t provide a lot of actionability. Responses to Visualizing…what? indicate that most users feel the same way. It’s kind of like getting a racetrack when you’re a kid. You can only watch that car go around that loop so many times before you end up saying, “Yeah? Now what?”

I had been mulling over Angie’s tool definitions: simple reports use all the data, more complex reports use less data. This is true at the machine level because you can isolate machine components to test them separately before putting them into the machine as a whole. It’s an “Is the plane safe?” versus “How does landing gear work?” type of thing. To answer the former you need to know an awful lot about the whole plane; to answer the latter you only need to know about one subsystem of the plane.

The Noisy Data arc, Part 11

Answering questions about humans interacting with information (which is what NextStage does) is a different matter because people have this nasty habit of not being simple machines, and this swings us back into standards and the current state of behavioral metrics. Equating an individual clicking on an ad with anything other than a click on an ad is (to me) dangerous. Human activity is not easily deconstructed into a series of subsystems (my apologies to Skinnerites everywhere). A simple report such as “Are people having a good or bad experience on a site?” requires the same data as the “drill-down” report “Why are people having a bad experience?” because people don’t have a good or bad experience simply because they’re having a good or bad experience. (see figure below)

There’s a reason they’re perceiving their experience the way they are, such as “They didn’t understand what was offered”, “There wasn’t enough information for them to make a decision” or “They didn’t see what they wanted.” (see figure below)

Know why someone is having a good or bad experience in a cognitive, motivational/effective or behavioral way and you can address their reasons accordingly. Simplify the offering, add more information, add an image; whatever is required to address your market and your business model (for the curious few, yes, these are genuine NextStage reports and demonstrate the kinds of information we provide our clients).
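To make the “same data, two depths of report” point concrete, here is a minimal sketch. It is not a NextStage report, and the records and reason labels below are invented (they simply echo the reasons mentioned above); it only shows the simple question and the drill-down being computed from exactly the same visit data.

```python
from collections import Counter

# Invented visit records; reason labels echo the ones mentioned above.
visits = [
    {"experience": "bad",  "reason": "didn't understand what was offered"},
    {"experience": "good", "reason": None},
    {"experience": "bad",  "reason": "not enough information to decide"},
    {"experience": "bad",  "reason": "didn't see what they wanted"},
    {"experience": "good", "reason": None},
]

# Simple report: are people having a good or bad experience?
bad_share = sum(v["experience"] == "bad" for v in visits) / len(visits)
print(f"bad experiences: {bad_share:.0%}")

# Drill-down report: why? Built from exactly the same records.
reasons = Counter(v["reason"] for v in visits if v["reason"])
for reason, count in reasons.most_common():
    print(f"  {reason}: {count}")
```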

For Angie and Matt, and The Noisy Data Finale

I’ll bet some of you thought we’d never get to the end of this thread. Here is where all the threads come together with “…a tool is going to have either limited use or few users if substantial training is necessary in order to use it.”

The first tools need to be simple if they’re going to be used at all. Once simple tools are understood they can evolve along with the tool users to handle more complex situations and more advanced uses.

Let’s bring all the threads together at this point.

  1. Angie wants metrics that have accountability.
  2. Business demands that things be scalable in order for the plateau to go on far and wide.
  3. The blue sky desire is to have “Simple tools backed by incredibly complex analysis.”
  4. Matt helped me to understand that before someone can use an adjustable spanner they need to be comfortable with a stone knife.
  5. Noisy data, like what we used to call “junk” DNA, isn’t junk. It has meaning; you just have to know what it means.
  6. What’s noisy data in one paradigm is perfectly valid data in another.
  7. What traditional web analytics — what’s happening at the machine, the computer — considers noisy data is perfectly valid data for discerning what’s happening in the heart and mind of the person sitting at the computer.
  8. Tools need to start simple and evolve with their users.

So the end of the fugue which Angie’s blue-skying sent me on, and which Matt helped me understand, resulted in what we’re calling The NextStage Gauge, “…a simple tool that indicates the online health of your website along with 1-3 action items for improving your site’s ROI.” The NextStage Gauge can recommend up to six action items and will show at most three at any point in time (for reasons that’ll be in an upcoming IMedia column). The last recommendation The NextStage Gauge will give is that it’s time to upgrade to more advanced reporting.

The NextStage Gauge

In other words, The NextStage Gauge is

  • a simple tool (3,4) that
  • produces action items hence has accountability (1),
  • is scalable (2) because its ability to make recommendations is limited only by data storage and processing speed (the underlying algorithms have been making accurate predictions for years),
  • produces a metric based on what most consider noisy data (5-7) and
  • recognizes when the user is ready for a more complex tool (8).

By the way, NextStage is interested in hearing from companies interested in beta testing The NextStage Gauge. Please contact us with your interest. [[(the mice) For some reason that tool isn’t listed on our Tools page even though it’s actively used. Bet this means Mrs. C is going to have us add it sometime soon.]]

Right now NextStage is developing some algorithms to remove noisy data from blog metrics (I’ll bet my ranking as a B-list blogger‘s going to go down because of this).

Links for this arc:

