Comments: Size Does Matter

If you’ve ever worked on a website where users are allowed to comment, it’s likely that you’ve had a discussion where somebody asked that the length of comments be restricted. I guess the assumption is that people will only write useless drivel anyway, so the shorter the comment, the less useless drivel they can put on your site. But if that’s the assumption, why have comments to begin with?

In my subjective experience, there is a correlation between comment length and quality: the longer a comment, the better it probably is. If this is true, restricting comment length is likely to cut off the best comments, while leaving poorer comments intact.

To find out whether there really is a correlation between comment length and quality, I’ve set up a little experiment this weekend. I’ve loaded a bunch of comments from MetaFilter and YouTube,1 and I’ve asked people to rate them.

Comment Rating Site

Since YouTube restricts comment length to 500 characters and MetaFilter doesn’t, mixing the two data sets would skew the results: YouTube comments are obviously worse than MetaFilter comments, so they would unfairly drag down the average quality of short comments. To get around this problem, I’ve kept the two data sets separate.

I should also note that some short comments may be replies to other comments that don’t make sense out of context, so short comments may have a slight disadvantage. Additionally, I’ve removed all formatting from the comments, which may have made some of them look slightly incoherent; this probably hits longer comments harder. I don’t believe either issue has a huge influence on the result, though. The context of most comments is obvious even without any formatting, and the formatting problem likely affects all comment lengths to some degree.

In total, you guys have kindly rated 9310 comments (4470 from MetaFilter, 4840 from YouTube) covering 3972 distinct comment texts (1469 from MetaFilter, 2503 from YouTube). For duplicate comment texts (some comments were rated more than once, and other – short – comments had the same text), I’ve taken the average rating. For the correlations themselves, I’m providing charts, but with one exception noted below, I haven’t calculated correlation coefficients for the results.
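In case you want to reproduce the deduplication step, here’s a rough Python sketch (not the code I actually used; the sample data is made up for illustration):

```python
from collections import defaultdict

# ratings: (comment_text, rating) pairs; the same text may appear many times,
# either because a comment was rated repeatedly or because two short comments
# happened to have identical text.
ratings = [
    ("lol", 1), ("lol", 2),           # duplicate short comment, rated twice
    ("A thoughtful reply.", 5),
]

# Group all ratings by comment text, then average per text.
by_text = defaultdict(list)
for text, rating in ratings:
    by_text[text].append(rating)

avg_rating = {text: sum(rs) / len(rs) for text, rs in by_text.items()}
```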

Finally, I’ve assigned a number to each «quality group», from 1 for worst to 5 for best. Sometimes you’ll see the group’s smiley; sometimes you’ll see the number.

Rating Key

Okay, with all the disclaimers out of the way, let’s look at the data.

A Word On Comment Length

Before I answer any of the interesting questions, I have to point out that most comments are short. Here’s what MetaFilter’s comment length distribution looks like with a linear x-axis:

MetaFilter Comment Length Distribution Linear

YouTube’s comment length drops off even more quickly, since no comment can be longer than 500 characters. To get around this problem, I sometimes use an exponential scale with base 2 for comment length, i.e. when grouping comments by length, I increase group sizes as comment lengths increase. The same comment length distribution as seen above looks like this on an exponential scale:

MetaFilter Comment Length Distribution Exponential

Since the sizes of the groups are mostly irrelevant for my results, using an exponential scale allows for much more readable charts.
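The base-2 grouping amounts to putting a comment of length n into bucket floor(log₂(n)), so bucket widths double as lengths grow. A minimal sketch (my actual bucket boundaries may have differed):

```python
def length_bucket(n: int) -> int:
    """Base-2 bucket index: lengths 1, 2-3, 4-7, 8-15, ... share a bucket.
    Equivalent to floor(log2(n)), but exact for all positive integers."""
    return n.bit_length() - 1

# Bucket widths double as comments get longer: lengths 256-511 land in one
# bucket, 512-1023 in the next, and so on.
```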

Is there a Correlation Between Comment Length and Comment Quality?

First, I’ve taken the average rating for each comment, rounded it to the nearest «quality group», and calculated how long the average comment in each group is. Here’s the result for YouTube:

Length per Quality for YouTube

(Show me the Numbers)

Note that you can click on «Show me the Numbers» below each chart to see the numbers used in that chart, as well as some additional data such as the number of comments in each group, and the average word length in each group.
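If you want to reproduce the grouping behind these charts, here’s a rough Python sketch (not the code I actually used; I’m assuming ratings are rounded half up to the nearest group):

```python
from collections import defaultdict

def avg_length_per_group(comments):
    """comments: iterable of (length_in_chars, avg_rating) pairs.
    Rounds each rating to the nearest «quality group» (1-5) and
    returns the average comment length per group."""
    lengths = defaultdict(list)
    for length, rating in comments:
        group = min(5, max(1, int(rating + 0.5)))  # round half up, clamp to 1-5
        lengths[group].append(length)
    return {g: sum(ls) / len(ls) for g, ls in lengths.items()}
```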

On YouTube at least, there doesn’t seem to be a correlation between comment length and quality. MetaFilter paints a different picture:

Length per Quality for MetaFilter

(Show me the Numbers)

It is clear that higher-rated comments are, on average, longer than lower-rated comments. The group of highest-rated comments has an average length of over 750 characters, way ahead of what is even allowed on YouTube.

Let’s turn this chart around and plot quality based on length. Note that I’ve made the group sizes grow exponentially on the x-axis since there are fewer long comments.

Here’s how this looks on YouTube:

Quality by Length YouTube

(Show me the Numbers)

Again, this looks pretty flat (there actually is a correlation between length and average rating here - correlation coefficient is around 0.4 - but the results are all pretty similar, so I don’t think it matters much).
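For the curious, that coefficient is the plain Pearson correlation, which is easy to compute without any libraries; a self-contained sketch:

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5
```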

Let’s look at MetaFilter’s chart:

Quality by Length MetaFilter

(Show me the Numbers)

This chart clearly shows that the longer a comment is, the higher it is rated. However, the «ratings growth» flattens out at a comment length of about 2000 characters, with a rating of about 3.70 – long before hitting the «rating limit» of 5.

Conclusion

Given this data, my recommendation is to not restrict comment length, or to restrict it at a high level.

Looking at MetaFilter comments, there seems to be a clear correlation between comment quality and comment length. At least on websites with an audience that is not actively malevolent, longer comments seem to be better. Since longer comments on average were rated higher, restricting comment length may cut off the best comments, or force comment writers to trim their comments until they fit – and it’s pretty hard to trim a 2000-character text into 500 characters without removing substance.

If, for some reason, you are required to restrict comment length, I would recommend setting the limit at around 2000-4000 characters. At this point, the return in quality for additional comment length seems to diminish. Comment quality actually decreases after 4000 characters in my data. However, since there are only four comments longer than 4000 characters, this may not be representative. Even so, the data points before that final group imply that average quality is at least not going to grow further after 4000 characters.

Note that I am not claiming that there is a causal relationship between comment length and quality. Comment quality is determined by many factors. Length may be one of them. However, the data does seem to suggest that restricting comment length will not improve comment quality. It does not prove that this is the case, though - even given this data, it’s possible that comments on MetaFilter would be more succinct and to the point if their length was restricted. It might also be that people simply rated longer comments higher because shorter comments were more likely to lack context, and that there is no actual correlation between quality and length even on MetaFilter. I should also repeat that YouTube did not exhibit a similar correlation between comment length and comment quality, so clearly, different sites operate under different rules.

What Else?

Of course, at this point, all of this got me thinking. Is there anything about the content of these comments that acts as a predictor for quality? You betcha.

Please note that this is not a statistically valid study. The selection of comments was not truly random (which should be okay when evaluating comment length, but probably isn’t when evaluating words, especially words which occur rarely). The people who voted were self-selected, and the sites which advertised the voting form further skewed who was able to vote.2

Please also note that I am not advocating using any of this data for filtering purposes. This is just for fun.

For each of these charts, I’ve started with the average rating, followed by the ratings for the given subset of comments. For those «subset» ratings, I’ve marked them red if they were below average, and green if they were above average.

Weird Punctuation

First, let’s look at whether there’s a connection between poor punctuation and quality. I’ve looked at four different punctuation mistakes: a missing apostrophe in «don’t», an ellipsis with only two dots, an ellipsis with four dots, and four consecutive question marks (though arguably, the last one isn’t really a mistake).

Mistake Average Rating Number of Comments Standard Deviation
Average 2.38 3972
…. 2.00 147 1.17
dont 1.84 69 1.12
.. 2.00 91 1.10
???? 1.81 15 0.81

All of these mistakes seem to correlate with a decrease in perceived comment quality. The four consecutive question marks are hit the hardest, followed by the missing apostrophe.
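Flags like these are easy to detect with regular expressions. A sketch of how the four patterns above might be matched (my actual matching rules may have differed slightly):

```python
import re

PATTERNS = {
    "dont": re.compile(r"\bdont\b"),            # missing apostrophe
    "..":   re.compile(r"(?<!\.)\.\.(?!\.)"),   # exactly two dots
    "....": re.compile(r"(?<!\.)\.{4}(?!\.)"),  # exactly four dots
    "????": re.compile(r"\?{4}"),               # four question marks
}

def flags(comment: str) -> set:
    """Return the set of «mistake» labels found in a comment."""
    return {name for name, rx in PATTERNS.items() if rx.search(comment)}
```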

CAPS LOCK

Looking at the comments with at least 8 or 20 consecutive capital letters, we get a similar picture. CAPS seems to be an indicator of a poor comment; the more caps, the worse the comment:

Average Rating Number of Comments Standard Deviation
Average 2.38 3972
8 Consecutive Caps 2.01 256 1.18
20 Consecutive Caps 1.90 46  1.06

Colloquialisms

Next, let’s look at terms which are typically not used in formal language. This includes abbreviations like «OMG» or «LOL». I’m also including things like using «u» instead of «you». Here’s a list, and how they stack up.

Colloquialism Average Rating Number of Comments Standard Deviation
all 2.38 3972 1.24
omg 2.16 54 1.25
LOL 1.79 245 1.00
u 1.82 173 1.06
dat 1.58 12 0.93
bro 1.60 5 0.73

Unsurprisingly, they’re all indicators for poor comments. So let’s move on to something more interesting.

Swearing

What’s interesting about swear words is that not all of them are created equal. Some are actually indicators of above-average comments. See for yourself:

Swear Word Average Rating Number of Comments Standard Deviation
all 2.38 3972 1.24
suck 1.87 60 1.12
fuck 2.10 229 1.25
cock 1.98 14 1.29
cunt 1.40 5 0.37
asshole 1.99 14 1.21
jackass 2.70 11 1.30
crap 2.70 30 1.31
shit 1.94 140 1.16
bullshit 1.88 17 1.05
holy shit 2.25 2 1.25
douchebag 2.57 7 1.59

Most notably, both «jackass» and «crap» seem to be reasonably good indicators for above-average comments.

Addressing Each Other

Comments which refer to the writer or to another person (and contain words such as «I», «you» or «me») rank above average. I guess people are nicer and write more thoughtfully when they are addressing each other personally.

Average Rating Number of Comments Standard Deviation
all 2.38 3972 1.24
you 2.64 1117 1.28
I 2.72 1527 1.24
me 2.76 386 1.30
your 2.69 299 1.32
my 2.69 461 1.30

Other Positive Words

There are many more words which are predictors for good comments. Here are a few.

Word Average Rating Number of Comments Standard Deviation
all 2.38 3972 1.24
consider 3.61 48 0.91
interestingly 3.50 2 0.50
presumably 3.89 5 0.40

Note that «interestingly» and «presumably» only occurred twice and five times, respectively. What’s more, none of these words appeared on YouTube at all. This might imply that they’re only above average because they only appear in MetaFilter comments. However, MetaFilter’s average rating is 3.17, so they are above average even when only considering MetaFilter comments.

Average Word Length

Next, let’s look at word length. Is there a connection between average word length and comment quality? Let’s see. Here’s the average word length for every rating group for YouTube comments:

Word Length by Quality on YouTube

(Show me the Numbers)

And here’s the same chart for MetaFilter comments:

Word Length by Quality on MetaFilter

(Show me the Numbers)

I guess the interesting part here is that there is no interesting part. I would have expected to see a clear correlation between comment quality and word length. That does not seem to be the case.

Average Sentence Length

Same deal as average word length, but this time comparing average sentence length. On YouTube, it looks like this:

Sentence Length by Quality on YouTube

(Show me the Numbers)

Again, that doesn’t look like much of a correlation. But look at MetaFilter:

Sentence Length by Quality on MetaFilter

(Show me the Numbers)

Clearly, MetaFilter comments with longer sentences are perceived as «better.»
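The word-length and sentence-length figures in these charts come down to simple tokenizing. A rough sketch of how they might be computed (my actual tokenization rules likely differed):

```python
import re

def avg_word_length(text: str) -> float:
    """Average word length in characters; words are runs of letters/apostrophes."""
    words = re.findall(r"[A-Za-z']+", text)
    return sum(len(w) for w in words) / len(words)

def avg_sentence_length(text: str) -> float:
    """Average sentence length in characters; sentences end at . ! or ?"""
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return sum(len(s) for s in sentences) / len(sentences)
```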

Context

So, can you use this data to predict whether a comment is good or bad? No. It’s all about context. Consider the words «gay» and «dog». On YouTube, these words have clear negative connotations. On MetaFilter, however, comments containing the word «gay» or «dog» are rated above average.

Word Average Rating on YouTube Average Rating on MetaFilter
all 1.92 3.17
bush 2.5 3.04
obama 2.22 2.95
god 1.60 3.26
dog 1.74 3.78
gay 1.73 3.85

One possible explanation may be that «gay» and «dog» are used as insults on YouTube, but are used with their neutral meanings on MetaFilter.

As for Obama and Bush, perhaps political discussions are above average compared to the typical YouTube discussion, but below average compared to the typical MetaFilter discussion.

Comparing the Quality of YouTube and MetaFilter Comments

And finally: if I don’t provide this information, you’ll ask me for it, so here goes.

I’m not going to compare comment length, since YouTube restricts the length of its comments. However, what I can compare is word length. On average, MetaFilter commenters use longer words:

Average Word Length
All 4.40
Mefi 4.52
YouTube 4.02

While MetaFilter commenters’ words are only slightly longer on average, their sentences are more than twice as long:

Average Sentence Length (Characters)
All 70.38
Mefi 89.87
YouTube 39.92

That means that MetaFilter commenters use roughly twice as many words per sentence as YouTube commenters.
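A quick back-of-the-envelope check of that claim, assuming each word contributes its own length plus one separating space to a sentence:

```python
# chars per sentence / (chars per word + 1 space) ≈ words per sentence
mefi_words    = 89.87 / (4.52 + 1)   # MetaFilter
youtube_words = 39.92 / (4.02 + 1)   # YouTube
ratio = mefi_words / youtube_words   # roughly 2
```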

And the average quality of their comments is better:

Average Comment Rating
All 2.38
Mefi 3.17
YouTube 1.92

And that’s all, folks.


  1. To be perfectly clear, none of the data I have collected will be stored or released except for the data in this blog post. The collected list of rated comments will not be released for privacy and copyright reasons. In fact, I have already deleted the corpus of rated comments, so if you have additional ideas for evaluation, I unfortunately can’t implement them.

  2. A word on statistical significance. Given that some of these results are based on a very small number of comments, it is possible or even likely that they have occurred by accident. I haven’t calculated the significance for any of the results. For some results, I have provided the standard deviation to indicate how closely clustered the results were. Again, I’m not a statistician, and this part is just for fun; don’t rely on any of these results.

If you require a short url to link to this article, please use http://ignco.de/188


If you liked this, you'll love my book. It's called Designed for Use: Create Usable Interfaces for Applications and the Web. In it, I cover the whole design process, from user research and sketching to usability tests and A/B testing. But I don't just explain techniques, I also talk about concepts like discoverability, when and how to use animations, what we can learn from video games, and much more.

You can find out more about it (and order it directly, printed or as a DRM-free ebook) on the Pragmatic Programmers website. It's been translated to Chinese and Japanese.