About the Vocabulary Size Test

Bruce Zhang (01/22/2004)

Since the release of the vocabulary size test on 11/05/2002, over three thousand users have taken the test as of today (01/22/2004). It's time to share the statistics with those who are interested in the test and to explain how the test was developed.

Test Results

Users take about 10 to 30 minutes to complete the test, with an average of about 15 minutes. The accuracy of the test in the following table is based on feedback from over 800 users.

Not Meaningful    10

I was surprised by the percentage of underestimates, since the test was developed in a scientific way. I could adjust the calculation of the size to increase the accuracy level, but I decided not to. This is a test for estimating vocabulary sizes under 10,000 words. Many users who completed the test have a vocabulary size well over 20,000. Excluding the users whose estimated size is 8,500 or higher, the adjusted accuracy level is about 60%.

Development of the Vocabulary Size Test

While thinking of a simple, easy-to-use, yet scientific way to estimate a person's vocabulary size, I knew it was more of a mathematical problem than a language problem. In theory, it is really a very simple case of statistical sampling. The challenge lies in the implementation. Luckily enough, the resources on the Internet have made it possible. The test was developed in the following steps:

  1. Generate a list of frequently used words by combining 5 word lists (e.g. College Vocabulary List). The generated list has over 30,000 words.
  2. Estimate the word frequencies by searching 5 search engines. Frequencies estimated from word occurrences in billions of documents are much more reliable than the known frequency estimates, which are based on far fewer documents.
  3. Generate a random sample of a particular size (50, for instance) from the part of the word list with word frequency under 10,000. I used the random.org website to generate true random numbers.
  4. Use a weighted average to calculate the estimated vocabulary size based on word frequency and the user's answers to the multiple-choice questions.
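
Steps 2 to 4 can be illustrated with a minimal Python sketch. It is not the code behind the test, and several details are assumptions made only for illustration: the function names are hypothetical, the hit counts from the search engines are combined by taking the median, "word frequency under 10,000" is read as the 10,000 most frequent words, the sample is spread evenly over five frequency bands, and each band is weighted by its size. Python's pseudo-random generator stands in for the true random numbers from random.org.

```python
import random
from statistics import median


def rank_words(hit_counts):
    """Order words from most to least frequent.

    hit_counts maps each word to a list of hit counts, one per search
    engine. Using the median across engines is an assumption.
    """
    return sorted(hit_counts, key=lambda w: median(hit_counts[w]), reverse=True)


def draw_sample(ranked_words, limit=10_000, bands=5, per_band=10, seed=None):
    """Draw per_band random words from each of `bands` frequency bands.

    Only the `limit` most frequent words are used. Returns the sampled
    words per band and the number of words in each band.
    """
    rng = random.Random(seed)  # pseudo-random stand-in for random.org
    top = ranked_words[:limit]
    edges = [len(top) * b // bands for b in range(bands + 1)]
    sampled, sizes = [], []
    for start, end in zip(edges, edges[1:]):
        band = top[start:end]
        sampled.append(rng.sample(band, min(per_band, len(band))))
        sizes.append(len(band))
    return sampled, sizes


def estimate_size(band_sizes, band_results):
    """Weighted estimate: each band contributes size * fraction correct.

    band_results holds one list of booleans per band (True = the user
    answered that word's question correctly). No correction for lucky
    guesses on multiple-choice questions is applied here.
    """
    total = 0.0
    for size, answers in zip(band_sizes, band_results):
        if answers:
            total += size * sum(answers) / len(answers)
    return round(total)
```

For example, a user who answers 10, 8, 5, 2 and 0 of the 10 questions correctly in five bands of 2,000 words each would get an estimate of 2,000 x (1.0 + 0.8 + 0.5 + 0.2 + 0.0) = 5,000 words under this scheme.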

