How SimilarWeb’s APIs Tackle the Challenges of Categorizing Websites

Sorting web domains into categories can be a major problem for marketers and product managers at ad networks, lead generation companies, security software firms, content filter providers, marketing segmentation specialists, classified publications and scores of other types of companies. These professionals know that categorizing a website accurately can make the difference between a win and a loss.

Entire companies can be built or destroyed by the integrity of their system for categorizing the web. That’s why we make sure that SimilarWeb’s category database, which powers the SimilarWeb Website Categorization API, is as robust, versatile and accurate as possible. Simply call the API with virtually any domain, and it will return actionable data that you can count on.

 

Many Layers to the Challenge

Users can’t always accurately categorize a website, even after reading through most of its text. The job is doubly hard for bots, which have to do it without the benefit of human intelligence. What signals should a computer program use to categorize websites: metadata, content text analysis, or similarity to websites that function in comparable ways?

This problem is especially acute when trying to categorize very large websites like Wikipedia, or websites that rely on images purely for their symbolic or illustrative value. Robots don’t do well with non-literal meanings.

Another aspect of this problem is the fact that many websites fit into more than one category or sub-category. How many categories and sub-categories should be allowed per website?

Also, because the internet is so dynamic and fast-paced, websites need to be checked again and again. They may have disappeared, or evolved in a way that places them in a different category. New sites go up all the time (571 of them every minute of every day, according to one estimate), and old ones are taken down, too.

 

Categorizing Websites with the SimilarWeb API

In order to take the headache out of website categorization, the data team behind SimilarWeb created a categorization engine which uses a multi-class learning algorithm. Our code assigns a category for any given website using website content tags, similarity results and a learning set of 2.5 million websites that have verified category assignments.
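To make the idea of learning categories from a verified set more concrete, here is a minimal sketch of one possible multi-class approach: a small naive Bayes classifier over content tags. It is an illustration only, not SimilarWeb’s engine; the function names, the tags and the category names in the tiny learning set are all made up for the example.

```python
import math
from collections import Counter, defaultdict


def train(labeled_sites):
    """Build tag statistics from (tags, category) pairs with verified categories."""
    tag_counts = defaultdict(Counter)  # category -> tag -> occurrences
    site_counts = Counter()            # category -> number of sites
    vocabulary = set()
    for tags, category in labeled_sites:
        site_counts[category] += 1
        tag_counts[category].update(tags)
        vocabulary.update(tags)
    return tag_counts, site_counts, vocabulary


def classify(tags, model):
    """Return the category with the highest naive Bayes log score."""
    tag_counts, site_counts, vocabulary = model
    total_sites = sum(site_counts.values())
    best_category, best_score = None, -math.inf
    for category in site_counts:
        # Log prior: the share of the learning set that belongs to this category.
        score = math.log(site_counts[category] / total_sites)
        total_tags = sum(tag_counts[category].values())
        for tag in tags:
            # Laplace smoothing so an unseen tag does not zero out a category.
            seen = tag_counts[category][tag]
            score += math.log((seen + 1) / (total_tags + len(vocabulary)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Hypothetical learning set; a real one would hold millions of verified sites.
learning_set = [
    (["recipes", "cooking", "dinner"], "Food and Drink"),
    (["restaurants", "cooking", "wine"], "Food and Drink"),
    (["scores", "football", "league"], "Sports"),
    (["tennis", "scores", "tournament"], "Sports"),
]
model = train(learning_set)
print(classify(["cooking", "wine", "recipes"], model))  # prints "Food and Drink"
```

A production engine would combine far more signals, such as the similarity results mentioned above, but the shape is the same: learn from sites with known categories, then score an unknown site against every category and pick the best match.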

SimilarWeb’s data is collected from many sources at once, leading to unparalleled accuracy. Our engine can accurately classify an unknown website into one of 25 main categories and 219 sub-categories. You can see the full list here. Machine learning and input allow us to constantly improve the engine, and the results are rigorously cross-validated and tested so the API provides the best results.

 

Keep on Moving

The ability to algorithmically generate categories for a given list of websites is enormously powerful and can be used for lead generation, marketing segmentation and online filtering. Instead of going through lists of leads or visiting web addresses manually one by one, you can save time and money by running them through the SimilarWeb API and getting quick and accurate information on each URL’s category.
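As a rough sketch of what running a list of leads through a categorization service could look like, the snippet below groups domains by the category a lookup function returns. The lookup is passed in as a parameter because the API’s actual endpoint and response format are not described here; `categorize_domains`, `fake_fetch_category` and the sample domains and categories are hypothetical stand-ins.

```python
from collections import defaultdict


def categorize_domains(domains, fetch_category):
    """Group domains by the category that fetch_category(domain) returns."""
    grouped = defaultdict(list)
    for domain in domains:
        normalized = domain.strip().lower()
        if not normalized:
            continue  # skip blank rows in the lead list
        try:
            category = fetch_category(normalized)
        except Exception:
            # A failed lookup should not stop the rest of the batch.
            category = "Uncategorized"
        grouped[category].append(normalized)
    return dict(grouped)


def fake_fetch_category(domain):
    """Stand-in for a real API call; raises KeyError for unknown domains."""
    lookup = {"example-news.com": "News", "example-shop.com": "Shopping"}
    return lookup[domain]


leads = ["Example-News.com", "example-shop.com", " unknown.example "]
print(categorize_domains(leads, fake_fetch_category))
```

In practice, `fake_fetch_category` would be replaced by a function that calls the categorization API and returns the category from its response.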

Check out the data yourself

 

About the Author

Noam Schwartz - Entrepreneur, Hacker, Analyst. Obsessed with making the world better using data.

