About Manatee

Card Catalog of the Web

This is the proposal. To see the rationale for its necessity, check out The Problem linked below.

Sections about CCOW

Here is the Problem
How Paper Libraries Have Handled Categorization and Computer Searches
The CCOW Proposal
Technical Implementation of the System, or How Webmasters Would Use It
Semantic Implementation of the System, or How Search Engines and Directories Would Process It
Recommended Reading on the Subjects of Categorization and Searching

Proposal: A Card Catalog of the Web

I propose a means of adding hierarchy to free text searching. Libraries have made good use of cataloging books by subject within a hierarchy or a set of subject headings, and the advent of widespread computer use has given us the power to catalog keywords and other information in databases and to search those databases quickly. So why not solve the problem mentioned in passing in TIME's article, "there is no comprehensive 'card catalogue' system to organize the gargantuan library" [2]? Why not create a card catalog that people could use to find materials based on subject and author as well as title? Why not give webmasters and search engines the power to turn search engines and indices into more powerful tools for finding the information we really want?

Users Will Benefit

Search engines could give users the ability to add a category limiter to their search queries, and indices could quickly sort Web sites into appropriate categories, which users could then search within.

I believe the best way to accomplish this improvement in Internet searching (and, one would hope, in electronic card catalogs) is for webmasters and search engines to adopt a unified standard for a new meta tag. The standard would allow webmasters to categorize their own sites, trusted authority servers to maintain lists of uncategorized or miscategorized sites so search engines can match them with the proper categories, and search engines to group sites in appropriate categories.

The New meta cowc Tag

The tag would look like this: <meta name="cowc" content="#1234">, where content is the categorization of the page under the Catalog of the Web Classification (COWC).

This new system will require webmasters to categorize their pages based on the COWC, so I have begun developing the categories. The general system is a hierarchy, and the categories aim to make classification simple and easy to remember.

How Material is Classified with the CCOW Classification

COWC uses four hexadecimal digits to form a number for each of 65,536 (16^4) categories. Each level except the least significant uses one of a small number of schemes. There is, for example, a general scheme, a geographic scheme, a linguistic scheme, and a historic scheme, depending on the type of the category to be divided.
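The four-digit structure can be illustrated with a minimal Python sketch. The function name parse_cowc and the choice to accept codes with or without the leading # are assumptions made for illustration; the proposal itself only fixes the four hexadecimal digits.

```python
import re

# Assumption: a COWC code is an optional '#' followed by exactly four hex digits.
COWC_PATTERN = re.compile(r"^#?([0-9a-fA-F]{4})$")

# Four hexadecimal digits give 16 * 16 * 16 * 16 possible categories.
TOTAL_CATEGORIES = 16 ** 4


def parse_cowc(code):
    """Split a COWC code such as '#bc20' into its four levels as integers."""
    match = COWC_PATTERN.match(code.strip())
    if match is None:
        raise ValueError(f"not a four-digit hexadecimal COWC code: {code!r}")
    # Most significant level first, least significant level last.
    return tuple(int(digit, 16) for digit in match.group(1))
```

For example, parse_cowc("#bc20") yields the levels (11, 12, 2, 0), most significant first.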

The Hexadecimal Numerals in General Division

In the general categorization, 0 denotes a topic of multiplicity or collections. Because of this, a Web site that covers multiple topics can still categorize itself in a single position. For example, a metasearch engine such as Dogpile would use the category #0000, because it searches a number of search engines, which gives it access to almost unlimited types of data. A site dealing with a multitude of languages might fall into the category of #3100, while a site dealing with multiple topics on the English language might fall into the category of #4120.

The other categories are similarly divided and subdivided. 1 denotes topics of generality that are not topics of multiplicity. For example, encyclopedia sites would fall into #1100 or one of the other 255 categories under the top two levels, 11, because they cover general knowledge. 2 has to do with mentality, or the psychology or philosophy of a topic, so under each of the other categories and subcategories, there might be a section dealing with philosophy, such as the philosophy of faith healing (#bc20). 3 denotes memorability and covers topics of memory or history.

4 is the category of communicability or communications, which is where topics of languages, rhetoric, and transportation may be found. 5 is technology. 6 is spatiality and covers topics such as geography, cartography, and non-chemical astronomy.

7 is empiricism and holds the hard sciences and mathematics. 8 is for artistry and covers topics of artistic expression. 9 is the number of legality, law, and political science.

The numeral a is for provision and holds things like agriculture. The numeral b is for vitality and holds topics of health or the maintenance of non-biological systems. The numeral c covers religion, rituality, and faith, so biographies of religious figures would be found under the c subheading of the biography category e (described below), somewhere between #ec00 and #ecff.

The numeral d denotes sociality and houses things like social science. The numeral e signifies generally that a category or subcategory has to do with specificity or individuality. This topic would cover biographies and personal home pages. The specific method for categorizing home pages has not been decided, but there are so many of them on the Internet that they share the e category with biography, occupying #ee00-#eeff. Finally, f signifies marginality, abnormality, and fiction. In addition to fiction (#f000-#feff), there is space here for new, marginal, controversial, conjectural, and unaccepted topics (#ff00-#ffff).
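The general division described above can be written down as a simple lookup table. This Python sketch is only an illustration: the one-word labels are taken from the descriptions in this section, the name top_level_name is hypothetical, and only the most significant digit is interpreted, since lower levels may use the geographic, linguistic, or historic schemes instead.

```python
# Labels for the sixteen numerals of the general division, as described above.
GENERAL_DIVISION = {
    "0": "multiplicity",
    "1": "generality",
    "2": "mentality",
    "3": "memorability",
    "4": "communicability",
    "5": "technology",
    "6": "spatiality",
    "7": "empiricism",
    "8": "artistry",
    "9": "legality",
    "a": "provision",
    "b": "vitality",
    "c": "religion",
    "d": "sociality",
    "e": "specificity",
    "f": "marginality",
}


def top_level_name(code):
    """Return the general-division label for the first digit of a COWC code."""
    digits = code.lstrip("#").lower()
    return GENERAL_DIVISION[digits[0]]
```

Under this table, a code in the biography range such as #ec00 has the top-level label "specificity".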

These categories will all be assigned as there is time and demand for them.

The other types of division mentioned have similar significance for each numeral.

How do webmasters use this?

Technical Implementation,
or How Webmasters Would Use It

Selection

When the system is more fully developed, webmasters will be able to visit a Web page maintained by the CCOW's authoritative agency (either an individual or a group) and browse the category tree.

Having located the category that best matches a site, or a single page on a site, webmasters will see the proper category number and can use it to categorize their pages.

Tagging

Webmasters will categorize their pages by adding a <meta name="cowc" content="#1357"> tag to the head section of their HTML.

This tag can then be read by search engine spiders and directory maintainers, who will be able to put the Web site in an appropriate category.
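As a rough sketch of how a spider might read the tag, the following Python uses the standard library's html.parser module. The class and function names are hypothetical, and taking only the first cowc tag on a page is an assumption; the proposal does not say what should happen if a page carries more than one.

```python
from html.parser import HTMLParser


class CowcMetaParser(HTMLParser):
    """Collects the content of the first <meta name="cowc"> tag it sees."""

    def __init__(self):
        super().__init__()
        self.cowc = None

    def handle_starttag(self, tag, attrs):
        # Assumption: only the first cowc tag on a page counts.
        if tag != "meta" or self.cowc is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "cowc":
            self.cowc = attributes.get("content")


def extract_cowc(html):
    """Return the page's declared COWC code, or None if it has no cowc tag."""
    parser = CowcMetaParser()
    parser.feed(html)
    return parser.cowc
```

A page whose head contains <meta name="cowc" content="#1357"> would yield "#1357"; a page without the tag yields None and would be left uncategorized.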

Visitors

Once properly categorized, the Web site will receive a higher proportion of visitors who actually want to see it and fewer who were looking for something unrelated.

How search engines will use this information and how cheating can be diminished are covered in the section on Semantic Implementation.


Semantic Implementation,
or How Search Engines Would Use It

Making Sense of the Numbers

When a search engine or index decides to use the CCOW system and store cowc data in its database, it does not, at first, need to make sense of the numbers. But if users are going to reap the benefits of this system, changes must be made at some point to the way the search engine or index processes the stored information.

The administrators of a search engine or index can enable these capabilities. After visiting the CCOW authority's Web site, they can code the category hierarchy into their system.

Giving Users Access

Adding a Delimiter String

One method of allowing users to search with the hierarchy is to add a delimiter string, such as "cowc:", which would be used similarly to "domain:".
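A search engine would first have to separate the limiter from the rest of the query. The Python sketch below shows one way to do so; the function name split_query, the optional #, and the decision to accept one to four hexadecimal digits after "cowc:" are all assumptions made for illustration.

```python
import re

# Assumption: the limiter is 'cowc:' plus an optional '#' and 1 to 4 hex digits.
COWC_LIMITER = re.compile(r"\bcowc:#?([0-9a-fA-F]{1,4})\b")


def split_query(query):
    """Return (category_prefix, free_text) for a raw query string.

    The prefix is None when the query carries no cowc: limiter.
    """
    match = COWC_LIMITER.search(query)
    if match is None:
        return None, query.strip()
    prefix = match.group(1).lower()
    # Remove the limiter and collapse the whitespace it leaves behind.
    remaining = query[:match.start()] + query[match.end():]
    return prefix, " ".join(remaining.split())
```

With this sketch, the query "cowc:#12 tudor queens" separates into the category prefix "12" and the free text "tudor queens".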

Directory, Then String

With the power this system gives robots to automatically put URLs in specific categories, indices can begin using robots or crawlers to find new sites to be added to their directories, and search engines can make their own directories with little extra effort.

With the directory in place, users can browse to a general directory and then apply their search string. This will avoid most of the multiple-meaning problems inherent in a broad and uncategorized free text query.

Downward Inclusion

The hierarchical nature of the system allows for easy inclusion and exclusion for searching. A search in the 12 subcategory should search all categories between #1200 and #12ff while excluding results from #11ff and below and from #1300 and up.

The same is true whether the user browsed to 1200 or entered "cowc:#12" in the search string.
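Downward inclusion reduces to a numeric range check. In this Python sketch, a prefix is padded with 0 to get the lowest category in the subtree and with f to get the highest. The function names are hypothetical, and the sketch takes the short prefix form (such as 12); how a directory maps a browsed node such as 1200 to that prefix is left open here.

```python
HEX_DIGITS = set("0123456789abcdef")


def category_range(prefix):
    """Return the (lowest, highest) category numbers covered by a prefix."""
    digits = prefix.lstrip("#").lower()
    if not 1 <= len(digits) <= 4 or not set(digits) <= HEX_DIGITS:
        raise ValueError(f"not a valid COWC prefix: {prefix!r}")
    low = int(digits.ljust(4, "0"), 16)   # '12' -> 0x1200
    high = int(digits.ljust(4, "f"), 16)  # '12' -> 0x12ff
    return low, high


def in_category(code, prefix):
    """True when a four-digit code falls inside the prefix's subtree."""
    low, high = category_range(prefix)
    return low <= int(code.lstrip("#"), 16) <= high
```

A stored page in #12ff matches a search limited to 12, while pages in #11ff or #1300 do not.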

Preventing Cheaters

This system also has the potential, with some clever coding, to help crawlers discover cheating Web sites.

The subjective nature of keywords has probably been a major reason most search engines do not employ report abuse features to alert them to mismatched results. But with a hierarchy, there is a more objective standard, and search engines can increase the reliability of their results.

A number of possibilities exist to make catching cheaters easier.

Report Links

When a search query is served using the hierarchy, report links can be added to each result so that the porn site that somehow got into the category for children's toys can be reported. If enough searchers report the site, an administrator can verify it and move it to an appropriate category.

This information can then be compiled by trusted authorities.

Trusted Authorities

Trusted authorities can be established to serve XML files to search engines and indices with URLs and their correct categories.

This can be accomplished with simple records, such as:

<record>
    <url>http://someadultsite.com</url>
    <name>Site's Name</name>
    <cowc>#877f</cowc>
</record>

When the search engine records the URL of a Web site, it will check that URL against the domains and page URLs listed on the trusted authority's server. If a subpage on a listed domain declares a category that conflicts with the authority's listing, the crawler will either flag the URL for human processing or record the URL with the trusted authority's categorization.

This will help prevent sites from being recorded in inappropriate categories.
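The check against a trusted authority might be sketched as follows in Python, using the standard library's XML parser. Several things here are assumptions rather than part of the proposal: the <records> wrapper element around the individual records, the function names, matching by host name only rather than by full page URL, and the choice to both apply the authority's category and flag the page when a conflict is found.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def load_authority_records(xml_text):
    """Map each listed host name to the authority's category for it."""
    root = ET.fromstring(xml_text)
    records = {}
    for record in root.iter("record"):
        url = record.findtext("url", default="").strip()
        cowc = record.findtext("cowc", default="").strip().lower()
        if url and cowc:
            # Simplification: key on the host, not on individual page URLs.
            records[urlparse(url).netloc.lower()] = cowc
    return records


def resolve_category(page_url, declared, records):
    """Return (category_to_store, needs_human_review) for a crawled page."""
    host = urlparse(page_url).netloc.lower()
    listed = records.get(host)
    if listed is None or listed == declared.lower():
        return declared.lower(), False
    # Conflict: keep the authority's category and flag the page.
    return listed, True
```

A page on a listed host that declares a different category than the authority's record would be stored under the authority's category and queued for review; pages on unlisted hosts keep the category they declare.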

Category Comparison Across Pages

When a home page server, such as GeoCities or Cox, is added to the database of a search engine or index, it should be categorized as an ISP or home page host. These and other special categories should be marked as allowed to have pages in virtual subcategories #ee00-#eeff, which is the category space for home pages, as well as allowed to have other categories under them.

But when a domain that is not a home page host, such as history.com or joesantiques.com, is added, its pages should be checked against other pages on the domain. If two pages on the same domain have unrelated categories, they should be marked for human processing. This will prevent a commercial site of any sort from simply adding pages to its Web site (or dynamically generating pages) that cover each of the 65,000+ categories in an attempt to bring in visitors who are searching for something unrelated to the site's content.

The main page of the domain (or in the case of ISPs and free and fee-based page hosts, the individual user directories) should contain the topmost category represented by the site. For example, CNN.com would probably put its front page under the category of #0c00 and all pages in its national U.S. news section under the category of #0c31. Likewise, History.com could put its homepage under the category of #3000, with a category of #350a for pages referring to Mary I, because she is an English monarch from the 1500s and #350a is European and Scandinavian History in the 1500s.

However, if the search engine is checking categories against other pages on the same domain, History.com could not have a page categorized #81a5. A page with an unrelated category number would be either rejected from inclusion in the database or flagged for human intervention.
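The proposal does not define exactly when two categories count as unrelated, so the following Python sketch adopts one possible rule as an assumption: a page is related to its domain's main page if its code begins with the main page's code after trailing zeros are removed. The function names and the "accept" and "flag" results are likewise hypothetical.

```python
def is_related(home_code, page_code):
    """Assumed rule: the page's code must extend the home page's prefix."""
    home = home_code.lstrip("#").lower()
    page = page_code.lstrip("#").lower()
    # Trailing zeros mark unused lower levels: '#3000' -> '3', '#0c00' -> '0c'.
    significant = home.rstrip("0")
    return page.startswith(significant)


def check_page(home_code, page_code, is_home_page_host=False):
    """Return 'accept' or 'flag' for a page found on an already-known domain."""
    if is_home_page_host:
        # Home page hosts may carry any category in their user directories.
        return "accept"
    return "accept" if is_related(home_code, page_code) else "flag"
```

Under this rule, a History.com page tagged #350a is accepted against the #3000 homepage, while a page tagged #81a5 is flagged for human processing.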

With these and older safeguards in place, most users on most queries using the hierarchy will be able to find what they actually want without having to wade through sites trying to pull them into something they did not want.

Related Sites

In addition to hindering cheaters, comparing categories can also help users find pages they might not have otherwise seen with the search keywords they chose.

Similar to the function included in the Endeca-powered system of the Antelman et al. study [14], search engines could add category links to the results of every free text query. These category links would add that category marker to the query text and limit the search to that category.

With these links, users can make a first query, look at the results, and click on the category link of a matching result to get more results that are in the right category. Users do not normally stray much from their original queries [14] or try additional queries when a first search fails; Spink et al. found that "most people use few search terms, few modified queries ... and rarely use advanced search features" [15]. Letting users jump from uncategorized searches into categorized searching and browsing would therefore increase their ability to find the information they seek.
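Generating such category links amounts to prepending a category marker to the user's original query. This Python sketch assumes the "cowc:" limiter syntax discussed earlier and represents each result as a (URL, category) pair; the function name and that data shape are hypothetical.

```python
def category_links(query, results):
    """Pair each result URL with a refined query limited to its category.

    `results` is assumed to be a list of (url, cowc_code) tuples.
    """
    links = []
    for url, code in results:
        marker = "cowc:#" + code.lstrip("#").lower()
        links.append((url, f"{marker} {query.strip()}"))
    return links
```

A user who searched for "mary i" and saw a result in #350a could then follow a link that reruns the search as "cowc:#350a mary i".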

Conclusion

With luck, the ability to use categories within searches or searches within categories, and to traverse trees of pages instead of trying to divine the right magic words for a search query, will let users find information easily and efficiently. The system should be intuitive enough that they will not realize they are using an advanced feature. And when people see how much better they are finding what they seek, perhaps they will take time to learn the other advanced features, finally deeming them worth the effort.

There is more information available on these topics.

Recommended Reading

Other Materials on Categorizing and Searching

Search Engine Watch - This site has news and information about topics in the realm of search engines, including ratings of the top ten search terms and search engines. Search Engine Watch also has tips and tricks for webmasters seeking to get good results in search engines, as well as a discussion forum where users can talk about search engines, optimization, industry trends, and other topics related to Web searching. Their discussion area also has a number of featured discussions for more focus on highlighted topics.

Dogpile is a metasearch engine which compiles the results of the more popular and respected search engines, including Google, Ask.com, Yahoo! Search, and Windows Live. This allows users to find results on multiple engines without going to multiple search engine sites. Dogpile offers preference memory, specialized search (Web, Images, White Pages, Audio, etc.), and mature content filtering. Dogpile also suggests closely matching words that might be the correct spelling of a user's search term. Results include sponsored links.

Kosmix is a search engine and directory hybrid that provides for searching along tag-based topics, similar to the Endeca-based search system observed in the research of Antelman et al. [14]. Kosmix lets users explore topics or use a search term to find information. For example, the user can enter a car model name into the search box or click on one of the suggested starting points, such as "Ford Mustang" or the major category, "Autos." The browse and results pages contain briefs of the topic's Wikipedia page (if applicable), related Weblogs, and available video files.

Web Searching, Web Users, Web Queries, etc. - B. J. Jansen's Web site with many articles on the topics of Web searching and keywords.

Works Cited:

[1] Labrie, Ryan, and St. Louis, Robert. "Information retrieval from knowledge management systems: using knowledge hierarchies to overcome keyword limitations." 2003. June 7, 2007 http://myhome.spu.edu/ryanl/AMCIS2003-LaBrie-StLouis.pdf.

[2] Owens-Liston, Peta. "Improving on Wikipedia?" TIME 15 May 2007. 18 June 2007 http://www.time.com/time/business/article/0,8599,1621221,00.html.

[3] The first search engine, Archie. 3 July 2007. http://www.isrl.uiuc.edu/~chip/projects/timeline/1990archie.htm.

[4] Search Engine History. 3 July 2007. http://www.seoconsultants.com/search-engines/history/.

[5] Official Website for the Town of Shalimar, Florida. Shalimar, FL. 18 June 2007. http://www.shalimarflorida.org/.

[6] Guerlain Paris. Guerlain. 18 June 2007 http://www.guerlain.com/index.asp?page=gbasp/parfum/produit.asp%3FID%3D1%26IdAxe%3D1&logo=1.

[7] Manohar Malgonkar. Wikipedia. 15 Sep. 2006. 18 June 2007 http://en.wikipedia.org/wiki/Manohar_Malgonkar.

[8] Kolkata Suburban Railway. Wikipedia. 4 June 2007. 18 June 2007 http://en.wikipedia.org/wiki/Kolkata_suburban_railway.

[9] Weideman, Melius, and Corrie Strumpfer. "The effect of search engine keyword choice and demographic features on internet searching success." Information Technology and Libraries 23.2 (June 2004): 58-66. InfoTrac OneFile. Thomson Gale. University of West Florida. 14 June 2007 http://find.galegroup.com.ezproxy.lib.uwf.edu/itx/infomark.do?&contentset=iac-documents&type=retrieve&tabid=t002&prodid=itof&docid=a120909270&source=gale&srcprod=itof&usergroupname=pens49866&version=1.0.

[10] About DDC. OCLC. 18 June 2007 http://www.oclc.org/dewey/about/default.htm.

[11] Introduction to Dewey Decimal Classification. OCLC. 18 June 2007 http://www.oclc.org/dewey/versions/ddc22print/intro.pdf.

[12] Library of Congress Classification. Library of Congress. 18 June 2007 http://www.loc.gov/catdir/cpso/lcc.html.

[13] Library of Congress Classification. Wikipedia. 16 June 2007. 18 June 2007 http://en.wikipedia.org/wiki/Library_of_Congress_Classification.

[14] Antelman, Kristin, Emily Lynema, and Andrew K. Pace. "Toward a twenty-first century library catalog." Information Technology and Libraries 25.3 (Sept 2006): 128-40. InfoTrac OneFile. Thomson Gale. University of West Florida. 14 June 2007 http://find.galegroup.com.ezproxy.lib.uwf.edu/itx/infomark.do?&contentSet=IAC-Documents&type=retrieve&tabID=T002&prodId=ITOF&docId=A152373992&source=gale&srcprod=ITOF&userGroupName=pens49866&version=1.0.

[15] Spink, Amanda, et al. "Searching the Web: The public and their queries." Journal of the American Society for Information Science and Technology 52.3 (Feb 1, 2001): 226. ABI/INFORM Global.

Photo Credits

* photo-keyboard.jpg came from http://www.freephoto1.com/computer.php, where it was named photo-computer.jpg. Used by permission.
* web.jpg came from http://pdphoto.org/PictureDetail.php?mat=pdef&pg=8281. Public Domain.
* manateelogo.png is (c) Copyright 2007 by Lincoln Sayger. All Rights Reserved.

This page was last updated on 2022.8.25b.