In a recent comparison of text extraction algorithms, Goose, Gravity’s open source project, tied for second place and was even written up over at Read Write Web! I find this very exciting because our project is still quite young and in active development, whereas the algorithms in close standing are mostly well established and semi-finalized. Another interesting point is that most of the competition was built by teams of researchers, you know… Doctors in their fields!
The graph below from Tomaž Kovačič’s study shows only a small portion of the data he collected in his analysis. If you are curious about how he compared these algorithms, I highly recommend you head over to his post. He does a great job exposing the details behind his analysis.
Goose’s wiki provides a very detailed explanation of what Goose is and how it works, and also touches on the original need at Gravity that led to its creation. Jim Plush wrote the first version from the ground up on his own and only recently gave me commit access to the repository. By the time I got into the project, it already had all the bells and whistles required to compete in the analysis completed by Kovačič. My contributions to Goose have been to extend it to allow more specific extraction of additional metadata outside of the primary content, and they have no effect on its standing in the comparison above.
Such a utility can be applied to a wide variety of web content analysis problems, and I’m really glad Plush decided to share it with the rest of the open source community. At Gravity, we have been building a lot of exciting (to me at least) technology, and most of it is held dearly by us and needs to remain a company secret, as it makes up a large part of our company’s overall value. When it comes to analyzing content out here on the web, Goose can be looked at as our trusty messenger, delivering plenty of content for our system to analyze without much of the noise that surrounds it on the web pages it is sourced from.
If you are looking to mine some of the golden nuggets of information that are buried under a ton of ads, peripheral links, site menu structures, and other distracting noise, then why not take a look at what Goose has to offer? If you find anything you think Goose is lacking, or have ideas on anything else that could be improved, let us know on our GitHub repository: https://github.com/jiminoc/goose