In a recent comparison of text extraction algorithms, Gravity’s open source project, Goose, tied for second place and was even written up over at Read Write Web! I find this very exciting because our project is still quite young and in active development, whereas the algorithms in close standing are mostly well established and semi-finalized. Another interesting point is that most of the competition was built by teams of researchers, you know… doctors in their fields!

The graph below from Tomaž Kovačič’s study shows only a small amount of the data he collected in his analysis. If you are curious about how he compared these algorithms, I highly recommend you head over to his post. He does a great job exposing the details behind his analysis.

[Graph: Goose's standing among other algorithms tested]

So what is Goose used for at Gravity and why have we open sourced it?

Goose’s wiki provides a very detailed explanation of what Goose is and how it works, and also touches on the original need we had at Gravity behind its creation. Jim Plush wrote the first version from the ground up on his own and only recently gave me commit access to the repository. By the time I got into the project, it had all the bells and whistles required to compete in the analysis completed by Kovačič. My contributions to Goose have been to extend it to allow for more specific extraction of additional metadata outside of the primary content; they have no effect on its standing above.

Such a utility can be applied to a wide variety of web content analysis problems, and I’m really glad Plush decided to share it with the rest of the open source community. At Gravity, we have been building a lot of exciting (to me at least) technology, and most of it is held dearly by us and needs to remain a company secret, as it makes up a large part of our company’s overall value. When it comes to analyzing content out here on the web, Goose can be looked at as our trusty messenger, delivering plenty of content to our system without a lot of the noise that surrounds it on the web pages it is sourced from.

If you are looking to mine some of the golden nuggets of information that are buried under a ton of ads, peripheral links, site menu structures, and other distracting noise, then why not take a look at what Goose has to offer? If you find anything you think Goose may be lacking, or have ideas on anything else that could be improved, let us know on our GitHub repository: https://github.com/jiminoc/goose

May 16, 2011
 

Well… I got sick of not being able to get my CommunityServer blog to do what I wanted, so I bit the bullet and went WordPress.

For now, all of my previous robnrob.com posts are tucked away in an MSSQL database, hidden from the world. Once I can get the data ported over to this WordPress instance, I most definitely will.

Short post, but I have a lot of work to do.

peace,
– robbie

 

QUESTION 4 my peeps: Do you still use myspace? If yes, why? if no, why not? Please respond at least w/a yes or no. THANKS!

Was posted to these 17 Services at the same time via Ping.fm:

AIM, Bebo, Brightkite, Facebook, FriendFeed, Friendster, GTalk / Google Buzz, LinkedIn, Y! meme, Multiply, MySpace, Ning, Plaxo, Plurk, Posterous, Xanga, Yahoo!

That is the question I asked all of my friends and followers from all of those social networkie sites I’m active on. I asked this about 5 hours before posting the results here:

  1. MySpace (view post): 6 comments:
    4 yes / 2 other (but of course all were a yes by the fact that they used it to respond)
  2. Facebook (view post): 24 comments:
    10 yes / 9 no / 5 other
  3. Twitter (view post): 4 @replies [ 1 , 2 , 3 , 4 ] + 1 direct message:
    3 yes /  1 no / 1 other
  4. Brightkite (view post): 1 comment (actually 2 but from the same person):
    1 no
  5. Work Email (lol, yeah, since a lot of my coworkers follow me): 1 response:
    1 yes

This makes the total “yes” answers 18, the total “no” answers 11, and the “other” answers 8, out of a total of 37 responses.
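
For anyone who wants to double check the math, here is a quick Python sketch that adds up the per-network counts from the list above. The dictionary layout and the `tally` helper are only illustrative names, and a zero is assumed wherever a network listed no answers of a given kind.

```python
# Per-network tallies copied from the list above, as (yes, no, other).
# Zeros are an assumption for any answer type a network did not list.
responses = {
    "MySpace": (4, 0, 2),
    "Facebook": (10, 9, 5),
    "Twitter": (3, 1, 1),
    "Brightkite": (0, 1, 0),
    "Work Email": (1, 0, 0),
}


def tally(counts):
    """Sum the yes/no/other columns across all networks."""
    yes = sum(y for y, _, _ in counts.values())
    no = sum(n for _, n, _ in counts.values())
    other = sum(o for _, _, o in counts.values())
    return {"yes": yes, "no": no, "other": other, "total": yes + no + other}


totals = tally(responses)
print(totals)
```

Keeping the counts in one place like this makes it easy to add late responses and re-run the totals.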

Two interesting points are that, out of the four networks I’ve linked to above, the most commented network gives me a link that can only be viewed by a logged-in Facebook user, and the 2nd network (Twitter) has no “view entire thread” link as the other three do. Either way, I am not some social media guru, so I only have a “normal” following (under 600 on each network), and this may not be much of a representation of the web masses at all. I did, however, find it very interesting and really appreciate all of the responses. It is also worth noting that I posted this same question on 13 other networks without any response at all.

I also need to mention that out of those responding “yes”, about half either mentioned minimal usage or tried to explain why they still use it. As many of you know, I work at MySpace and quite a few of the respondents do as well. 10 of the total yes answers came from my fellow employees at MySpace, and 2 of those stated they only use it for work. To keep this in perspective though, at least that many of my friends/followers used to work at MySpace but no longer do, and that can affect their perspective differently.

Here are some of the key points made by the respondents (I don’t mention too much detail here but covered all of those not publicly visible in the links above):

  1. Of those who do use MySpace, most noted that they use it either not for social networking or very little.
  2. Some mentioned using it either just for music or for finding and connecting with bands.
  3. Out of those that used to use MySpace but no longer do, some mentioned they stopped when they moved on to Twitter or Facebook.
  4. Out of the other reasons why some don’t use MySpace, the top two were bad design and a bad impression of the users who are there.

As pointed out by Christina Gagnier in her article on the Huffington Post:

“The users of MySpace are diverse. It is superficial to cast off MySpace as merely a “digital ghetto” because its demographics may be different from that of the other social networks.”

My friends / followers do not represent the entire internet at all and in fact most likely represent a very small demographic, but this is the same demographic that is most active in the “popular” social networking space.

I have more detailed feedback that I felt would be better suited for a separate post entirely.

© 2011 Rob 'n' Rob