Trifork Blog

How to cluster Seinfeld episodes with Mahout

April 4th, 2011 by Frank Scholten

This February I gave a talk on Mahout clustering at FOSDEM 2011, where I demonstrated how to cluster Seinfeld episodes. A few people wanted to know how to run this example, so I wrote up this short blog post about it. In just a few minutes you can run the Seinfeld demo on your own machine.

Update 24-03-2014: The Seinfeld scripts and demo are no longer available. The original scripts came from stanthecaddy.com, which received a takedown notice for them. I decided to remove the scripts from my GitHub branch as well, so the demo described below no longer works. No soup for you! :-(

Step 1 Get the Mahout and demo sources from GitHub

First make sure you have git installed

Now clone my GitHub repo, enter the cloned directory, and switch to the seinfeld_demo branch

$ git clone https://github.com/frankscholten/mahout
$ cd mahout
$ git checkout seinfeld_demo

Step 2 Build the source tree

From the mahout directory, build the source tree with

$ mvn clean install -DskipTests=true

Step 3 Cluster Seinfeld episodes

Now enter the following

$ examples/bin/seinfeld_vectors.sh
$ examples/bin/seinfeld_kmeans.sh

The first script creates vectors from the plain-text Seinfeld episodes stored in examples/src/main/resources/seinfeld-scripts-preprocessed, and the second script clusters the vectors with K-Means. Finally, Mahout's clusterdump program prints the clusters, along with the top 5 labels and episodes, to standard output.
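For reference, the two scripts roughly wrap Mahout's standard command-line jobs. Here is a hedged sketch of the equivalent manual invocations, with flags reconstructed from the job logs the scripts print; treat it as an approximation and verify the options against your Mahout version before running.

```shell
# Sketch only: approximates what the demo scripts do under the hood.
# Paths and flags are reconstructed from the scripts' log output.

# 1. Convert the plain-text episodes to Hadoop SequenceFiles
bin/mahout seqdirectory \
  -i examples/src/main/resources/seinfeld-scripts-preprocessed \
  -o out-seinfeld-seqfiles -c utf-8

# 2. Create TF-IDF vectors from the sequence files
bin/mahout seq2sparse -i out-seinfeld-seqfiles -o out-seinfeld-vectors

# 3. Cluster the vectors with K-Means (k=100, max 10 iterations)
bin/mahout kmeans \
  -i out-seinfeld-vectors/tfidf-vectors \
  -c out-seinfeld-kmeans/initialclusters \
  -o out-seinfeld-kmeans/clusters \
  -k 100 -x 10 -ow -cl

# 4. Dump the clusters with the top 5 labels per cluster
bin/mahout clusterdump \
  -s out-seinfeld-kmeans/clusters/clusters-1 \
  -d out-seinfeld-vectors/dictionary.file-0 \
  -dt sequencefile -n 5 \
  -p out-seinfeld-kmeans/clusters/clusteredPoints
```

As several commenters note below, seqdirectory may misbehave on a local input directory when running against a Hadoop cluster, so copying the scripts to HDFS first can help.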

Step 4 Experiment!

You can tweak the command-line arguments passed to the Mahout jobs and see how they affect the clustering process. Additionally, you can extend the SeinfeldAnalyzer that ships with the demo at examples/src/main/java/org/apache/mahout/analysis, or write an analyzer of your own.
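As an illustration, a custom analyzer in the spirit of SeinfeldAnalyzer might look like the following. This is a hedged sketch against the Lucene 3.x API that Mahout 0.5 builds on; the class name MyEpisodeAnalyzer and the exact filter chain are my own assumptions, not the demo's actual code.

```java
package org.apache.mahout.analysis;

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;

// Sketch only: a custom Lucene 3.x analyzer along the lines of the
// demo's SeinfeldAnalyzer. MyEpisodeAnalyzer is a hypothetical name.
public class MyEpisodeAnalyzer extends Analyzer {
  @Override
  public TokenStream tokenStream(String fieldName, Reader reader) {
    // Tokenize, lowercase, then drop common English stopwords
    TokenStream stream = new StandardTokenizer(Version.LUCENE_31, reader);
    stream = new LowerCaseFilter(Version.LUCENE_31, stream);
    stream = new StopFilter(Version.LUCENE_31, stream,
        StandardAnalyzer.STOP_WORDS_SET);
    return stream;
  }
}
```

Swapping in a different stopword set, a stemmer, or n-gram filters here is an easy way to see how tokenization choices change the resulting clusters.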

Enjoy!

20 Responses

  1. April 5, 2011 at 20:56 by Rob Terhaar

    So what's the best way to begin digging into Mahout visualization based on this output? Will Gephi work?

  2. April 13, 2011 at 17:27 by Frank Scholten

    @Rob

    Gephi looks quite interesting! After taking a quick look at the docs it seems that you can import different XML formats for graphs. Some code needs to be written to take the different clustering output files (clusters, n-grams, dictionary, points) and join them together into an XML that can be read by Gephi.

  3. May 9, 2011 at 23:21 by Bryan Copeland

    Hi Frank,

This is excellent work. I wonder where you scraped the Seinfeld episode data from?

    I would like to expand the concept you've laid out here to my OpenRecommender project, which aims to build a similar user-rankable cluster for every TV show (and later Movies, Music albums, Books, Events, News articles, etc).

  4. June 19, 2011 at 14:00 by Rui

    Any help with this?:

    Cloning into mahout...
    error: SSL certificate problem, verify that the CA cert is OK. Details:
    error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed while accessing https://github.com/frankscholten/mahout/info/refs

  5. June 19, 2011 at 20:21 by Rui

OK, I found out that it should not have the https...

    However, I got an error:

    $ examples/bin/seinfeld_kmeans.sh
    Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
    No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
    11/06/19 19:40:01 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=out-seinfeld-kmeans/initialclusters, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=out-seinfeld-vectors/tfidf-vectors, --maxIter=10, --method=mapreduce, --numClusters=100, --output=out-seinfeld-kmeans/clusters, --overwrite=null, --startPhase=0, --tempDir=temp}
    11/06/19 19:40:01 INFO common.HadoopUtil: Deleting out-seinfeld-kmeans/initialclusters
    11/06/19 19:40:01 INFO util.NativeCodeLoader: Loaded the native-hadoop library
    11/06/19 19:40:01 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
    11/06/19 19:40:01 INFO compress.CodecPool: Got brand-new compressor
    Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:571)
    at java.util.ArrayList.get(ArrayList.java:349)
    at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:97)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:616)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
    Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
    No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
    11/06/19 19:40:02 INFO common.AbstractJob: Command line arguments: {--dictionary=out-seinfeld-vectors/dictionary.file-0, --dictionaryType=sequencefile, --endPhase=2147483647, --numWords=5, --pointsDir=out-seinfeld-kmeans/clusters/clusteredPoints, --seqFileDir=out-seinfeld-kmeans/clusters/clusters-1, --startPhase=0, --tempDir=temp}
    11/06/19 19:40:03 INFO driver.MahoutDriver: Program took 282 ms

  6. [...] Now you can upload some data to the cluster via the command line, for instance the Seinfeld dataset from one of my earlier blogs [...]

  7. June 21, 2011 at 13:12 by Frank Scholten

    @rui - Did you first create vectors with examples/bin/seinfeld_vectors.sh?

  8. June 24, 2011 at 16:14 by Rui

Yes I did.
I found out what was wrong: the seqfiles were not OK.
So I uploaded the files to HDFS first, ran the script with the corresponding HDFS input directory, and it worked.
It seems I cannot successfully run the sequence file converter on a local directory.

  9. July 18, 2011 at 19:27 by Adam

    I run the command:
    git checkout seinfeld_demo

    and I get...

    error: pathspec 'seinfeld_demo' did not match any file(s) known to git.

    Before this I did a git init on the empty directory I created and then followed up with a clone of the mahout stuff.

    I'm new to git so I am assuming the files are still there and this is just me. I'm starting to pine for SVN at this point. :) Any help would be greatly appreciated.

  10. July 21, 2011 at 16:54 by Benhei

I get the same error as Rui, but I don't understand his solution.

The problem is that seqdirectory produces "empty" files.
Only one file is created: /user/d056324/out-seinfeld-seqfiles/chunk-0.
Its content is "SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.TextÜÒRt£}YZú?S!+R„@"

    Any ideas?

  11. July 22, 2011 at 10:24 by Benhei

... OK, now I somehow got the sequence file generation working.
First I copied the preprocessed scripts to HDFS

    ./bin/hadoop fs -put seinfeld-scripts-preprocessed scripts

then I used

    mahout seqdirectory --input scripts --output out-seqfiles-new -c utf-8

    to generate the sequence file.

  12. [...] Now you can upload some data to the cluster via the command line, for instance the Seinfeld dataset from one of my earlier blogs [...]

  13. October 25, 2011 at 16:11 by Pavan

Hello, really nice article. I was just curious: can you briefly explain what the output directories contain? It would be really helpful. Thank you.

  14. November 14, 2011 at 20:21 by Gary

I too got the IndexOutOfBoundsException. I followed what Benhei did on 22-07-2011. The assumption is that you have a working Hadoop instance running.

  15. November 30, 2011 at 04:34 by Richard

    How much preprocessing did you need to do on the original Seinfeld scripts? I took a quick look at the preprocessed data and it didn't look too different from the original data. But I wasn't checking line-by-line.

    Great work! Very cool.

  16. January 27, 2012 at 21:09 by Renan Oliveira

Very good project. I adapted your code to work with Portuguese stopwords and put it to work on another body of text (not episodes). Congratulations!

  17. February 6, 2012 at 05:21 by Bharat Shrinevas

    I am getting an error on mvn clean install. Error is below. Any pointers? Thanks!

    [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.
    3.2:compile (default-compile) on project mahout-core: Compilation failure
    [ERROR] \Users\bshrinev\mahout\core\src\main\java\org\apache\mahout\cf\taste\imp
    l\model\jdbc\ConnectionPoolDataSource.java:[40,13] error: ConnectionPoolDataSour
    ce is not abstract and does not override abstract method getParentLogger() in Co
    mmonDataSource

  18. February 13, 2012 at 03:47 by Mark Loiseau

    This is such an awesome project. Some friends and I were looking into Mahout and came across this post. We read down the list, built the project as specified and it worked perfectly on the first try. What a great idea.

  19. June 13, 2013 at 02:20 by nik

What if you provided the clusterdump command options?

  20. June 27, 2013 at 21:18 by Sheik

    Hello Frank,
I'm getting major errors when I run mvn clean install in the mahout directory. This only happens when I switch to the demo branch; the build succeeds on the regular mahout directory. Here is the error:

    If anybody could help it would be appreciated

    ------------------------------------------------------------------------
    [INFO] Building Mahout Examples 0.5-SNAPSHOT
    [INFO] ------------------------------------------------------------------------
    [WARNING] The POM for org.apache.mahout:mahout-utils:jar:0.5-SNAPSHOT is missing, no dependency information available
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD FAILURE
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 0.995s
    [INFO] Finished at: Thu Jun 27 15:14:56 EDT 2013
    [INFO] Final Memory: 8M/150M
    [INFO] ------------------------------------------------------------------------
    [ERROR] Failed to execute goal on project mahout-examples: Could not resolve dependencies for project org.apache.mahout:mahout-examples:jar:0.5-SNAPSHOT: Failure to find org.apache.mahout:mahout-utils:jar:0.5-SNAPSHOT in http://repo1.maven.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of maven2-repository.maven.org has elapsed or updates are forced -> [Help 1]
    [ERROR]
    [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
    [ERROR] Re-run Maven using the -X switch to enable full debug logging.
    [ERROR]
    [ERROR] For more information about the errors and possible solutions, please read the following articles:
    [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException
