This February I gave a talk on Mahout clustering at FOSDEM 2011, where I demonstrated how to cluster Seinfeld episodes. A few people wanted to know how to run this example, so I wrote up a short blog post about it. In just a few minutes you can run the Seinfeld demo on your own machine.
Update 24-03-2014: The Seinfeld scripts and demo are no longer available. The original Seinfeld scripts came from stanthecaddy.com, which received a takedown notice for them. I decided to remove the scripts from my GitHub branch as well, so the demo described below no longer works. No soup for you! 🙁
Step 1 Get the Mahout and demo sources from GitHub
First, make sure you have Git installed.
Now clone my GitHub repo and switch to the seinfeld_demo branch:
$ git clone https://github.com/frankscholten/mahout
$ cd mahout
$ git checkout seinfeld_demo
Step 2 Build the source tree
From the mahout directory, build the source tree with:
$ mvn clean install -DskipTests=true
Step 3 Cluster Seinfeld episodes
Now run the following scripts:
$ examples/bin/seinfeld_vectors.sh
$ examples/bin/seinfeld_kmeans.sh
The first script creates vectors from the plain-text Seinfeld episodes stored in examples/src/main/resources/seinfeld-scripts-preprocessed, and the second script clusters the vectors with K-Means. Finally, Mahout’s clusterdump program prints the clusters, along with the top 5 labels and episodes, to standard output.
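Under the hood this is plain K-Means over TF-IDF vectors, just run as a distributed Hadoop job. For readers new to the algorithm, here is a minimal, stdlib-only Java sketch of the same idea: squared Euclidean distance (the measure the demo passes to Mahout) plus the assign/update loop, applied to toy 2-D points instead of real TF-IDF vectors. It is a conceptual illustration, not Mahout's implementation.

```java
import java.util.*;

public class TinyKMeans {
    // Squared Euclidean distance, the same measure the demo passes to Mahout.
    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    // Lloyd's algorithm: assign each point to its nearest centroid,
    // then move each centroid to the mean of its assigned points.
    static int[] cluster(double[][] points, double[][] centroids, int iterations) {
        int[] assignment = new int[points.length];
        for (int it = 0; it < iterations; it++) {
            for (int p = 0; p < points.length; p++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (dist(points[p], centroids[c]) < dist(points[p], centroids[best])) best = c;
                }
                assignment[p] = best;
            }
            double[][] sums = new double[centroids.length][points[0].length];
            int[] counts = new int[centroids.length];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < points[p].length; d++) sums[assignment[p]][d] += points[p][d];
            }
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] == 0) continue; // leave empty clusters where they are
                for (int d = 0; d < sums[c].length; d++) centroids[c][d] = sums[c][d] / counts[c];
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Two obvious groups of 2-D points; the real input would be TF-IDF vectors.
        double[][] points = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9.5}};
        double[][] centroids = {{0, 0}, {10, 10}}; // fixed seeds for a deterministic demo
        System.out.println(Arrays.toString(cluster(points, centroids, 10)));
        // prints [0, 0, 1, 1]
    }
}
```

The demo instead seeds the initial centroids randomly (Mahout's RandomSeedGenerator, visible in the logs below), so its cluster numbering is not deterministic like this toy run.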
Step 4 Experiment!
You can tweak some of the command-line arguments passed to the Mahout jobs and see how they affect the clustering process. Additionally, you can extend the SeinfeldAnalyzer that comes with the demo at examples/src/main/java/org/apache/mahout/analysis, or create an analyzer yourself.
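To give a feel for what an analyzer does before you extend one, here is a stdlib-only sketch of the same kind of pipeline: lowercase, tokenize, drop stopwords. The stopword list below is a made-up sample for illustration, not the demo's actual list, and a real extension would subclass Lucene's Analyzer rather than use plain string splitting.

```java
import java.util.*;

public class SimpleAnalyzer {
    // Tiny illustrative stopword list; the real analyzer ships its own.
    private static final Set<String> STOPWORDS =
            new HashSet<String>(Arrays.asList("the", "a", "and", "of", "to"));

    // Lowercase, split on non-letters, drop stopwords: the same shape of
    // pipeline a Lucene Analyzer builds from a tokenizer plus token filters.
    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<String>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (!t.isEmpty() && !STOPWORDS.contains(t)) tokens.add(t);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The soup of the day"));
        // prints [soup, day]
    }
}
```

Swapping in a different stopword list (one commenter below did this for Portuguese) changes which terms survive into the vectors, and therefore which labels the clusters get.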
Enjoy!
So what’s the best way to begin digging into Mahout visualization based on this output? Will Gephi work?
@Rob
Gephi looks quite interesting! After taking a quick look at the docs it seems that you can import different XML formats for graphs. Some code needs to be written to take the different clustering output files (clusters, n-grams, dictionary, points) and join them together into an XML that can be read by Gephi.
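To sketch what that join-and-export step could look like, here is a stdlib-only Java fragment that turns a cluster-to-episodes map into GEXF, one of the XML graph formats Gephi can import. The cluster label and episode titles are placeholder values, and the exact map would have to be assembled from the clusterdump output files first; a production version should also XML-escape the labels.

```java
import java.util.*;

public class GexfExport {
    // Toy cluster output: cluster label -> episode titles. In the real demo
    // this map would be joined together from the clusterdump output files.
    static String toGexf(Map<String, List<String>> clusters) {
        StringBuilder nodes = new StringBuilder();
        StringBuilder edges = new StringBuilder();
        int id = 0, edgeId = 0;
        for (Map.Entry<String, List<String>> e : clusters.entrySet()) {
            int clusterId = id++;
            nodes.append(String.format("    <node id=\"%d\" label=\"%s\"/>%n", clusterId, e.getKey()));
            for (String episode : e.getValue()) {
                int episodeId = id++;
                nodes.append(String.format("    <node id=\"%d\" label=\"%s\"/>%n", episodeId, episode));
                edges.append(String.format("    <edge id=\"%d\" source=\"%d\" target=\"%d\"/>%n",
                        edgeId++, clusterId, episodeId));
            }
        }
        return "<gexf xmlns=\"http://www.gexf.net/1.2draft\" version=\"1.2\">\n"
                + "  <graph>\n  <nodes>\n" + nodes + "  </nodes>\n  <edges>\n" + edges
                + "  </edges>\n  </graph>\n</gexf>\n";
    }

    public static void main(String[] args) {
        Map<String, List<String>> clusters = new LinkedHashMap<String, List<String>>();
        clusters.put("soup", Arrays.asList("The Soup Nazi", "The Soup"));
        System.out.println(toGexf(clusters));
    }
}
```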
Hi Frank,
This is excellent work. I wonder where you scraped together the Seinfeld episode data from?
I would like to expand the concept you’ve laid out here to my OpenRecommender project, which aims to build a similar user-rankable cluster for every TV show (and later Movies, Music albums, Books, Events, News articles, etc).
Any help with this?
Cloning into mahout…
error: SSL certificate problem, verify that the CA cert is OK. Details:
error:14090086:SSL routines:SSL3_GET_SERVER_CERTIFICATE:certificate verify failed while accessing https://github.com/frankscholten/mahout/info/refs
Ok. I found out that it should not have the https…
However, I got an error:
$ examples/bin/seinfeld_kmeans.sh
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
11/06/19 19:40:01 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=out-seinfeld-kmeans/initialclusters, --convergenceDelta=0.5, --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure, --endPhase=2147483647, --input=out-seinfeld-vectors/tfidf-vectors, --maxIter=10, --method=mapreduce, --numClusters=100, --output=out-seinfeld-kmeans/clusters, --overwrite=null, --startPhase=0, --tempDir=temp}
11/06/19 19:40:01 INFO common.HadoopUtil: Deleting out-seinfeld-kmeans/initialclusters
11/06/19 19:40:01 INFO util.NativeCodeLoader: Loaded the native-hadoop library
11/06/19 19:40:01 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
11/06/19 19:40:01 INFO compress.CodecPool: Got brand-new compressor
Exception in thread "main" java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
at java.util.ArrayList.rangeCheck(ArrayList.java:571)
at java.util.ArrayList.get(ArrayList.java:349)
at org.apache.mahout.clustering.kmeans.RandomSeedGenerator.buildRandom(RandomSeedGenerator.java:107)
at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:97)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:54)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:187)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:616)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
No HADOOP_CONF_DIR set, using /usr/local/hadoop/conf
11/06/19 19:40:02 INFO common.AbstractJob: Command line arguments: {--dictionary=out-seinfeld-vectors/dictionary.file-0, --dictionaryType=sequencefile, --endPhase=2147483647, --numWords=5, --pointsDir=out-seinfeld-kmeans/clusters/clusteredPoints, --seqFileDir=out-seinfeld-kmeans/clusters/clusters-1, --startPhase=0, --tempDir=temp}
11/06/19 19:40:03 INFO driver.MahoutDriver: Program took 282 ms
[…] Now you can upload some data to the cluster via the command line, for instance the Seinfeld dataset from one of my earlier blogs […]
@rui – Did you first create vectors with examples/bin/seinfeld_vectors.sh?
Yes I did.
I figured it out.
I noticed afterwards that the seqfiles were not OK, so I uploaded the files to HDFS first, ran the script with the corresponding HDFS input directory, and it worked.
It seems that I cannot successfully run the sequence file converter from a local directory.
I ran the command:
git checkout seinfeld_demo
and I get…
error: pathspec ‘seinfeld_demo’ did not match any file(s) known to git.
Before this I did a git init on the empty directory I created and then followed up with a clone of the mahout stuff.
I’m new to git so I am assuming the files are still there and this is just me. I’m starting to pine for SVN at this point. 🙂 Any help would be greatly appreciated.
I get the same error as Rui, but I don’t understand his solution.
The problem is that seqdirectory produces “empty” files.
There’s only one file created: /user/d056324/out-seinfeld-seqfiles/chunk-0.
Its content is “SEQorg.apache.hadoop.io.Textorg.apache.hadoop.io.TextÜÒRt£}YZú?S!+R„@”
Any ideas?
… OK, now I somehow got the sequence file generation working.
First I copied the preprocessed scripts to HDFS:
./bin/hadoop fs -put seinfeld-scripts-preprocessed scripts
then I used
mahout seqdirectory --input scripts --output out-seqfiles-new -c utf-8
to generate the sequence file.
Hello, really nice article. I was just curious, can you briefly explain what the output directories contain? It would be really helpful. Thank you.
I too got the IndexOutOfBoundsException. I followed what Benhei did on 22-07-2011. This assumes you have a working Hadoop instance running.
How much preprocessing did you need to do on the original Seinfeld scripts? I took a quick look at the preprocessed data and it didn’t look too different from the original data. But I wasn’t checking line-by-line.
Great work! Very cool.
Very good project. I adapted your code to work with stopwords in Portuguese and put it to work on another body of text (not episodes). Congratulations!
I am getting an error on mvn clean install. Error is below. Any pointers? Thanks!
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:2.3.2:compile (default-compile) on project mahout-core: Compilation failure
[ERROR] \Users\bshrinev\mahout\core\src\main\java\org\apache\mahout\cf\taste\impl\model\jdbc\ConnectionPoolDataSource.java:[40,13] error: ConnectionPoolDataSource is not abstract and does not override abstract method getParentLogger() in CommonDataSource
This is such an awesome project. Some friends and I were looking into Mahout and came across this post. We read down the list, built the project as specified and it worked perfectly on the first try. What a great idea.
What if you provided the clusterdump command options?
Hello Frank,
I’m getting major errors when I run mvn clean install in the mahout directory. This only happens when I switch to the demo branch. The build was successful on the regular mahout directory. Here is the error:
If anybody could help it would be appreciated
------------------------------------------------------------------------
[INFO] Building Mahout Examples 0.5-SNAPSHOT
[INFO] ------------------------------------------------------------------------
[WARNING] The POM for org.apache.mahout:mahout-utils:jar:0.5-SNAPSHOT is missing, no dependency information available
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 0.995s
[INFO] Finished at: Thu Jun 27 15:14:56 EDT 2013
[INFO] Final Memory: 8M/150M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal on project mahout-examples: Could not resolve dependencies for project org.apache.mahout:mahout-examples:jar:0.5-SNAPSHOT: Failure to find org.apache.mahout:mahout-utils:jar:0.5-SNAPSHOT in http://repo1.maven.org/maven2 was cached in the local repository, resolution will not be reattempted until the update interval of maven2-repository.maven.org has elapsed or updates are forced -> [Help 1]
[ERROR]
[ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch.
[ERROR] Re-run Maven using the -X switch to enable full debug logging.
[ERROR]
[ERROR] For more information about the errors and possible solutions, please read the following articles:
[ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException