Trifork Blog

Posts by Frank Scholten

How to manage your Docker runtime config with Vagrant

July 20th, 2014 by Frank Scholten (http://blog.trifork.com/2014/07/20/how-to-manage-your-docker-runtime-config-with-vagrant/)

In this short blog I will show you how to manage a Docker container using Vagrant. Since version 1.6 Vagrant supports Docker as a provider, next to the existing providers for VirtualBox and AWS. With the new Docker support, Vagrant boxes start much faster. In turn, Vagrant makes Docker easier to use, since its runtime configuration can be stored in the Vagrantfile: you no longer have to pass runtime parameters on the command line every time you want to start a container. Read on if you'd like to see how I create a Vagrantfile for an existing Docker image from Quinten's Docker cookbooks collection.
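
For example (the image name and port mapping below are made up for illustration), the day-to-day difference looks roughly like this:

# Without the Docker provider, the runtime flags go on the command line every time
$ docker run -d -p 9200:9200 --name elasticsearch quinten/elasticsearch

# With those settings stored in the Vagrantfile, starting the container is simply
$ vagrant up --provider=docker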

Read the rest of this entry »

An Introduction To Mahout's Logistic Regression SGD Classifier

February 4th, 2014 by Frank Scholten (http://blog.trifork.com/2014/02/04/an-introduction-to-mahouts-logistic-regression-sgd-classifier/)

This blog features classification in Mahout and the underlying concepts. I will explain the basic classification process, training a Logistic Regression model with Stochastic Gradient Descent, and give a walkthrough of classifying the Iris flower dataset with Mahout.

Read the rest of this entry »

Docker From A Distance - The Remote API

December 24th, 2013 by Frank Scholten (http://blog.trifork.com/2013/12/24/docker-from-a-distance-the-remote-api/)

Many people use Docker from the command line to build images, run containers and manage Docker on their machine. However, you can also run the same Docker commands via its remote REST API. In this blog I will guide you through Docker's remote API using curl, while pointing out a few details and tools that you might not know about. We will remotely search and pull an elasticsearch image, run a container and clean up after ourselves.
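
For instance, assuming the daemon has been bound to a TCP socket (e.g. started with -H tcp://127.0.0.1:4243; by default it only listens on a Unix socket), a few example calls look like this:

$ curl -s http://127.0.0.1:4243/version
$ curl -s 'http://127.0.0.1:4243/images/search?term=elasticsearch'
$ curl -s http://127.0.0.1:4243/containers/json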

Read the rest of this entry »

NLUUG DevOps Conference 2013 - Reliability, clouds and the UNIX way

November 26th, 2013 by Frank Scholten (http://blog.trifork.com/2013/11/26/nluug-devops-conference-2013-reliability-clouds-and-the-unix-way/)

Last Thursday I attended the NLUUG DevOps conference in Bunnik, near Utrecht. The NLUUG is the Dutch UNIX user group. In this blog I will summarize the talks I attended, share some fun things I learned, and discuss my own talk about continuous integration at a large organization.

Read the rest of this entry »

Puppet from the trenches - How to prevent overwritten user configuration with a custom type

October 29th, 2013 by Frank Scholten (http://blog.trifork.com/2013/10/29/puppet-from-the-trenches-how-to-prevent-overwritten-user-configuration-with-a-custom-type/)

In this installment of the 'from the trenches' series I cover the use of Puppet during one of our projects. We have used Puppet to provision Jenkins instances as part of a build and deployment platform for a large organization. I discuss the problem of Puppet overwriting user-managed configuration, and how we solved it by writing a custom type.

Read the rest of this entry »

Bash - A few commands to use again and again

March 28th, 2013 by Frank Scholten (http://blog.trifork.com/2013/03/28/bash-a-few-commands-to-use-again-and-again/)

Introduction

These days I spend a lot of time in the bash shell. I use it for ad-hoc scripting and for driving several Linux boxes. In my current project we are setting up a continuous delivery environment and migrating code onto it. I lift code from CVS to SVN, mavenize Ant builds and funnel artifacts into Nexus. One script I wrote determines whether a jar that was checked into a CVS source tree exists in Nexus or not. This check can be done via the Nexus REST API. More on this script at the end of the blog. But first, let's have a look at a few bash commands that I use all the time in day-to-day work, in no particular order.

  1. find

    Find searches files recursively in the current directory.

    $ find -name '*.jar'

    This command lists all jars in the current directory, recursively. We use this command to figure out if a source tree has jars. If this is the case we add them to Nexus and to the pom as part of the migration from Ant to Maven.

    $ find -name '*.jar' -exec sha1sum {} \;

    Find combined with exec is very powerful. This command lists the jars and computes a sha1sum for each of them. The sha1sum command is put directly after the -exec flag. The {} is replaced with each jar that is found. The \; is an escaped semicolon that tells find where the command ends.

  2. for

    For loops are often the basis of my shell scripts. I start with a for loop that just echoes some values to the terminal so I can check that it works, and then go from there.

    $ for i in $(cat items.txt); do echo $i; done

    The for list must be followed by either a newline or a ';' before the do keyword. When the for loop is OK I add more commands between do and done. Note that I could also have used find -exec, but if a script is more than a one-liner I prefer a for loop for readability.

  3. tr

    Transliterate. You can use this to get rid of certain characters or replace them, character by character.

    $ echo 'Com_Acme_Library' | tr '_A-Z' '.a-z'

    Lowercases and replaces underscores with dots.

  4. awk

    $ echo 'one two three' | awk '{ print $2, $3 }'

    Prints the second and third columns of the output. Awk is of course a full-blown programming language, but I tend to use small snippets like this a lot for selecting columns from the output of another command.

  5. sed

    Stream EDitor. A complete tool on its own, yet I use it mostly for small substitutions.

    $ echo 'foo bar baz' | sed -e 's/foo/quux/'

    Replaces foo with quux.

  6. xargs

    Run a command on every line of input on standard in.

    $ cat jars.txt | xargs -n1 sha1sum

    Run sha1sum on every line in the file. This is another alternative to a for loop or find -exec. I use it when I have a long pipeline of commands in a one-liner and want to process every line of the end result.

  7. grep

    Here are some grep features you might not know:

    $ grep -A3 -B3 keyword data.txt

    This will list matches of keyword in data.txt, including 3 lines after (-A3) and 3 lines before (-B3) each match.

    $ grep -v keyword data.txt

    Inverse match. Match everything except keyword.

  8. sort

    Sort is another command often used at the end of a pipeline. For numerical sorting use

    $ sort -n

  9. Reverse search (CTRL-R)

    This one isn't a real command but it's really useful. Instead of typing history and looking up a previous command, press CTRL-R, start typing, and let bash autocomplete from your history. Press Escape to quit reverse-search mode. When you press CTRL-R your prompt will look like this:

    (reverse-i-search)`':

  10. !!

    Pronounced 'bang-bang'. Repeats the previous command. Here is the cool thing:

    $ !!:s/foo/bar

    This repeats the previous command, but with foo replaced by bar. Useful if you entered a long command with a typo. Instead of manually replacing one of the arguments replace it this way.

Bash script - checking artifacts in Nexus

Below is the script I talked about. It loops over every jar and dll file in the current directory, calls Nexus via curl and optionally outputs a pom dependency snippet. It also adds a status column at the end of the output, either an OK or a KO, which makes the output easy to grep for further processing.

    #!/bin/bash

    # For every jar/dll under the current directory: look up its sha1 in Nexus
    # and, when found, write a pom dependency snippet next to it.

    ok=0
    jars=0

    for jar in $(find "$(pwd)" -name '*.jar' -o -name '*.dll' 2>/dev/null)
    do
        ((jars+=1))

        output="$(basename "$jar")-pom.xml"
        sha1=$(sha1sum "$jar" | awk '{print $1}')

        response=$(curl -s "http://oss.sonatype.org/service/local/data_index?sha1=$sha1")

        if [[ $response =~ groupId ]]; then
            ((ok+=1))
            echo "findjars $jar OK"
            echo "<dependency>" >> "$output"
            echo "$response" | grep groupId -A3 -m1 >> "$output"
            echo "</dependency>" >> "$output"
        else
            echo "findjars $jar KO"
        fi
    done

    if [[ $jars -gt 0 ]]; then
        echo "findjars Found $ok/$jars jars/dlls. See -pom.xml file for XML snippet"
        exit 1
    fi
    

Conclusions

It is amazing what you can do in terms of scripting when you combine just these commands via pipes and redirection! It's like Pareto's law of shell scripting: 20% of the features of bash and related tools provide 80% of the results. The basis of most scripts can be a for loop. Inside the for loop the resulting data can be transliterated, grepped, replaced by sed and finally run through another program via xargs.
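
For instance, a toy pipeline in that spirit, assuming a jars.txt with one jar path per line as in the xargs example above: keep only the filenames, lowercase them and turn underscores into dots, drop anything containing 'test', strip the .jar extension and hand each result to another command:

    $ for f in $(cat jars.txt); do basename "$f"; done | tr '_A-Z' '.a-z' | grep -v test | sed -e 's/\.jar$//' | xargs -I{} echo "artifact: {}"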

References

The Bash Cookbook is a great collection of solutions to common problems in bash. It also teaches good bash coding style.

QCon London 2013 - Simplicity, complexity and doodles

March 21st, 2013 by Frank Scholten (http://blog.trifork.com/2013/03/21/qcon-london-2013-simplicity-complexity-and-doodles/)

Westminster Abbey - View from the Queen Elizabeth II conference center

...and now back home

On my desk lies a stack of notepads from the QCon sponsors. I pick up one of them and turn a few pages, trying to decipher my own handwriting. As I read my notes I reflect back on the conference. QCon had a great line-up and awesome keynote speakers: Turing award winner Barbara Liskov, Ward Cunningham, inventor of the Wiki, and of course Damian Conway, who gave two highly entertaining keynotes. My colleague Sven Johann and I were at QCon for three days. We attended a few talks together but also went our own way from time to time. Below I discuss the talks I attended that Sven didn't cover in his QCon blog from last week.

Ideas not art: drawing out solutions - Heather Willems

The first talk I cover has nothing to do with software technology but with communication. Heather Willems showed us the value of communicating ideas visually. She started the talk with an entertaining discussion of the benefits of drawing in knowledge work. Diagrams and visuals help us retain information and support group discussion. The short of it: it's OK to doodle. In fact, it is encouraged!

The second part of the talk was a mini-workshop where we learned how to create our own icons and draw faces expressing basic emotions. These icons can form the building blocks of bigger diagrams. Earlier in the day Heather made a graphic recording of Barbara Liskov's keynote, in real time: she was drawing on the spot based on what Barbara was talking about!

Graphic recording of Barbara Liskov's keynote 'The power of abstraction'

You are not a software developer! - Russel Miles

A thought-provoking talk by Russel Miles about simplicity in problem solving. His main message: in the last decade we learned to deliver software quite well, and we now face a different problem: overproduction. Problems can often be solved much more easily, or without writing software at all. Russel argues that software developers find requirements boring yet have the drive to code, hence they sometimes create complex, over-engineered solutions.

He also warns against oversimplifying: a solution so simple that the value we seek is lost. His concluding remark relates to a key tenet of Agile development: delivering valuable software frequently. He proposes to instead focus on 'delivering valuable change frequently': work on the change you want to accomplish rather than cranking out new features. These ideas are related to the concept of impact mapping, which, he revealed at the end, he had used to structure the presentation itself :-)

Want to see Russel live? He will be giving an updated version of this presentation at a GOTO night in Amsterdam on May 14 and he'll be speaking at GOTO Amsterdam in June too.

The inevitability of failure - Dave Cliff

In this talk professor Dave Cliff of the Large Scale Complex IT Systems group at the University of Bristol warns us about the ever-growing complexity of large-scale software systems, especially automated traders in financial markets. Dave mentions recent stock market crashes as failures. These failures did not make big waves in the news, but could have had catastrophic effects if the markets had not recovered properly. He discusses an interesting concept: normalization of deviance.

Every time a safety margin is crossed without problems, it becomes more likely that the safety margin will be ignored in the future. He argues that we were quite lucky with the temporary market crashes: because of 'normalization of deviance' it's only a matter of time before a serious failure occurs. Unfortunately I missed an overview of ways to prevent these kinds of problems, if they can be prevented at all. A principle from cybernetics, Ashby's law of requisite variety, says that a system can only be controlled if the controller has enough variety in its actions to compensate for any behaviour of the system to be controlled. In a financial market, with many interacting traders, human or not, this isn't the case.

Performance testing Java applications - Martin Thompson

An informative talk about performance testing Java applications. It starts with fundamental definitions and covers tools and approaches for all sorts of performance testing. Martin proposes a red-green-debug-profile-refactor cycle in order to really know what is happening with your code and how it performs. Another takeaway is the difference between performance testing and optimization. Yes, defer optimization until you need it, but that is not a reason not to know the boundaries of your system. When load testing, use a framework that spends little time on parsing requests and responses. All good points, and I'll have to read his slides again later for all the links to the tools he suggests for performance testing.

Insanely Better Presentations - Damian Conway

A great talk on how to give presentations. Damian shows examples of bad slides and refactors them during his talk. He discusses fear of public speaking, how to properly prepare a talk, and many more great tips. I won't do the talk justice by describing it in text; many of Conway's ideas have to be seen live to make sense. Nevertheless, there is a method to the madness:

  • Dump everything you know on the subject
  • Decide on 5 main points and create a storyline that flows between them
  • Toss out everything that does not fit the storyline
  • Simplicity - show less content, on more slides
  • Use highlighting for code walkthroughs
  • Use animations to show code refactorings
  • Get rid of distractions
  • The most important part of a presentation is person-to-person communication!
  • Practice in front of an audience at least 3 times. Even if it is just your cat.

Visualization with HTML 5 - Dio Synodinos

In this tour of technologies for visualizing data, Dio showed everything from CSS3 to SVG, Processing and D3.js. For each of these he gave a good overview of the pros and cons, and made specific animations and demos for all of them. He also mentioned pure CSS3 iOS icons. Lots of eye candy, and from reading the #QconLondon Twitter stream it seems quite a few people wanted to try out all these frameworks and technologies.

Coffee breaks

Thankfully, there were plenty of coffee breaks at the conference. During breaks I often bumped into Sejal and Daphne, as well as other Triforkers from both our Zurich and Aarhus offices. Besides attending talks we went to a nice conference party and went out to dinner a few times. Between talks Sven and I met up and chatted about what we saw, whilst grabbing some delicious cookies here and there. Unfortunately the chocolate chip ones were gone most of the time!

Souvenir

At one point I took the elevator to the top floor. On my right is a large table covered with techy books. Conference goers try to walk by, but look over and can't help but gravitate to this mountain of tech information. Of course I couldn't resist either so I browsed a bit and finally bought 'Team Geek - A software developer's guide to working well with others'. Later on I visit the web development open space. I listen in on a few conversations and end up chatting with James and Kathy, the camera operators, while they are packing their stuff. They have been filming all the talks for the last three days and we talk a bit about the conference until the place closes down.

All in all QCon London 2013 was a great conference!

Berlin Buzzwords 2012 Recap

June 7th, 2012 by Frank Scholten (http://blog.trifork.com/2012/06/07/berlin-buzzwords-2012-recap/)

This is a recap of Berlin Buzzwords 2012, the two-day conference on everything scale, search and store in the NoSQL world. Martijn van Groningen and I arrive in Berlin Sunday evening. Unfortunately we are too late for the infamous Barcamp, a low-key mix of lightning talks, beer and socializing, so we decide to have a beer at the hotel instead.

Day 1

So let's fast forward to Monday morning. Berlin Buzzwords kicks off this year with a record of over 700 delegates from all over the world. The first keynote session was by Leslie Hawthorn on community building. Using gardening as a metaphor she talks us through slides of gardening tools, plucking weeds and landscaping as activities for building a healthy community. Her advice: display all the 'paths to success' on the site, and nip unproductive mailing list discussions in the bud using a project 'mission statement'. My mind wanders off as I think about how I can apply this to Apache Whirr. Then I think about the upcoming release that I still want to do testing for. "Community building is hard work", she says. Point taken.

Storm and Hadoop

Ted Dunning gave an entertaining talk on closing the gap between the real-time features of Storm and the batch nature of Hadoop. Besides that he talked about an approach called Bayesian Bandits for doing real-time A/B testing, while casually performing a coin trick to explain the statistics behind it.

Socializing, coding & the geek-play area

During the rest of the day I visit lots of talks, but I also occasionally socialize and hang out with my peers. In the hallway I bump into old friends and meet many new people from the open source community. At other times we wander outside to the geek-and-play area to watch people play table tennis and enjoy the inspiring and laid-back vibe of this pretty unique conference.

Occasionally I feel an urge to do some coding. "You should get a bigger screen", a fellow Buzzworder says as he points at my laptop. I snap out of my coding trance and realize I'm hunched over my laptop, peering at the screen from up close. I'm like most people at the conference, multi-tasking between listening to a talk, intense coding, and sending out a #bbuzz Tweet. There are just so many inspiring things that grab your attention, I can hardly keep up with them.

Mahout at Berlin Buzzwords

Almost every developer from the Mahout development team hangs out at Buzzwords. As I give my talk, Robin Anil and Grant Ingersoll type up last-minute patches and close the final issues for the 0.7 Mahout release. On stage I discuss Whirr's Mahout service, which automatically installs Mahout on a cluster. In hindsight the title didn't fit that well, as I mostly talked about Whirr, not Machine Learning with Mahout. Furthermore, my talk finished way earlier than expected; bad Frank. Note to self: better planning next time.

Day 2

Elasticsearch

Day two features quite a few talks on ElasticSearch. In one session Shay Banon, ElasticSearch's founder, talks about how his framework handles 'Big Data' and explains design principles like shard overallocation, the performance penalties of splitting your index, and so on. In a different session Lukáš Vlček and Karel Minarik fire up an ElasticSearch cluster during their talk. The big screen updates continuously with stats on every node in the cluster. The crowd cracks up laughing as Clinton Gormley suddenly joins the live cluster using his phone and laptop.

New directions in Mahout

In the early afternoon Simon Willnauer announced that one of the talks had to be cancelled; luckily Ted Dunning could fill the spot by giving another talk. This time the subject was the future of Mahout. He discussed the upcoming 0.7 release, which is largely a clean-up and refactoring effort. Additionally he discussed two future contributions: pig-vector and the streaming K-means clustering algorithm.

The idea of pig-vector is to create a glue layer between Pig and Mahout. Currently you run Mahout by first transforming your data, say a directory of text files, into a format that Mahout can read, and then running the actual Mahout algorithms. The goal of pig-vector is to use Pig to shoehorn your data into a form Mahout can use. The benefit of using Pig is that it is designed to read data from a lot of sources, so Mahout can focus on the actual machine learning.

The upcoming streaming K-means clustering algorithm looks very promising. "No knobs, the system adapts to the data", he says. Ted refers to the problem that existing Mahout algorithms require a lot of parameter tweaking. On top of that the algorithm is blazingly fast too, using clever tricks like projection search. Even though the original K-means algorithm is easily parallelizable using MapReduce, the downside is that it is iterative and has to make several passes over the data, which is very inefficient for large datasets.

See you next year!

This wraps up my coverage of Berlin Buzzwords. There were of course many more talks; I just covered my personal area of interest, which is largely Mahout. The only downside was the weather in Berlin this year, but other than that I really enjoyed the conference. Many thanks to everyone involved in organizing the conference, and roll on 2013!

Frank Scholten joins Apache Whirr development team

March 8th, 2012 by Frank Scholten (http://blog.trifork.com/2012/03/08/frank-scholten-joins-apache-whirr-development-team/)

I am pleased to announce that I have been voted in as a committer on Apache Whirr! Whirr is a Java library for quickly setting up services in the cloud. For example, using Whirr you can start a Hadoop cluster on Amazon in 5 minutes by configuring a simple property file and running the whirr command-line tool. See the quick start guide for more information.
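
As an illustration, a minimal Hadoop recipe along the lines of the quick start might look like this (instance template names and properties can differ between Whirr versions):

$ cat > hadoop.properties <<'EOF'
whirr.cluster-name=myhadoopcluster
whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,3 hadoop-datanode+hadoop-tasktracker
whirr.provider=aws-ec2
whirr.identity=${env:AWS_ACCESS_KEY_ID}
whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
EOF
$ whirr launch-cluster --config hadoop.properties
$ whirr destroy-cluster --config hadoop.properties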

Hadoop is only one of the supported services, however. Whirr supports several NoSQL databases, distributed computing platforms and related tools: currently HBase, Hama, Ganglia, Zookeeper, ElasticSearch, Mahout, Puppet, Chef and Voldemort.

One of my contributions was the Mahout service, which installs the Mahout binary distribution on a given node. When used in conjunction with Hadoop you can have a fully operational Mahout cluster in minutes. For more information about using the Mahout service, check out this blog on Mahout support in Whirr on the community site SearchWorkings.org.

More services are continuously being added to Whirr. For instance, the Solr and MongoDB services are planned for the upcoming 0.8.0 release. If you would like to keep up to date with Whirr, check out the project page or subscribe to the mailing list.

Using your Lucene index as input to your Mahout job - Part I

March 5th, 2012 by Frank Scholten (http://blog.trifork.com/2012/03/05/using-your-lucene-index-as-input-to-your-mahout-job-part-i/)

This blog shows you how to use an upcoming Mahout feature, the lucene2seq program (https://issues.apache.org/jira/browse/MAHOUT-944). This program reads the contents of stored fields in your Lucene index and converts them into text sequence files, to be used by a Mahout text clustering job. The tool contains both a sequential and a MapReduce implementation and can be run from the command line or from Java using a bean configuration object. In this blog I demonstrate how to use the sequential version on an index of Wikipedia.

Introduction

When working with Mahout text clustering or classification you preprocess your data so it can be understood by Mahout. Mahout contains input tools such as seqdirectory and seqemailarchives for fetching data from different input sources and transforming them into text sequence files. The resulting sequence files are then fed into seq2sparse to create Mahout vectors. Finally you can run one of Mahout's algorithms on these vectors to do text clustering.
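
For example, the first two steps of that pipeline look roughly like this (the paths are made up and options can differ slightly between Mahout versions):

$ bin/mahout seqdirectory -i /path/to/textfiles -o text-seqfiles -c UTF-8
$ bin/mahout seq2sparse -i text-seqfiles -o text-vectors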

The lucene2seq program

Recently a new input tool has been added, lucene2seq, which allows you to read from stored fields of a Lucene index to create text sequence files. This is different from the existing lucene.vector program, which reads term vectors from a Lucene index and transforms them into Mahout vectors straight away. When using the original text content you can take full advantage of Mahout's collocation identification algorithm, which improves clustering results.

Let's look at the lucene2seq program in more detail by running

$ bin/mahout lucene2seq --help

This will print out all the program's options.

Job-Specific Options:                                                           
  --output (-o) output       The directory pathname for output.                 
  --dir (-d) dir             The Lucene directory                               
  --idField (-i) idField     The field in the index containing the id           
  --fields (-f) fields       The stored field(s) in the index containing text   
  --query (-q) query         (Optional) Lucene query. Defaults to               
                             MatchAllDocsQuery                                  
  --maxHits (-n) maxHits     (Optional) Max hits. Defaults to 2147483647        
  --method (-xm) method      The execution method to use: sequential or         
                             mapreduce. Default is mapreduce                    
  --help (-h)                Print out help                                     
  --tempDir tempDir          Intermediate output directory                      
  --startPhase startPhase    First phase to run                                 
  --endPhase endPhase        Last phase to run

The required parameters are the Lucene directory path(s), the output path, the id field and the list of stored fields. The tool will fetch all documents and create a key-value pair where the key equals the value of the id field and the value equals the concatenated values of the stored fields. The optional parameters are a Lucene query, a maximum number of hits and the execution method, sequential or MapReduce. The tool can be run like any other Mahout tool.

Converting an index of Wikipedia articles to sequence files

To demonstrate lucene2seq we will convert an index of Wikipedia articles to sequence files. Check out the Lucene 3x branch, download a part of the Wikipedia articles dump and run a benchmark algorithm to create an index of the articles in the dump.

$ svn checkout http://svn.apache.org/repos/asf/lucene/dev/branches/branch_3x lucene_3x
$ cd lucene_3x/lucene/contrib/benchmark
$ mkdir temp work
$ cd temp
$ wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
$ bunzip2 enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2

The next step is to run a benchmark 'algorithm' to index the Wikipedia dump. Contrib benchmark contains several of these algorithms in the conf directory. For this demo we only index a small part of the Wikipedia dump, so edit the conf/wikipediaOneRound.alg file so it points to enwiki-latest-pages-articles1.xml-p000000010p000010000. For an overview of the syntax of these benchmarking algorithms, check out the benchmark.byTask package-summary Javadocs.

Now it's time to create the index

$ cd ..
$ ant run-task -Dtask.alg=conf/wikipediaOneRound.alg -Dtask.mem=2048M

The next step is to run lucene2seq on the generated index under work/index. Check out the lucene2seq branch from GitHub

$ git clone https://github.com/frankscholten/mahout
$ cd mahout
$ git checkout lucene2seq
$ mvn clean install -DskipTests=true

Change back to the lucene 3x contrib/benchmark work dir and run

$ <path/to>/bin/mahout lucene2seq -d index -o wikipedia-seq -i docid -f title,body -q 'body:java' -xm sequential

This creates sequence files of all documents that contain the term 'java'. From here you can run seq2sparse followed by a clustering algorithm to cluster the text contents of the articles.
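
Continuing the example, a sketch of those follow-up steps (the cluster count and iteration limit are arbitrary, and option names can vary slightly between Mahout versions):

$ <path/to>/bin/mahout seq2sparse -i wikipedia-seq -o wikipedia-vectors
$ <path/to>/bin/mahout kmeans -i wikipedia-vectors/tfidf-vectors -c wikipedia-centroids -o wikipedia-kmeans -k 20 -x 10 -cl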

Running the sequential version in Java

The lucene2seq program can also be run from Java. First create a LuceneStorageConfiguration bean and pass in the list of index paths, the sequence files output path, the id field and the list of stored fields in the constructor.

LuceneStorageConfiguration luceneStorageConf = new LuceneStorageConfiguration(configuration, asList(index), seqFilesOutputPath, "id", asList("title", "description"));

You can then optionally set a Lucene query and max hits via setters

luceneStorageConf.setQuery(new TermQuery(new Term("body", "Java")));
luceneStorageConf.setMaxHits(10000);

Now you can run the tool by calling the run method with the configuration as a parameter

LuceneIndexToSequenceFiles lucene2seq = new LuceneIndexToSequenceFiles();
lucene2seq.run(luceneStorageConf);

Conclusions

In this post I showed you how to use lucene2seq on an index of Wikipedia articles. I hope this tool will make it easier for you to start using Mahout text clustering. In a future blog post I will discuss how to run the MapReduce version on very large indexes. Feel free to post comments or feedback below.