With the growing need to process huge volumes of data, it is evident that computing at this scale, with a real-time component, is not something a simple client-server architecture can handle. What was needed was an architecture that handles massive quantities of data by taking advantage of both batch and stream processing across clusters, with fault tolerance and delegation in mind. It also required selecting the right set of tools to cater to such massive data processing. Although several tools fit into the Lambda architecture, the most popular combination today is Kafka for messaging and queuing, Spark and Spark Streaming for processing the data, and Cassandra (a NoSQL database) as the persistence layer.

In this article we will explore a simple program that uses all of these technologies and provides a good foundation for end-to-end integration, to get you started with the Lambda Architecture. I will keep it as simple as possible, so that you do not get overwhelmed trying to understand all the tools at once, while still using current concepts and code to make the session worthwhile. Later, I will provide some links and references for further study.

Ok ... Enough of talking ... So let’s get started!!

The Lambda Architecture

The Lambda architecture provides a model for processing large quantities of distributed data in a reliable fashion by taking advantage of both batch and stream processing. Spark does both, using its built-in libraries for streaming and map-reduce. The following diagram shows a high-level view of the Lambda architecture.

The speed layer consists of a processing component that aggregates, reduces and transforms a micro-batch of data in a stream. The result can be posted back to Kafka and pushed to a UI via a simple websocket, providing immediate data: think of a running total, a count or a current location being served from hot data. The batch layer is more about persistence, machine learning and visualizing analytical trends. At the same time, Spark writes this data to storage (Cassandra in this case) using a sink. The versatility of Apache Spark's API for both batch/ETL and streaming workloads brings the promise of the Lambda architecture to the real world. Writing a consumer that fetches data from Kafka is already explained in my previous article, so I will skip building the UI here. We will see shortly how to write to Cassandra.

Spark has numerous libraries to perform real-time and batch transformations on data. It has recently gained a lot of popularity due to its extensive support for graph processing, machine learning and SQL, backed by very active contributors, forums and documentation. The Spark ecosystem includes Kafka, Spark, Spark Streaming and a wide range of drivers for real-time data processing and for sinking data to external storage such as Cassandra or HDFS (Hadoop Distributed File System).

I will also skip the benefits of using Kafka or Cassandra in the Spark ecosystem for now; some links are provided later in this article for further reading. Our goal is to see the big picture of the ecosystem and how it works.

Now let us see how Spark integrates with these underlying technologies to make it an excellent candidate for the Lambda architecture.

As we can see in the above diagram, Spark connects efficiently with Kafka and Cassandra to stream data in both directions, while also catering to apps that can connect and request information from Spark in real time. Sounds cool, doesn't it? Data ingestion, processing, persistence and visualization, all happening in real time and in micro batches.

As a developer you should be able to make sense of this quickly, since it is something we can relate to, much like a client-server architecture. Check out my previous article on Kafka, where I explained how Kafka works as a pub-sub messaging framework. The source of data depends on your implementation of the producer that sends data to Kafka. Spark, as a consumer, collects the data by subscribing to a Kafka topic using Spark streaming. The data is received incrementally in batches called micro batches. Spark uses its core engine components, which comprise libraries such as Map-Reduce, GraphX, MLlib and SQL, to perform transformations and aggregations incrementally on these micro batches of data. Later it persists the data to external storage such as HDFS or Cassandra. This does not mean Spark cannot be a producer; however, in this example it is a consumer.

Spark Ecosystem

I have already demonstrated how to set up Kafka in my previous article, and that should be enough for this demo. I will also skip the steps involved in installing Spark and Scala for the sake of this article; refer to some of the links provided later to get your environment up and running on Windows. I am using Spark 2.1.0, Scala 2.11.8, JDK 1.8, Cassandra 3.10 and Kafka 0.10.

To make it more like a real implementation, spanning different teams and skills, I have implemented the Kafka producer in .Net, mimicking a real-world distributed ecosystem where data ingestion can happen through multiple sources and mechanisms. Next, I have the Spark core and consumer written in Scala as a Maven project using IntelliJ.

Pre-Requisites

I am assuming that anyone reading this article knows the basics of the Scala, Java and .Net programming languages. You can still do this example purely in Scala, Java or .Net with the respective frameworks available. Scala is native to Spark, while Java and .Net use adapters; Microsoft has a project called Mobius that is in its early stages but very promising. Apart from the programming language, I would also encourage you to do some reading on the Lambda architecture, Spark basics and Cassandra (a NoSQL database) to get familiar with the concepts and practices.

I will be using a Maven-style project structure and IntelliJ as the IDE, but you are free to use your own choice of structure and IDE. I am also assuming that Zookeeper, Kafka, Spark, Scala and Cassandra are already installed.

Creating Project

To begin, create a new Maven project in IntelliJ with JDK 1.8.x and add a new module to it. Select the newly created module and add Scala framework support by right-clicking it and selecting "Add framework support" from the context menu. If you do not see Scala in the list, you are most likely missing the Scala and sbt installation. Also remove the java folder from under the source and instead add a scala folder to your source root. Make sure the "scala" folder is marked as a source root by right-clicking it and using the "Mark directory as" context menu.


Project Dependencies

The following are the dependencies declared in the POM file. They include the core libraries you will need for all the bells and whistles. The full file is available in the GitHub project link.
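
If you prefer not to open the GitHub file right away, the sketch below shows the kind of dependency section such a POM typically needs for the versions used in this article (Spark 2.1.0, Scala 2.11, the DataStax Cassandra connector and Typesafe Config). It is illustrative only; the exact artifacts and versions are in the repository.

<!-- Illustrative dependency sketch only; see the GitHub project for the actual POM -->
<dependencies>
    <!-- Spark SQL (pulls in spark-core transitively) -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>
    <!-- Kafka source for Structured Streaming -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
        <version>2.1.0</version>
    </dependency>
    <!-- DataStax Spark Cassandra connector (also brings the Java driver used by the cassandra package) -->
    <dependency>
        <groupId>com.datastax.spark</groupId>
        <artifactId>spark-cassandra-connector_2.11</artifactId>
        <version>2.0.0</version>
    </dependency>
    <!-- Typesafe Config, used by the Settings object -->
    <dependency>
        <groupId>com.typesafe</groupId>
        <artifactId>config</artifactId>
        <version>1.3.1</version>
    </dependency>
</dependencies>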


Kafka Producer (.Net)

I have created some dummy data, available under the resource directory, to work with. You can use online tools to create similar data or collect data from a real API. The next step is to push the data to Kafka under a topic. The following is a simple .Net console program that reads the CSV file and pushes the data to Kafka in batches, mimicking a real-time incremental data capture scenario. See my Kafka article to understand how to create a Kafka producer using .Net and the Confluent library.


IntelliJ Project Structure (Scala)

The IntelliJ project includes the following implementations, which act as the consumer, speed layer, batch processing and serving layer. Let us walk through each object; I will explain as we move forward.

resources: This folder contains application.conf for static application configuration, log4j.properties and the dummy data file.

settings {
  spark_wd = "/Apache/Logs/spark/wd"
  hadoop = "hdfs://localhost:11025/user/manish"
  checkPoint = "D:/Apache/Logs/spark/cp"
}

  • spark_wd: Points to a local directory used as Spark's working directory.
  • hadoop: References the Hadoop HDFS folder, if you want to read/write anything from there.
  • checkPoint: This folder is used by Spark to manage its checkpoints when recovering from a failure.

config: This package contains the Settings object that exposes the application configuration to the rest of the application.

package config

import com.typesafe.config.ConfigFactory

object Settings {
    private val config = ConfigFactory.load()

    object WebLogGen {
        private val weblogGen = config.getConfig("settings")
        lazy val sparkWd: String = weblogGen.getString("spark_wd")
        lazy val HadoopHome: String = weblogGen.getString("hadoop")
        lazy val checkPointDir: String = weblogGen.getString("checkPoint")
    }
}

model: This package declares the models that I will use to create objects from the data received through Kafka streaming, and during transformations.

package object model {

    case class ProductSale(Id: Int, firstName: String, lastName: String, house: Int, street: String, city: String, state: String, zip: String, prod: String, tag: String) extends Serializable

    case class ProvinceSale(prod: String, state: String, total: Long) extends Serializable

}

utils: This package contains common code for initializing the Spark session. The instance is created by getSparkSession, which sets the application name, working directory, checkpoint directory and Cassandra connector. Note the call to the getOrCreate function. A streaming application must operate 24/7 and hence must be resilient to failures, so Spark maintains checkpoints. This enables Spark to recreate its DAG (Directed Acyclic Graph) in case of failure and resume from where it failed instead of starting afresh. However, checkpointed data is only usable as long as the underlying code has not been modified.

SparkSession is an abstraction introduced in version 2.0 that unifies SparkContext and SQLContext, whose separate use caused a lot of confusion.

package utils

import java.lang.management.ManagementFactory

import config.Settings.WebLogGen
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

object SparkUtils {

    val isIDE: Boolean = {
        //If executing inside IDE
        ManagementFactory.getRuntimeMXBean.getInputArguments.toString.contains("IntelliJ IDEA")
    }

    def getSparkSession: SparkSession = {
        //get spark configuration
        val conf = new SparkConf()
            .setAppName("Simple-Lambda")
            .set("spark.cassandra.connection.host", "localhost")
            .set("spark.local.dir", WebLogGen.sparkWd)

        val checkpointDirectory = WebLogGen.checkPointDir

        //If executing inside IDE
        if (isIDE) {
            System.setProperty("hadoop.home.dir", WebLogGen.HadoopHome)
            conf.setMaster("local[*]")
        }

        //init spark session
        val spark = SparkSession.builder()
            .config(conf)
            .getOrCreate()

        spark.sparkContext.setLogLevel("WARN")
        spark.sparkContext.setCheckpointDir(checkpointDirectory)

        spark
    }

}

cassandra: This package contains the statements issued to Cassandra. I will be using the Spark Structured Streaming API, which is the latest and greatest enhancement in Spark. However, it comes with some challenges, since support from external technologies is not fully developed and/or still in alpha/beta stage. That said, it does not stop us from using it and being productive by writing a little extra code.

Currently there is no out-of-the-box Cassandra sink for the Spark Structured Streaming API; that support is still maturing. However, with some knowledge of CQL and Cassandra, it is fairly easy to develop your own custom query helper functions, as shown below.

package cassandra

import com.datastax.driver.core.{ResultSet, Session}
import com.datastax.spark.connector.cql.CassandraConnector
import model.ProvinceSale
import org.apache.spark.sql.SparkSession

object Statement {
    def getConnector(spark: SparkSession): CassandraConnector ={
        val connector = CassandraConnector.apply(spark.sparkContext.getConf)
        connector
    }

    private def cql_update(prod: String, state: String, total: Long): String =
        s"""insert into demo1.sales (prod,state,total) values ('$prod', '$state', $total)""".stripMargin

    def updateProvinceSale(connector: CassandraConnector, value: ProvinceSale): ResultSet = {
        connector.withSessionDo { session =>
            session.execute(cql_update(value.prod, value.state, value.total))
        }
    }

    def createKeySpaceAndTable(session: Session, dropTable: Boolean = false): ResultSet = {
        session.execute(
            """CREATE KEYSPACE  if not exists  demo1 WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 1 };""")
        if (dropTable)
            session.execute("""drop table if exists my_keyspace.test_table""")

        session.execute(
            """create table if not exists demo1.sales (prod text, state text, total int, PRIMARY KEY (prod, state));""")
    }
}
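
The article does not show where createKeySpaceAndTable is invoked; one option (a sketch, assuming you call it once from the job before starting the streaming query) is to run the DDL through the same connector:

// Sketch: ensure the keyspace and table exist before the stream starts.
val conn = Statement.getConnector(spark)
conn.withSessionDo { session =>
  Statement.createKeySpaceAndTable(session, dropTable = false)
}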

streaming: This is the heart of the program, where all the magic happens. It contains the code that connects to Kafka, retrieves batches of records, performs transformations and aggregations, and finally saves the results to Cassandra.

The Structured Streaming API is new to Spark, and it takes a lot of the loading, transformation and aggregation complexity away from developers. It is a processing engine built on the Spark SQL engine, which takes care of running the query incrementally and continuously updating the final result for each micro batch. It is an abstraction based on Repeated Queries (RQ), where the query is repeated for each batch of data. The API simplifies things by abstracting away the incremental processing, so the processing logic stays the same whether you process the data as a whole or in batches; the entire dataset is simply seen as a collection of micro batches. Developers are left with a simplified outlook, where they mainly have to be concerned with the source, the incremental execution logic and the sink.
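
To make the "same logic for batch and stream" idea concrete, here is a minimal sketch (with a hypothetical CSV path): the aggregation is written once as an ordinary DataFrame transformation and can be applied to a bounded batch source now, or to the streaming Dataset we build later in this article.

import org.apache.spark.sql.{DataFrame, SparkSession}

object RepeatedQueryDemo {
    // The incremental execution logic: identical for batch and stream.
    def salesByProvince(sales: DataFrame): DataFrame =
        sales.groupBy("prod", "state").count()

    def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("rq-demo").master("local[*]").getOrCreate()

        // Batch: a CSV file with prod/state columns (hypothetical path).
        val batchSales = spark.read.option("header", "true").csv("data/sales.csv")
        salesByProvince(batchSales).show()   // runs once over the whole file

        // Stream: the very same function is reused on the streaming Dataset built from
        // Kafka later in this article, written out with writeStream instead of show().
    }
}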


IMPORTS


If you have worked with Java or .Net, it should be clear that these references are required whenever an object from the corresponding package is used.

package streaming

import cassandra._
import model.{ProductSale, ProvinceSale}
import org.apache.spark.sql.streaming.StreamingQuery
import org.apache.spark.sql.{DataFrame, SparkSession}
import utils.SparkUtils.getSparkSession

Function "Main()"


The entire job lives inside the main function, whose signature is similar to the main function of a Java or .Net console program. In the following code, the first line starts a new Spark session by calling the getSparkSession function defined in the SparkUtils object inside the utils package. It then imports spark.implicits._, where the underscore plays the same role as "*" in Java or .Net.

object LambdaJob {
    def main(args: Array[String]): Unit = {

        // setup spark context
        val spark: SparkSession = getSparkSession
        import spark.implicits._

        //.........
    }
}

Kafka Stream


The following lines of code do a number of things using method chaining: they load the Kafka connection driver, define the connection options (host and port) for Kafka, subscribe as a consumer and start reading the stream. The next line extracts the "value" column, which contains the actual data; I have ignored the other metadata for now.

// load stream from kafka
val input: DataFrame = spark
    .readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "sales")
    .load()

// select the values
val df = input.selectExpr("CAST(value AS STRING)").as[String]
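
If you do want the metadata that I am ignoring here, the Kafka source exposes it as additional columns alongside value; a quick sketch:

// Sketch: pull the Kafka metadata columns along with the payload.
val withMeta = input.selectExpr(
    "CAST(key AS STRING)",
    "CAST(value AS STRING)",
    "topic", "partition", "offset", "timestamp")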

Create LIST<MODEL>


The next job is to deserialize the raw strings into a workable list of objects. For this, Spark exposes the Structured Streaming API, which is a lot easier than previous versions that required learning RDDs (Resilient Distributed Datasets), DStreams and DataFrames separately. Datasets are one of the abstractions introduced to represent a set of data in the same way whether it is a batch or a stream. The new API hides the need to understand the difference between batch and stream: the API we will use works on both, while the underlying Spark core does the hard work of joining and combining the stateful operations.

// transform the raw input to a Dataset of model.ProductSale
val ds = df.map(r => r.split(",")).map(c => ProductSale(c(0).toInt, c(1), c(2), c(3).toInt, c(4), c(5), c(6),c(7), c(8), c(9)))

The code simply splits each string on commas and deserializes it into the ProductSale model.
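
One thing to keep in mind: a single malformed line would make c(0).toInt throw and fail the whole micro batch. A slightly more defensive sketch filters out rows that do not have the expected ten fields first (it still assumes the Id and house values are numeric):

// Sketch: skip rows that do not have all ten comma-separated fields.
val safeDs = df
    .map(_.split(",", -1))
    .filter(_.length >= 10)
    .map(c => ProductSale(c(0).toInt, c(1), c(2), c(3).toInt, c(4), c(5), c(6), c(7), c(8), c(9)))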

Aggregation and Transformation


This is the incremental execution, with continuous transformations and aggregations happening on each batch of the stream. The code below is a simple aggregation for the sake of this article, but this is where you would implement the entire gamut of complex map-reduce logic.

// do continuous aggregation
val aggDF = ds.groupBy("prod", "state").count()

// transform aggregated dataset to a Dataset of model.ProvinceSale
val provincialSale = aggDF.map(r => ProvinceSale(r.getString(0), r.getString(1), r.getLong(2)))

Cassandra CQL


The following code obtains a Cassandra connector and defines a ForeachWriter that sinks the aggregated data to Cassandra.

///// CASSANDRA ////
val connector = Statement.getConnector(spark)

// This Foreach sink writer writes the output to cassandra.
import org.apache.spark.sql.ForeachWriter
val writer = new ForeachWriter[ProvinceSale] {
    override def open(partitionId: Long, version: Long) = true
    override def process(value: ProvinceSale): Unit = {
        Statement.updateProvinceSale(connector, value)
    }
    override def close(errorOrNull: Throwable): Unit = {}
}
///// CASSANDRA END ////

Once the incremental execution is defined, it is time to sink the data to external storage. The sink supports three output modes:

  • Complete: the entire result is written to external storage.
  • Append: only the new rows since the last trigger are written to external storage.
  • Update: only the rows updated since the last trigger are written to external storage.

The call query.awaitTermination() tells Spark to wait until the stream terminates. It accepts an optional timeout parameter; if no value is provided it waits indefinitely.

val query: StreamingQuery = provincialSale.writeStream
    .queryName("ProvincialSales")
    .outputMode("complete")
    .foreach(writer)
    .start()

query.awaitTermination()

Run the Program

To execute the program you need Kafka and Cassandra running in the background. Once you have that, start the LambdaJob consumer, followed by the .Net producer. This will insert the expected data into the Cassandra demo1.sales table. The code for this demo is available on my GitHub account for download.
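
As an optional sanity check (a sketch, not part of the original job), you can read the aggregated table back through the Spark Cassandra connector, for example from a spark-shell that has the same dependencies on its classpath and spark.cassandra.connection.host configured as in SparkUtils:

// Sketch: read demo1.sales back and print it to verify the sink worked.
val sales = spark.read
    .format("org.apache.spark.sql.cassandra")
    .options(Map("keyspace" -> "demo1", "table" -> "sales"))
    .load()
sales.show()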


Further Reading

Enjoy!!

 
  Posted in:  Big-Data



The boom of the social networking era contributed to very high volumes of data that are constantly changing and moving. Processing such huge data in real time was a big challenge, so LinkedIn engineers started a pet project to meet the requirements of a high-volume distributed messaging system that could process trillions of messages across several clusters. This is how Kafka was born.

Kafka was built with speed and scalability in mind. It mainly solves the streaming and queuing of high-volume data. The key design principle behind Kafka is a simple messaging API for publish-subscribe (pub-sub) of data, with high throughput. It is also one of the most durable, low-latency and secure messaging systems around.

I was looking forward to using it in one of my projects, which was built with .Net. I could have used Java as the platform, but I came across the Confluent Kafka client and it changed my mind. While trying to implement it I found several articles outdated and obsolete, and I only figured out the steps after some struggle. That is why I decided to write this article on how to set up your environment on the Windows operating system. To keep this article less confusing, I will assume that the root of all installations is C:\Apache (in my case the drive is different).

Kickstart Kafka in 7 steps

Step 1: Download and Install Java
Download the JDK and install it using the default options. I installed it such that my JRE folder is C:\Program Files\Java\jre1.8.0_121.

Step 2: Download and Install Zookeeper
Download the latest stable release of Zookeeper. Download the compressed file and extract it with WinRAR or any other tool you prefer into C:\Apache. After extraction you should see C:\Apache\zookeeper-3.4.10 (if you downloaded 3.4.10).

Step 3: Download and Install Kafka
Download the latest stable release of Kafka. Extract the compressed files into C:\Apache, such that the Kafka folder looks like C:\Apache\kafka_2.12-0.10.2.0.

Step 4: Configuring environment

  • Create the following environment variables with the given values:
  • JAVA_HOME = C:\Program Files\Java\jre1.8.0_121
  • _JAVA_OPTIONS = -Xmx512M -Xms512M
  • ZOOKEEPER_HOME = C:\Apache\zookeeper-3.4.10
  • Find and edit the environment variable "Path" and append the following locations, separated by semicolons:
  • %JAVA_HOME%\bin; %ZOOKEEPER_HOME%\bin

Create a sub-folder "Logs" under "C:\Apache" folder, which will be used by Kafka and Zookeeper to maintain its logs and indexes. Within "Logs" sub-folder create two empty folders "kafka" and "zookeeper". We will use these folders in the next step to configure Kafka and Zookeeper configuration.

Step 5: Configuring Kafka Properties

  • config\server.properties (in the Kafka folder)
        log.dirs = C:/Apache/Logs/kafka
  • config\zookeeper.properties (in the Kafka folder)
        dataDir = C:/Apache/Logs/zookeeper

Step 6: Running Kafka Server
In this section all the commands shown assume that your current directory in the command window is C:\Apache\kafka_2.12-0.10.2.0\ before executing any command. You can Shift+Right-Click inside the C:\Apache\kafka_2.12-0.10.2.0\ folder in Windows Explorer and select "Open command window here", or use the cd command after opening the command prompt to set your current folder.

Open a command window and start Zookeeper by executing the following command line:
.\bin\windows\zookeeper-server-start.bat .\config\zookeeper.properties

Since Kafka uses Zookeeper, it should be started before Kafka. If all went fine, Zookeeper will be listening on port 2181.

Now start Kafka by opening another command window with the same current directory and executing the command ".\bin\windows\kafka-server-start.bat .\config\server.properties". If successful, Kafka will start and bind to port 9092.

Create Topic: Open another command window and change directory to bin\windows sub folder. Execute the following command to create a topic for Kafka.
kafka-topics.bat --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

You can list the topics created by issuing the command - kafka-topics.bat --list --zookeeper localhost:2181

Test Producer: In a new command window, change directory to the bin\windows sub-folder and execute the following command:
kafka-console-producer.bat --broker-list localhost:9092 --topic test

Test Consumer: Open another command window and change directory to the bin\windows sub-folder. Execute the following command to start a console consumer that prints the messages in the topic.
kafka-console-consumer.bat --bootstrap-server localhost:9092 --topic test --from-beginning

Step 7: Using Confluent Kafka .NET
Open Visual Studio and create two new console app projects named SimpleProducer and SimpleConsumer. Then select "Manage NuGet Packages for Solution…" from the Tools > NuGet Package Manager menu and install Confluent.Kafka for both projects.

Now add the following code and references to Producer and Consumer applications respectively.

Producer Code:

Consumer Code:

Test your console EXEs: Rebuild and run the producer and consumer EXEs from their respective bin/debug folders. If everything has gone fine, you can play with the producer and consumer by typing any text into the producer and seeing it appear in the consumer.

For further reference and learning about Kafka, use the following links.

Enjoy!!

 
  Posted in:  .Net Big-Data



A very interesting algorithm that uses a greedy approach to find the shortest path from a starting node to the other nodes.
Imagine a set of connected pathways that take different times to traverse, as shown in the diagram below.
using System;
using System.Collections.Generic;

public class Node
{
    public string Name { get; set; }
    public int Cost { get; set; }
    public string From { get; set; }
    public bool Visited { get; set; }
    public Dictionary<int, Node> Neighbors { get; set; }
}

public class Graph
{
    private List<Node> NodeGraph { get; }

    public Graph(List<Node> graph)
    {
        NodeGraph = graph;
    }

    public List<Node> ShortestPath()
    {
        // Starting from the first node.
        var start = NodeGraph[0];

        // If picked, the item is visited.
        while (start != null && start.Visited == false)
        {
            start.Visited = true;

            // Adjust the cost of visiting neighbors if not already visited.
            // If the cost is lower when visiting from another node, replace the cost value.
            foreach (var neighbor in start.Neighbors)
            {
                var cost = start.Cost + neighbor.Key;
                if (neighbor.Value.Cost > cost && !neighbor.Value.Visited)
                {
                    // Adjust cost and remember predecessor.
                    neighbor.Value.Cost = cost;
                    neighbor.Value.From = start.Name;
                }
            }

            // Find the minimum cost unvisited node for the next iteration
            // (scan all reachable, unvisited nodes in the graph).
            Node next = null;
            foreach (var node in NodeGraph)
            {
                if (node.Visited || node.Cost == int.MaxValue) continue;
                if (next == null || node.Cost < next.Cost) next = node;
            }
            start = next;
        }

        // Finally return the adjusted nodes.
        return NodeGraph;
    }
}

public class Program
{
    public static void Main(string[] args)
    {
        Node _aNode, _bNode, _cNode, _dNode, _eNode, _fNode, _gNode, _hNode;

        _aNode = new Node { Name = "A", Cost = 0, Visited = false };
        _bNode = new Node { Name = "B", Cost = int.MaxValue, Visited = false };
        _cNode = new Node { Name = "C", Cost = int.MaxValue, Visited = false };
        _dNode = new Node { Name = "D", Cost = int.MaxValue, Visited = false };
        _eNode = new Node { Name = "E", Cost = int.MaxValue, Visited = false };
        _fNode = new Node { Name = "F", Cost = int.MaxValue, Visited = false };
        _gNode = new Node { Name = "G", Cost = int.MaxValue, Visited = false };
        _hNode = new Node { Name = "H", Cost = int.MaxValue, Visited = false };

        // A can go to D, B, G
        // B can go to F
        // C can go to D, F, H
        // D can go to G
        // E can go to B, G
        // F can go to C, D
        // G can go to A
        // H can go to none
        _aNode.Neighbors = new Dictionary<int, Node>() { { 20, _bNode }, { 80, _dNode }, { 90, _gNode } };
        _bNode.Neighbors = new Dictionary<int, Node>() { { 10, _fNode } };
        _cNode.Neighbors = new Dictionary<int, Node>() { { 10, _dNode }, { 50, _fNode }, { 20, _hNode } };
        _dNode.Neighbors = new Dictionary<int, Node>() { { 20, _gNode } };
        _eNode.Neighbors = new Dictionary<int, Node>() { { 50, _bNode }, { 30, _gNode } };
        _fNode.Neighbors = new Dictionary<int, Node>() { { 10, _cNode }, { 40, _dNode } };
        _gNode.Neighbors = new Dictionary<int, Node>() { { 20, _aNode } };
        _hNode.Neighbors = new Dictionary<int, Node>();

        var graph = new List<Node>() { _aNode, _bNode, _cNode, _dNode, _eNode, _fNode, _gNode, _hNode };

        // Initialize the graph and call Dijkstra's algorithm.
        var sp = new Graph(graph);
        graph = sp.ShortestPath();

        // Print the graph with the shortest path costs and predecessors.
        foreach (var node in graph)
        {
            Console.WriteLine("Node = " + node.Name + ", Cost = " + node.Cost + " from " + node.From);
        }
        Console.ReadLine();
    }
}

 
  Posted in:  Algorithm



React and Flux Environment Setup

Starting a React project and setting up the environment did not come easily to me, which is why I decided to write this post. The idea here is not to delve deep into React concepts, but to help set up the stadium to play the match. This may come easily to seasoned front-end developers, but it is still difficult for someone starting with React. I will demonstrate how to set up an environment with the necessary ingredients, so you can focus on React concepts rather than struggling with the environment.

First: Have Node installed

You should install Node to get started with the environment setup and learning. Browse to Node.js and install the appropriate version for your machine. Fifty percent of the battle is won right here.

Choose your editor

I know Visual Studio users would love to stick with it, but in my experience it is still far from ideal here. I myself used VSCode. You may use other editors from the list below; I will mention only a couple that are awesome.
  1. VSCode
  2. Webstorm (not free)
  3. Sublime
  4. Atom
Choose your battleground, but the ones above are the best. They all come with an integrated command terminal, built in or available through plugins.

.editorconfig

This is the file where you set up your editor configuration defaults, such as tab size, trailing spaces, EOL and much more, as per your preferences. It is optional if you are satisfied with the defaults provided by your editor of choice. The example below shows the configuration I have for VSCode.
# editorconfig.org
root = true

[*]
indent_style = tab
indent_size = 2
end_of_line = lf
charset = utf-8
trim_trailing_whitespace = true
insert_final_newline = true

[*.md]
trim_trailing_whitespace = false

Package.json

Instead of going through the many ways to initialize it, I will get straight to the point. Create a folder that will be the root of your application, then create a file named package.json in this folder. This file defines all the frameworks required by your project; by looking at an example you will be able to guess much of it. The following is the package.json for a React project that uses React, React-Router, Gulp, Flux, ESLint, Bootstrap and Browserify, besides some utility tools.
{ "name": "starter", "version": "1.0.0", "description": "React Kickstart", "main": "index.js", "scripts": { "test": "echo \"Error: no test specified\" && exit 1" }, "author": "Manish", "license": "MIT", "dependencies": { "bootstrap": "^3.3.5", "browserify": "^11.0.1", "flux": "^2.0.3", "gulp": "^3.9.0", "gulp-concat": "^2.6.0", "gulp-connect": "^2.2.0", "gulp-eslint": "^0.15.0", "gulp-open": "^1.0.0", "jquery": "^2.1.4", "lodash": "^3.10.0", "object-assign": "^3.0.0", "react": "^0.13.3", "react-router": "^0.13.3", "reactify": "^1.1.1", "toastr": "^2.1.0", "vinyl-source-stream": "^1.1.0" }, "devDependencies": { "gulp-livereload": "^3.8.1" } }

NPM Install

After this, use the integrated terminal window, DOS prompt, PowerShell or Bash to run the following command: npm install (assuming Node.js is installed globally on your machine).
This will install all the frameworks defined in package.json into your project folder under node_modules. Voila!!! Your project is now set up with the required frameworks. The dependencies are required for the production release, while devDependencies are only required for your development environment. The following is a quick description of each library mentioned in package.json.
  • Bootstrap: For responsive design.
  • Browserify: Lets you require('modules') in the browser.
  • Flux: An architecture that Facebook uses internally when working with React. It is not a framework or a library; it is simply a new kind of architecture that complements React and the concept of unidirectional data flow.
  • Gulp: Used to automate the build.
  • Gulp-concat: Simply concatenates files.
  • Gulp-connect: Gulp plugin to run a webserver (with LiveReload).
  • Gulp-eslint: For linting your codebase (verify code quality).
  • Gulp-open: Opens files and URLs with gulp.
  • Jquery: JavaScript framework required by Bootstrap.
  • Lodash: A toolkit of JavaScript functions that provides clean, performant methods for manipulating objects and collections.
  • Object-assign: For merging objects. Used by gulp.
  • React: A declarative, efficient, and flexible JavaScript library for building user interfaces.
  • React-router: Delivers navigation elegance to React.
  • Toastr: A JavaScript library for Gnome/Growl-type non-blocking notifications.
  • Vinyl-source-stream: Used to manage streams in gulp.
  • Gulp-livereload: Hot reload and refresh.

Project Structure explained

The following project structure is simplified, keeping in mind the scope of this article; however, it is close to what many experts would suggest for a development environment. The project contains all source code inside the "src" folder. All configuration files, such as package.json, eslint.config.json and gulpfile.js, reside at the root folder. The "dist" folder is meant for distribution; it is empty in the beginning, and the gulp tasks minify, bundle and copy files into it. node_modules holds the frameworks that get installed when "npm install" is executed from the command prompt.

Automation and Build with gulpfile.js

The gulpfile.js contains the build definition. Gulp is one of the most popular and easiest-to-understand tools for automating your build. Each step of the build process is defined as a task, with one task that combines these sub-tasks into a sequence, calling each of them to complete the build. You can visit the gulp tutorial to take a deep dive; I will provide a quick explanation of each task below.

Initialization – This is the topmost section, defining the variables and plugins that the gulp tasks will use. They are assumed to be available to gulp from the installed node modules.
"use strict"; var gulp = require('gulp'); var connect = require('gulp-connect'); //Runs a local dev server var open = require('gulp-open'); //Open URL in a web browser var livereload = require('gulp-livereload'); var browserify = require('browserify'); // Bundles JS var reactify = require('reactify'); // Transforms React JSX to JS var source = require('vinyl-source-stream'); // Use conventional text streams with Gulp var concat = require('gulp-concat'); //Concatenates files var lint = require('gulp-eslint'); //Lint JS files, including JSX
Configuration – This section provides gulp with the inputs that define its behavior during automation and execution.
var config = {
    port: 9005,
    devBaseUrl: 'http://localhost',
    paths: {
        html: './src/*.html',
        js: './src/**/*.js',
        images: './src/images/*',
        css: [
            'node_modules/bootstrap/dist/css/bootstrap.min.css',
            'node_modules/bootstrap/dist/css/bootstrap-theme.min.css',
            'node_modules/toastr/toastr.scss'
        ],
        dist: './dist',
        mainJs: './src/main.js'
    }
}
The following two tasks define how gulp serves the build after compilation.
// Start local development server
gulp.task('connect', function() {
    connect.server({
        root: ['dist'],
        port: config.port,
        base: config.devBaseUrl,
        livereload: true
    });
});

gulp.task('open', ['connect'], function() {
    gulp.src('dist/index.html')
        .pipe(open({ uri: config.devBaseUrl + ':' + config.port + '/' }));
});
Html Task – The following task tells gulp to copy the source HTML files to the destination and watch for changes during execution. If a source file changes, this task is called automatically and the browser is refreshed through the "livereload" plugin. The livereload plugin is available for Chrome as an extension.
gulp.task('html', function() {
    gulp.src(config.paths.html)
        .pipe(gulp.dest(config.paths.dist))
        .pipe(connect.reload())
        .pipe(livereload());
});
JS Task – The following task transforms the React JSX to JavaScript and bundles all JavaScript into bundle.js. It then copies bundle.js to the "dist/scripts" folder, and reloads the files when changes occur.
gulp.task('js', function() {
    browserify(config.paths.mainJs)
        .transform(reactify)
        .bundle()
        .on('error', console.error.bind(console))
        .pipe(source('bundle.js'))
        .pipe(gulp.dest(config.paths.dist + '/scripts'))
        .pipe(connect.reload())
        .pipe(livereload());
});
CSS Task – Bundles the CSS file(s) into bundle.css and copies it to the "dist/css" folder.
gulp.task('css', function() {
    gulp.src(config.paths.css)
        .pipe(concat('bundle.css'))
        .pipe(gulp.dest(config.paths.dist + '/css'))
        .pipe(connect.reload())
        .pipe(livereload());
});
Images Task – Copies source images to the “dist/images” folder.
// Migrates images to dist folder
// Note that I could even optimize my images here
gulp.task('images', function () {
    gulp.src(config.paths.images)
        .pipe(gulp.dest(config.paths.dist + '/images'))
        .pipe(connect.reload());

    // publish favicon
    gulp.src('./src/favicon.ico')
        .pipe(gulp.dest(config.paths.dist));
});
Lint Task – This task has gulp verify code quality according to the rules defined in eslint.config.json.
gulp.task('lint', function() {
    return gulp.src(config.paths.js)
        .pipe(lint({ config: 'eslint.config.json' }))
        .pipe(lint.format());
});
Watch Task – Once the build is running on the dev server, this task tells gulp to keep it hot and watch for changes. When changes are made, gulp executes the tasks to recompile and refresh the content.
gulp.task('watch', function() {
    livereload.listen();
    gulp.watch(config.paths.html, ['html']);
    gulp.watch(config.paths.js, ['js', 'lint']);
});
Combined Default Task – The "default" task is the entry point for gulp; it calls all the sub-tasks in the sequence defined.
gulp.task('default', ['html', 'js', 'css', 'images', 'lint', 'open', 'watch']);

Eslint.config.json

Gulp uses this file to provide verbose information about code quality during compilation and build.
{ "root": true, "ecmaFeatures": { "jsx": true }, "env": { "browser": true, "node": true, "jquery": true }, "rules": { "quotes": 0, "no-trailing-spaces": 0, "eol-last": 0, "no-unused-vars": 0, "no-underscore-dangle": 0, "no-alert": 0, "no-lone-blocks": 0 }, "globals": { "jQuery": true, "$": true } }

The Hello World of ReactJS

Index.html – The SPA (Single Page Application)

main.js – This is the entry point of ReactJS

homepage.js – A react component

Where to go from here !!

Now you can jump-start your work, concentrating on ReactJS and its features.

Enjoy!!
Manish


 
  Posted in:  Front-End




Scrum is an easy process to understand but not as easy to do well. It has many facets which, if overlooked, result in the failure of the scrum team and erosion of the agile process. Scrum, done correctly, leads to a "YaY" team that is small and focused on velocity, and an environment that is cooperative, with the fire and willingness to succeed. But it requires a good understanding of how those facets affect the scrum team and the sprint. Rather than discussing the scrum process, I would like to focus on the parameters that make a YAY scrum team.

The Team Dynamics

Teams come in various shapes and sizes. Skills, work styles and experience vary extensively, combined with behavioral variations as well. Some developers like collaborating while some like to work alone. Some like to dive in and some like to investigate and prepare before diving in.
As simple as it sounds, this is still a complex variable of sprints and iterations. The first and most important step towards a successful scrum team is to understand the team mix itself. The team's size, expertise, specialization and ownership are very important to understand before planning any sprint.

What really matters is how the team is organized around the feature or product. Do they have everything in place to be able to deliver a working, tested increment of the software? Do they fully understand the user stories? Do they understand their business value? Can they estimate them? Do they fully understand the backlog? Do they have the necessary resources to commit?

The team is basically an accountability structure from top to bottom. If the structure does not have everything required, then the team cannot be held accountable. When a team doesn't have everything necessary to be successful, it erodes the sense of urgency around getting things done. Velocity destabilizes and no one cares. No one is accountable for making progress on the product, and the team becomes a victim.

The bottom line is that if you cannot create the right mix for the scrum team, you are en route to a dead end filled with some nasty surprises. Better to find some other innovative way of developing software than following agile and scrum. It's that important.

The Project Increment

Like the team, the project also comes in various shapes and sizes. It sounds simple, but observed carefully and in depth, a project requires many things to be done before it is demoed to the end user: planning, research, design, coding, reviewing, testing, staging, deployment, documentation and much more. All parts of the deliverables should be broken down into ready stories.

The team decides what increment they can produce in the sprint, and the final deliverables of a sprint should not have any technical debt, pending feedback, pending testing, pending UAT or pending documentation. Anything deferred adds an indeterminate amount of work to the back end of the project. What matters in the end is that the team has a measurable chunk of the sprint that can be neatly tracked as ratios and timelines.

Backlogs

The essence of a successful sprint and its deliverables starts with the backlog. A poorly formed backlog is almost every time the root cause of almost every problem in failing sprints. A poorly articulated backlog is the outcome of poor sprint planning meetings. If the backlog isn't clear, the team comes to the sprint planning meeting and spends all its time on "what" needs to be done and not enough time on "how" it can be done. This is visible when the team is playing poker, planning and prioritizing around "what" rather than "how".
Having a backlog that is well captured, well defined, broken into small chunks of tasks and easy for the team to understand and estimate increases the chance of a successful sprint manyfold. The team feels greater control over the tasks. It also helps the product owners determine what value addition can be committed for the next sprint release.

Ownership

Because the team and the project both vary a lot in shape and size, the concept of ownership of user stories plays a very innovative and critical role. The owners are the subject-matter experts and seniors who bring more experience and a higher-quality breakdown of the story into backlog items. The owners of user stories bring a more holistic view, accuracy, competency and urgency to the sprint. The estimates are also more grounded and knowledge-based. Asking a database administrator to own a QA user story or test plan and break it into tasks is like bringing in your grandmother to estimate a rocket engine.
Besides, the owners also bring a sense of commitment and achievement that comes with ownership. The smaller the team wearing many hats, the better the ownership works. However, the owners should be capable experts in those fields.

User Stories

User stories are the crux of the backlog and of the increments in a sprint. They are the building blocks of a successful sprint. A very interesting concept I came across is Bill Wake's INVEST criteria, which relates user stories to independent investment decisions, where each letter in the word INVEST stands for:

I – Independent – User stories must be independent of each other and self-contained.
N – Negotiable – User stories, up until they are part of an iteration, can always be changed and rewritten.
V – Valuable – A user story must deliver some value to the end user.
E – Estimable – One must be able to estimate the size of a user story.
S – Small – A big user story is simply impossible to plan or visualize.
T – Testable – A good user story must have a clear test plan with simple use cases.

A typical format of a user story is "As a (role) I want (to do something) so that (business value)".

As simple as it looks, a user story can still turn out to be not valuable, excessive or an odyssey when misunderstood and misinterpreted. This always leads the team to get sidetracked and accomplish nothing. Here are some examples of user stories.

Good: As an admin I want to deactivate another user’s account, so that they can no longer log in.

This user story clearly describes role, what needs to be done and business value. It is self-contained, easy to understand, estimate and test. In short it follows all the INVEST aspects.

Not valuable: Send notifications when a contact is created.

The drawback is that it does not state who or what is sending the notification, nor in what form the notification is sent, e.g. email, Twitter, SMS or something else. The description does not include any business-value information. It is hard to estimate, and a tester on the team won't be able to describe what needs to be tested here. Asking the team to play poker on it and estimate it is like asking a five-year-old to describe the molecular structure of rocket fuel.

Excessive: As a salesman, I want to save my customers list, so that I can create a copy, email and print later on.

The drawback here is that the user story is not clear about whether saving the list is what's required, or whether the print, copy and email features are what's needed. There is excessive information in this user story, and that leads to misunderstanding and sidetracking. It requires rework: breaking it down or reframing it into more well-defined user stories.

Odyssey: As John the customer, I want to register for an event, so that I can secure my place.

There is nothing wrong in having epic or odyssey user stories, as long as one understands that they must be broken down into ready stories that are self-contained and based on the INVEST criteria.

Each user story in the backlog and sprint increment must meet the INVEST criteria for the team to be able to commit to a working, tested increment of the software. The success of a sprint depends on how well the user stories meet the INVEST criteria; poker, burndown charts, retrospection, sprint planning and all the other exercises around the sprint are futile otherwise.

Process Overlaps

Many companies who call themselves agile are not really agile. They still haven't come off the waterfall model, which ensures more control and tight iterations over sprints. There is a big difference between agile and waterfall, and mixing them is too confusing and too uncomfortable. Trying to impose control over cooperation becomes a major challenge. Architecture and design documents getting finalized before a single task happens to build the product is simply not agile; it makes the process too complicated and rigid. This doesn't work with scrum. If you are using scrum, your project needs to adhere to the agile mindset. Ownership, communication, cooperation, repeated discussions and elaboration are the keys to success. This helps in creating the "YAY" team effect.

Some even try to introduce Kanban outside the scrum process. This creates a distraction and cuts the modules and members off from the rest of the team: a bad practice that develops a feeling of job and duty rather than achievement and urgency.

Culture & Shared Understanding

Just following scrum and its top 10 tips does not mean that the scrum will succeed. There has to be a shared understanding, and it has to be organization-wide. The sprint commitments and risk factors should be well understood by all team members. The culture of the company also plays an important role in how the team works. In many cases it is a culture of control and fear that leads to a non-YAY team effect. A non-YAY team is like a set of robots taking orders.

There is a big gap between knowing scrum and practicing it correctly. Without a correct agile mindset there is always a tussle between the waterfall and agile ways. Many keep struggling and sometimes give up the scrum process, or land in a mix of processes that is too uncomfortable and ends up no different from the same old traditional ways. It's an old saying that people do smart things when they genuinely "want" to do them. A good culture fills that gap and inspires people to do things without being gripped by fear.

Sprint Planning Dynamics

Another effect that works against a good scrum team is influencing the planning phase to squeeze time and pressuring members to commit to the squeeze. It should be the owner of the user story (not to be confused with the product owner) and the team who estimate, work and retrospect to improve the next sprint. How much each member deviated, and how that can be improved, must be part of the retrospective. Burndown charts and estimation should be a continuous process, done by the team, for the team, within the sprint iteration.

Influencing the team during an iteration or changing priorities in any way is like taking the commitment away from the team and onto one's own shoulders. The basic principle of agile and scrum is to let the team plan, commit, deliver and retrospect. Any kind of influence that makes the team change its commitments at the beginning or in the middle of an iteration results in a team that is always looking for directions and never values those commitments.

Motivation

Even in real life one needs to feel the value one has added to the task or team one has contributed to. If not, one quickly gets bored and tries to escape from it as much as possible. It is important for the leads and product owner to feed the motivation and remove any fear or shyness from team members. It is essential for a team to succeed collectively on what they committed to, and to see it through to completion. Seeing it through to completion, with a reward or appreciation for a job well done, puts the team into a high-motivation zone.

You need the "YAY team" effect before you can see the velocity being met or, even better, the team beating old records and showing signs of urgency. Small wins and achievements strengthen the bonds of shared understanding and build an environment that resembles a truly successful scrum team: gaining on velocity, shedding debt and beating their own past records.

As always, the only way to get something awesome done by a team is when there is a genuine willingness to do it. Thus, having a motivated and inspired team is essential for a successful scrum team – a YAY team!

Signs of failing Scrum Team

  • In sprint planning, the team discusses "what" and not "how" they will complete the sprint.
  • User stories do not follow Bill Wake's INVEST criteria.
  • Poker and complexity numbers are arbitrary.
  • You're having a hard time getting full attendance at stand-ups, or attendance is forced.
  • Additional processes keep being added in trying to make the sprint successful.
  • The team is being victimized as incapable or dumb.
  • Software fails after being tested.

Summary

All the aspects described here are interrelated. They all must be explored and worked on to create the YAY team effect. If you have most of these right, you are struggling but en route. If not, most likely you are trying to control the sprints with multiple processes and still have no clue why it's failing. You are slipping into the waterfall model while calling it agile, and your team is simply being victimized and labeled as incapable of delivering.

 
  Posted in:  Agile