Mankind is not a simple feature vector

I'm not going to talk about politics, but I do want to talk about an article I found on one of my many feeds. To summarize it extremely briefly: the author was baffled by the election outcome, by how poorly the pollsters' statistics held up, and in general by how none of this was predicted with machine learning and the like.
Anyone who knows me knows that I hate the term "Artificial Intelligence." I remember an industry expert who more accurately described it as "slightly less dumb." I have written a number of other articles expressing how far we are from anything that is truly intelligent by almost any definition. Tools like Watson and DeepMind are impressive, highly refined tools that have been fed incredibly large data sets so that they could either be trained or train themselves.
It is not at all a surprise that predictions were so far off. When it comes to decision making, we like to think people choose option A because they align themselves with its principles, philosophies, and ideas. That is, at least, what we wish people did. In reality, the things that are not about the "issues" matter a great deal more than people may think. People are not Vulcans; they do not weigh every possible outcome, or the long-term versus short-term benefits, for every action they take, or even most of them.
People are emotional and erratic creatures. People are bigots, racists, or so open minded that they have no actual opinions. We live in a world where people must be politically correct and are shunned for deviation. Money and power go hand in hand and provide ample leverage in many scenarios. Corruption is not a Republican or Democratic phenomenon; it is a part of humanity. Our capitalistic mindset is one of the greatest things that make us a unique and great nation. Regulation against monopolies is a way to ensure that the gross power a company may possess only reaches so far.
The leverage a huge conglomerate has over the government is tremendous. A true system of checks and balances, one that preserves both the capitalistic nature of commerce and the freedom of a nation to govern itself, is not an easy task in any way. It seems evident that after more than 240 years it is still far from perfect.
What we need to realize is that a machine model, and the statistics that are collected, only account for the features and dimensions that were taken into account. Capturing the nuances of what a single person wants, let alone a nation of millions, is simply impossible without first grasping the human factor: the incredible thing that defines each and every person who lives on this planet.
I hope that our understanding of ourselves as well as our capacity to teach our machine models improves so that we can better harness and sustain the wonderful qualities of mankind.

Functional programming in the enterprise world

Recently I have been heavily using Apache Spark. For those of you who don't know, Spark is a very powerful system for working with data in parallel, written in the Scala language. Scala is not new, but it is certainly on the "newer" end of the spectrum; then again, new languages come out all the time, so twelve years is fairly long. What many people find attractive about Scala, and I certainly do, is the fact that it runs on the popular JVM. In fact, a developer can write code in Java that interacts with code written in Scala. The challenge is striking the balance between closeness to Java and still providing whatever it is that the creators of the language hope to achieve.

I am a big believer in Object Oriented software design and development. I'm not saying that every project in the world needs to be written exactly the same way, as they all have different requirements. I will say that for enterprise software the necessary level of adaptability is truly best achieved with modularity as well as abstraction. In truth, if a technology can deliver the principles an enterprise architect looks for, it may be something worth considering. In the past, functional languages such as LISP and Haskell were never geared towards the enterprise, as their mathematical background (and, in LISP's case, dynamic typing) didn't fit the bill for the kind of type-safe compilation and code reuse found in other enterprise technologies. In general I like languages that aim for simplicity but at the same time aren't overly opinionated. I recently read a rant about the Go language lacking support for assertions because the language creators felt people used assertions incorrectly, much like Java chose to avoid pointers because of their inherent tendency to cause errors. This is an example of a language being dumbed down or muted; with that said, I really like many aspects of Go.

I've read a number of articles by so-called "veteran" developers who have ditched OOP to embrace some sort of functional language, complaining that the design principles of OOP aren't applicable and don't work. I even read recently that a college professor at Carnegie Mellon removed OOP from the freshman syllabus. I don't necessarily think that is too awful, but I do think the problem decomposition one does when designing an OOP system is both helpful and useful. Not everything fits easily into a "map" or a "reduce." I can't speak for everyone, but I think that OOP is more natural to most domains than functional programming. If you truly understand the domain and how to break a problem apart into single-scoped entities, you will find simplicity and elegance.

With that being said, I think that for parallelism functional programming has always been faster and more efficient. I do, however, think that there is room to bridge the gap. Scala is a multi-paradigm language, not just a functional one, and I believe it is that aspect that can truly bring something special to the table. Technologies like Spark still require a more or less functional approach. There are layers like DataFrames and graphs that attempt to abstract some of the functional aspects away from the developer. What I haven't yet seen is the equivalent of what Hibernate and other ORM technologies did for SQL, applied to large-scale functional parallelism. I think once we bridge that gap we will have the holy grail for enterprise software. I look forward to seeing how these technologies continue to evolve and mature.
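To make the DataFrame point concrete, here is a minimal sketch using Spark 2.x's Java API: a declarative, almost ORM-like aggregation with no explicit map or reduce in sight, even though Spark still runs a functional pipeline underneath. The file path and column name are placeholders of my own, not from any real project.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class OrdersByCustomer {
    public static void main(String[] args) {
        // Spark 2.x entry point; the local master and CSV path are illustrative.
        SparkSession spark = SparkSession.builder()
                .appName("orders-by-customer")
                .master("local[*]")
                .getOrCreate();

        // Declarative, SQL/ORM-like style: the functional machinery is hidden behind the DataFrame API.
        Dataset<Row> orders = spark.read().option("header", "true").csv("orders.csv");
        Dataset<Row> perCustomer = orders.groupBy("customerId").count();
        perCustomer.show();

        spark.stop();
    }
}
```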

AWS and vendor lock-in

Right now AWS is the leader in cloud computing, without question. With Dell's recent acquisition of EMC, which happens to own roughly 80% of the eminent virtualization company VMware, one can imagine they have their sights set on competing for a piece of the action as well. AWS is so popular that companies such as Rackspace, which used to be a real competitor to AWS, now offer premier managed hosting of AWS. That is the kind of smart attitude shift that companies such as Microsoft have also made. Under the leadership of Satya Nadella, Microsoft has been making many wise moves, all recognizing that it needs to play nicely with other companies and that big bad Redmond isn't the only player in the ecosystem anymore. With that said, most companies I speak to still treat AWS as synonymous with the cloud, or at least as the de facto cloud provider. There isn't anything wrong with that, as long as you recognize what that really means.

Think back to the days when companies such as Oracle actually innovated and dominated the large-scale database space. In the realm of commercial databases Oracle is still a big player, although technologies rooted in open source are becoming more and more commonplace. Back before ORMs, and before "cloud" was a buzzword, the database was a huge, monstrous construct that dominated the stack. Nothing was distributed and hardly anything was clustered. Databases had a single master with some read-only slaves serving stale data. The single master was a bottleneck for writes; you hoped it was enough, because the options that exist today simply weren't there. Lots of database providers offered their own solutions, specific to their APIs. Your DBAs and architects would caution you against vendor lock-in: implement your system around vendor-specific features, they warned, and you will find yourselves unable to break free from that vendor's grasp. Now, I have always said that you should take advantage of vendor-specific features, as long as you architect them behind a standard interface so that swapping in a new vendor's implementation is more or less lift and shift. That way you can take advantage of a vendor's approach without sacrificing the modularity of your code base.

This notion of vendor lock-in is not only applicable to databases; in my mind it is even more important with hosting solutions such as AWS. AWS isn't a software company, it is a service company. A handful of the services it offers were born and bred in-house, but most are open source technologies to which AWS has applied tried-and-true practices and mixed in its own automation and high availability. AWS is different from any hosting company before it because it provides specialized solutions, not just raw hosting. Other companies can and will compete and offer similar, if not better, products. I anticipate that you will start seeing a great deal more specialization. Take IBM, whose cloud sees fairly minimal large-scale utilization; it does, however, have a very impressive suite of APIs geared towards machine learning (https://www.ibm.com/watson/developercloud/services-catalog.html). Both Google and AWS have some minimal machine learning offerings, but nothing that compares to the versatile toolkit Watson offers. I'm not telling you that IBM is reliable, performant, or any good at all. I am saying that in addition to its boring Bluemix hosting it is being innovative with its services.

The bottom line is to remember that AWS is only a single vendor. Just like all markets, there will be growth and competition. S3 may be a fairly standard key-value system, but its API and domain-level approach may be very different from that of the next leader in the cloud industry. Learn from experience: invest your time in designing your systems to be capable of switching cloud partners without a complete rewrite. It's okay to use vendor-specific features, as long as you develop them in a way that can easily accommodate a change.
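As a small illustration of what I mean by binding a standard interface to a vendor implementation, here is a hypothetical Java sketch. The ObjectStore interface and class names are made up, and the S3 implementation assumes the AWS SDK for Java (v1) is on the classpath; the point is only that callers depend on the interface, so switching providers means writing another implementation rather than rewriting the code base.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// The rest of the code base depends only on this interface.
interface ObjectStore {
    void put(String key, String content);
    String get(String key);
}

// Vendor-specific implementation; the S3 calls stay hidden behind the interface.
class S3ObjectStore implements ObjectStore {
    private final AmazonS3 s3 = AmazonS3ClientBuilder.standard().build();
    private final String bucket;

    S3ObjectStore(String bucket) {
        this.bucket = bucket;
    }

    @Override
    public void put(String key, String content) {
        s3.putObject(bucket, key, content);           // vendor-specific feature, isolated here
    }

    @Override
    public String get(String key) {
        return s3.getObjectAsString(bucket, key);
    }
}
```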

Cats & Dogs: An intelligent look at AI

I am not a data scientist or an expert in machine learning. However, I strongly believe that modern approaches to machine learning are neither "intelligent" nor "learning." I am not the first person to point this out, but perhaps I can add a novel angle and some additional insight. An infant who has seen a dog and a cat a few times would likely be able to point to the correct animal when asked which one is the dog. The same task takes a machine an incredible number of samples to "learn" which is which. You see clear examples of just how unintelligent these systems are in big mistakes like Google's image recognition identifying people with darker skin as gorillas; to suppress that result they had to hard-code an exception. That is not the same as a child who runs across the street being scolded by a parent. When a child is scolded, the hope is that they understand the severe danger they put themselves in and therefore take additional caution. A neural network makes no such distinction; the exception is merely a directive with a higher priority, since the network really doesn't "know" what it is looking at.

Others have discussed this topic in passing (here, and here, as well as here) but I haven't seen an example of human-like learning. Let's take a look. After using Wikipedia as a simple reference point to trace the evolution of computer vision, things became a great deal clearer. This discussion is not limited to vision, but for the moment let's use it as our example.

Let’s separate visual comprehension (such as being able to look at a picture of a dog or a cat and correctly classify it) into two components:

  1. characteristics or traits
  2. colors, textures, and depth.

Today many facial recognition systems use measurements of parts of the face to uniquely distinguish one person from another. For a simpler case, let's go back to the dog and the cat. I think that if you enumerate the characteristics of a dog, ordered from most important to least important, you will end up with a fairly similar list for a cat. As you may recall, Google once had a little game for improving its image search, and one of the things it did was tell you which keywords you could not use. In the same spirit, take your dog and cat lists and make a second list for each containing only the traits exclusive to that animal. Those exclusive lists are the identifiers that would help you distinguish a dog from a cat. These factors may not be the most important traits of the respective animal, but they are unique to it.

If we separated the mechanism that identifies those traits from the traits themselves, we would begin to more closely resemble something I would be comfortable calling intelligence. What we lack today is an instinctive, low-level capacity for self-learning. Vision is a single domain, and a very complex one; the point is that we need to work on building the base intelligence rather than domain-specific intelligence. Natural language processing is awful; even the most advanced systems out there are very easy to trick. I had great fun playing with the publicly available Watson APIs, one of which attempted to identify the "tone" of a document. That is a very tricky task, and one that was quite easily broken. It was thinking like a robot: if I can identify "positive" adjectives or look for "negative" words, assumptions can be made and a general tone inferred. Of course, with a little imagination I can use very beautiful and poetic imagery to describe some very dark stuff! I pretended to be a cannibal writing a letter to someone he wanted as his next meal. It was fun, but a real reassurance that this isn't intelligence of any sort. It's an improved Webster's dictionary.

What is the new thing that is trying to embrace the world? The virtual assistant. Siri sucks; it's Apple, what do you expect? Google has gotten better. I've read from a number of sources that Hound is supposed to be the next evolution of the VA. One of the major things I used to do was ask compound questions, like "What is the tallest building in the world?"…got that answer, "in Dubai." Then I ask "What is the population of Dubai?", an easy one as well. "What is the population of the tallest building in the world?"…ehhh, nope. What these assistants can do, which is an improvement, is remember context. I can ask what the tallest building in the world is, then ask a follow-up question: what is the population there? It understands that the pronoun references the answer to the previous question. This is not about a wealth of information, and it isn't about natural language processing either. It's about a much more "intelligent" vehicle driving these basic processes.

I hope to discuss what this might look like soon. Until then, I have little doubt that the small steps we take in supervised or unsupervised learning are quite literally teaching the dumb. Our problem isn't the method of "learning"…it's our student.

Big Data: Low Latency Solutions

The powerhouses of the past were gigantic mainframes built for high-yield computing tasks. The notion of distributed computing extends far beyond the "cloud" frontier. I remember running SETI@home as well as Folding@home in the early 2000s. The earliest known project I could find was GIMPS, which dates back to 1997. Suffice it to say, there have been advancements in distributed computing. Unfortunately, the general problems from then are still very much a reality in the present. Applications are forced to choose between processing throughput and latency. Powerful data processing appliances will answer your questions, but you will have to wait around for the computation to complete.

This has been the big problem that we knew needed an answer. Depending on the industry, there has been a need to perform computationally intensive tasks, like finding prime numbers, that do not necessarily involve large data. On the flip side, we have financial records, which accumulate to a fairly large size but involve much simpler computation. We are still striving to handle the computationally intensive side, with light at the end of the tunnel in the hope of a quantum computing solution. Until quantum computing becomes a reality and an economically feasible technology, we will have to seek alternative avenues.

Hadoop provided a well-known distributed file system (HDFS), tightly coupled with MapReduce's ability to load data from disk, process it, and store the results back to disk. HDFS was inspired by Google and its experience with big data. MapReduce is what I think of as the caveman's approach to data processing: raw and plentiful. It's fault tolerant, so hopefully it all gets done sometime, eventually. MapReduce on Hadoop is only as sophisticated as the underlying system and the jobs executing on it. There is no linkage between Hadoop jobs, so jobs need to be self-contained or connect to external data sources, which adds latency. With complete isolation between map tasks, the reduction at the end is meant to bridge the gap and calculate the final results. The design paradigm works, but there is a lot of room for improvement.
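To show how raw the model is, here is the canonical word-count job sketched against the Hadoop MapReduce API: each map task sees only its own input split, in complete isolation, and only the reduce step pulls the partial results together. The input and output paths are illustrative.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    // Each mapper works on its own split, isolated from every other mapper.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                word.set(token);
                context.write(word, ONE);
            }
        }
    }

    // The reducer bridges the gap, combining the isolated partial counts into a final result.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("hdfs:///input"));      // illustrative paths
        FileOutputFormat.setOutputPath(job, new Path("hdfs:///output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);                  // results land back on disk
    }
}
```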

While this certainly works, time and experience have led experts to much more "intelligent" approaches that remove much of the redundancy and attempt to limit duplication of effort as much as possible. If you were thinking of Spark, you are right. Spark was created as a distributed compute system to maximize throughput and minimize redundancy at all costs. Spark is very fast in the areas where Hadoop was sluggish and simplistic. Spark has multiple libraries that run on top of it: SQL, GraphX, and MLlib. Spark also supports stream processing; we'll come back to that soon. Ultimately it takes the same approach as Hadoop…just smarter. It will be faster, but still not what you need for OLTP.
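For contrast with the Hadoop job above, here is a hedged sketch of the same word count against Spark 2.x's Java API. The chain of transformations stays in memory, and the explicit cache() call keeps the result around for further queries instead of writing everything back to disk between steps; the paths are again illustrative.

```java
import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

public class SparkWordCount {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("spark-word-count").setMaster("local[*]"));

        JavaRDD<String> lines = sc.textFile("hdfs:///input");              // illustrative path
        JavaPairRDD<String, Integer> counts = lines
                .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator())
                .mapToPair(word -> new Tuple2<>(word, 1))
                .reduceByKey(Integer::sum);

        counts.cache();                        // keep the intermediate result in memory
        System.out.println(counts.count());    // first action triggers the computation
        counts.saveAsTextFile("hdfs:///output");
        sc.stop();
    }
}
```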

Did he just mention OLTP!? No, this guy is off his rocker…Hadoop isn't for OLTP, it's for batch processing. Come on, everyone knows that.

What are the major differences between an OLTP system and a batch processing system? An OLTP system caters to application consumption and low-latency usage; the number of active users is typically much higher and the scope of their computations is simpler. A batch processing system, by contrast, is geared towards low concurrent usage with complex computations. There is no question that these are very different tools addressing different users. My interest is in providing a way for the two systems to live together, able to harness the capacity for concurrent complex computations as well as parallel consumption at the user level. I am looking into possible solutions that may fill this void. So far https://www.citusdata.com is the closest thing I have found, and http://druid.io also seems to be a possible contender. I will review and discuss these technologies in future entries. Keep your eyes open for technologies that are going to stretch your perception of what a database is and what a data-warehousing tool is. I hope to see the lines blur between data storage, data processing, and real-time data analytics. The need for these sorts of advancements is there, and I think with the proper approach the technology isn't too far from our grasp.

Data Evolution to Revolution

I apologize in advance, as this is an attempt to solidify some ideas I have been having into something a tad more cohesive. Let's start small. I'm not sure if this is a chicken-and-egg sort of thing, but I think it just may be. Does the evolution of technology yield more data, and thus demand more complex and intricate mechanisms to capture, evaluate, utilize, and wield that data? Advances in technology bring about more data, more quickly and more accurately. Strides forward in technology enable greater insight into data that may have existed for years, maybe even decades or more. Most of these technologies merely build on top of their predecessors, taking things one step further and refining them. Occasionally there are novel ideas that truly disrupt the pace and direction, shattering preconceptions and misconstrued understandings.

We have grown very little in our "PC" age in terms of true advancement. We have sleeker, smaller, faster infrastructure. We have smart watches, phones, tablets, and more. We have the buzzwords "cloud" and "IoT" that people love to throw around. Anyone today can make an "app" and think that it is novel, that it will change the world. We are so far from true advancement at the computing level that the word "intelligence" hardly applies. I would not pretend to be an expert on anything to do with AI or machine learning. I do, however, know that we have neither the precision nor the speed to come remotely close to anything of substance. We are playing Go and Jeopardy, we are writing cookbooks, and more. True creativity is alien to our creations. We are doing nothing more than creating formidable copy-cats. Sure, a system may consume many different approaches to chess or some other topic, but ultimately it is plotting a path that will attempt to "beat" its opponent. I am not enough of a philosopher or scientist to evaluate the state of that level of comprehension. It is certainly complex and well structured, and it may beat a human. It is, however, a very far cry from the human intellect.

Now that I have gotten this far, I can say with a bit more clarity what I am trying to illustrate: computers and technology evolve constantly. Through their evolution they yield more and more data. New technologies are needed to understand and discover all of the depths of that data. Ultimately we are burrowing further into the data we initially acquired. Nothing new, only uncovered.

To teach a computer to act like a man you cannot, but you can empower a computer with the collected knowledge of a man and enable it to utilize that knowledge.

The inspiration for this rant is that I have been toying with the notion of creating a layer that abstracts data persistence away from the application developer. This layer won't understand your data itself, only how it relates to other data, the types of data involved, and the types of analysis performed on that data.

The primary difference between TCP and UDP is fairly simple and can illustrate the beginning of this abstraction: TCP has error checking and is used when order matters; if you care about speed and nothing else, UDP is appropriate. TCP is heavy, but reliable for data integrity, order, and consistency. I'm sure I'm not the first to draw a parallel to the traditional RDBMS, your SQL variation, versus the newer NoSQL variant. SQL is typically characterized as transactional, strictly schematized, relational, and challenging to scale horizontally, whereas the NoSQL variant is typically assumed to be non-transactional, schemaless, and easy to scale horizontally. There are of course newer variants, often dubbed NewSQL, which exhibit a hybrid approach, attempting to take the best qualities of the aforementioned technologies and provide a hell of a solution. There have been advances in technologies specific to very large file stores, JSON documents, and full-text-searchable data. Suffice it to say, there is no one-size-fits-all solution. Depending on the software needs, the business requirements, and many other considerations, there may be dozens of different factors that go into proposing a design. The DBA used to be the overlord with his stored procedures and locked-down tables, but not every problem can be solved with a stored procedure. With an ever-growing technological world and more and more data emerging, a strategy to stay afloat and ahead of the curve seems faint and distant. I think that with the right approach it can be done, and that it can change the nature of how we interact with data today.

Spring Data aims to provide a layer of abstraction when accessing data from different sources, varying from SQL sources to NoSQL variants, in-memory stores, and others. With Spring Data, the @Entity represents the top-level data we are dealing with, and the @Repository is the abstraction layer that interacts directly with whichever storage engine is configured. Spring Data can support many different types of storage engines, but ultimately the onus lies on the application architect to decide what the repository connects to. Imagine if there were a layer that determined how the defined data would be persisted, evolving over time.
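For readers who haven't used it, this is roughly what that entity/repository pair looks like with Spring Data JPA. The Sample entity and its fields are hypothetical, but the annotations and the derived-query convention are the standard Spring Data ones.

```java
import java.util.List;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import org.springframework.data.repository.CrudRepository;
import org.springframework.stereotype.Repository;

// The @Entity describes the top-level data we are dealing with.
@Entity
public class Sample {
    @Id
    @GeneratedValue
    private Long id;
    private String name;
    private String sha256;
    // getters and setters omitted for brevity
}

// The @Repository is the abstraction layer; Spring Data generates the implementation
// that talks to whichever storage engine the architect has configured (JPA here).
@Repository
interface SampleRepository extends CrudRepository<Sample, Long> {
    List<Sample> findByName(String name);   // derived query, no SQL written by hand
}
```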

The relationship between an FPGA and the processing it performs is similar to the relationship between this new layer and the data it will persist. With an FPGA, the gates are configured according to the task at hand and how the module was programmed to react under the given circumstances. Similarly, this new layer, which I am going to dub an "Adaptable Data Source Gateway," will utilize the different components at its disposal based on the data it receives and the configured priorities.

Here is a high-level overview of how this may be accomplished. The comparison to an FPGA only goes so far: an FPGA doesn't "automatically" change its design, rather it changes its design based on its programming. To add this functionality with regard to data, it will be necessary to maintain a list of possible storage engine types (engine interfaces) as well as implementations of those interfaces. We will also need rules to identify which data requirements are best suited to each interface. Ultimately we need a metadata system that allows developers to describe the data from afar, letting the system gradually grasp the data's needs.
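Here is a very rough Java sketch of the kind of contract I have in mind. Every name below is hypothetical and the "rules" are deliberately naive; the goal is only to show the three pieces together: an engine interface, competing implementations selected by declared traits, and a gateway driven by configured priorities.

```java
import java.util.List;
import java.util.Set;

// Metadata the developer declares about the data "from afar."
enum DataTrait { RELATIONAL, DOCUMENT, FULL_TEXT, LARGE_BINARY, TIME_SERIES }

// The engine interface: one of the storage engine types the gateway can choose from.
interface StorageEngine {
    boolean supports(Set<DataTrait> traits);     // the rule: which requirements this engine fits
    void store(String id, Object payload);
    Object load(String id);
}

// The gateway picks an implementation based on declared traits and configured priorities.
class AdaptableDataSourceGateway {
    private final List<StorageEngine> engines;   // ordered by priority

    AdaptableDataSourceGateway(List<StorageEngine> engines) {
        this.engines = engines;
    }

    StorageEngine select(Set<DataTrait> traits) {
        return engines.stream()
                .filter(engine -> engine.supports(traits))
                .findFirst()
                .orElseThrow(() -> new IllegalStateException("no engine fits " + traits));
    }

    void persist(String id, Object payload, Set<DataTrait> traits) {
        select(traits).store(id, payload);       // the caller never names a concrete engine
    }
}
```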

I have some ideas for the secret ingredients that will make this possible. I am going to toy with these ideas and do some rapid prototyping. I hope to write more soon about my findings. As always, any input is appreciated!

Crate.io – Part 1

So I started playing a bit with Crate.io while I waited for the power to come back on here at the office. Crate is built on a NoSQL data model with implicit typing and optimistic locking. What makes Crate unique is that it attempts to provide the aggregation functionality commonly found in an RDBMS, although its eventually consistent, non-transactional data model will feel very foreign. It has things like full-text search and geometric querying functionality, which is nice but nothing to write home about.

The data types it claims to handle add array and object types not found in traditional SQL systems. PostgreSQL does support these types, but they are not without their limitations. Crate's handling of JSON-type input seems a great deal more natural than my experience with PostgreSQL 9.4+, which was fairly awful.

Let’s create the following table:

Now we make an insert with this data:

Now if I attempt to insert another row with a data type that is different from the previous row, there is a problem:

yields:

Now, that may be logical: for the complex data types (objects and arrays), Crate detects the data types of the inputted values and creates a schema around those values. Other data storage engines behave similarly, creating a schema on the fly. This makes querying faster and more efficient, especially when the underlying database is written in a much more type-rigid format.

Now this was an unexpected and annoying issue:

yields:

Crate's strongly typed Java roots may be apparent from these few limitations. Mixing types within an array is not an uncommon convention in JSON. As for the rigidity of the schema, I imagine that for performance reasons Crate detects the inputted types up front and creates a schema to adhere to. This is not uncommon for NoSQL databases that attempt what I've seen called gradual typing.

Crate supports blob types, like MySQL and others. Blobs are supposed to allow for binary storage of data. Crate doesn’t point out any clear limitation.

I want to quickly summarize my findings so far. I have not reviewed Crate for performance, high availability, reliability, or many other things for that matter. My initial focus was evaluating its support for JSON and a "dynamic schema" while using a SQL variant. It is a non-transactional system that utilizes optimistic locking. If you want to store JSON that has a "fuzzy" or hybrid schema, you may run into problems: Crate locks into its perceived schema based on the inputted data. If your JSON is consistent and you want to support database-side aggregation (the way it should be), Crate may be for you.

Bottom line: Crate looks like a promising solution for dealing with data whose schema has limited fluidity. It has many of the features you would expect from an RDBMS with the scalability of the newer NoSQL variants. It warrants further investigation and looks "optimistic."

SCP, SMB and more

Recently I was asked to add additional connection protocols as a means to submit samples to be analyzed by our automated forensic analysis platform. We had our UI, a REST API, and multiple REST clients (C#, Java, and Python). These are all standard, but all require integration or manual intervention. Protocols like FTP, SFTP, SCP, SMB, and many others are used to transfer files every day; they have commercially available clients, and many operating systems have built-in support as well. My challenge was providing a smart, powerful, and flexible integration.

I knew that there are several open servers available to receive files. In order to integrate into our architecture, something would need to consume the uploaded files. Additionally, we would have to handle the authentication and authorization aspects necessary to upload to the server. We can't just create users for every system and have it read directories, as that is clunky, error prone, and not scalable. I opted for an alternate approach.

Our core components are mostly written in Java, so I was looking for a solution that would integrate directly or by means of JNI. I began with SCP and immediately found the Apache project MINA. It provides a complete Java SSHD/SFTP/SCP solution from soup to nuts. SCP/SFTP normally expects uploads to be written to disk, but with a little ingenuity I was able to cut out that step and stream directly into our system without ever writing to disk. We were already using Spring Security for our authentication. While I wasn't going to take the time to extend Spring Security to handle the SSH protocol, I did utilize the ThreadLocal security context, SecurityContextHolder. This connected the authentication mechanism that MINA provides with the data transfer, identifying the user from the security context that was set up. It enabled me to keep using the rest of the application I had already secured; the rest of the system thought the request came from HTTP, or didn't really care. Ideally I would extend some interfaces in Spring Security and actually bind the protocol, but that would be a nice add-on to recommend to the Spring Integration team, which already supports SFTP. Click here to view the gist for the SCP integration.

Some of this code just extends the ScpHelper and the ScpCommand. This provided an easy way to access my existing authentication service and set up the security context.
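The gist linked above has the real code; as a rough idea of the Spring Security side, the sketch below shows the general pattern of authenticating the credentials handed over by the SSH layer and planting the result in the thread-local SecurityContextHolder, so that downstream code behaves as if the request arrived over HTTP. The class name and the way MINA calls into it are my own assumptions, not the project's actual wiring.

```java
import org.springframework.security.authentication.AuthenticationManager;
import org.springframework.security.authentication.UsernamePasswordAuthenticationToken;
import org.springframework.security.core.Authentication;
import org.springframework.security.core.AuthenticationException;
import org.springframework.security.core.context.SecurityContextHolder;

public class ScpAuthenticationBridge {

    private final AuthenticationManager authenticationManager;   // the existing Spring Security setup

    public ScpAuthenticationBridge(AuthenticationManager authenticationManager) {
        this.authenticationManager = authenticationManager;
    }

    // Called from the SSH/SCP layer once MINA hands us the username and password.
    public boolean authenticate(String username, String password) {
        try {
            Authentication result = authenticationManager.authenticate(
                    new UsernamePasswordAuthenticationToken(username, password));
            // Bind the authenticated user to the current thread; the rest of the
            // application reads it from SecurityContextHolder exactly as it does for HTTP.
            SecurityContextHolder.getContext().setAuthentication(result);
            return result.isAuthenticated();
        } catch (AuthenticationException e) {
            SecurityContextHolder.clearContext();
            return false;
        }
    }
}
```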

SCP was out of the way, but SMB was the more challenging integration. I didn't find nearly as much on the topic, and there are a lot more complications to handle. SCP/SFTP is safeguarded by SSH, which, much like the TLS that secures HTTPS, uses public/private key encryption; after the initial handshake, all data sent over the wire is encrypted. This simplifies authentication compared to many other protocols out there. In my naivety, I was hoping I would be able to reuse the stored credentials, already encrypted and protected, that we access via Spring Security. Instead, SMB usually sends hashed credentials that are compared against credentials the server has already obtained. This is a typical challenge/response practice, passing challenge data so as to conceal the secret and prevent tampering. I ended up having to store the MD4 hash digest of the user's password so it could be compared against the hashed password provided by the client.

The library I used, JLAN, was developed by Alfresco. It is on the older side and scarcely maintained. Sadly, its documentation was only slightly better than you'd expect from a JBoss product: there is a developer's guide and an installation guide. For general usage that may be fine, but for what I was planning it was a tad more challenging. Some software engineers try to protect their code from future misuse by marking anything and everything final, making things very rigid and hard to extend; they give you access to only a small selection of methods and may not even document those well. I wanted a way to hook in the UserAuthenticationService we used in the SCP service. My challenge was that the SecurityConfigSection only lets you specify the UsersInterface by providing the String name of the class that implements the interface. That class is instantiated and made completely inaccessible thereafter, which made accessing my Spring-managed bean difficult to nearly impossible. Usually I would have @ComponentScan pick up the class and either @Autowired the interface or use the BeanFactory to retrieve it dynamically. I came up with a simple but really nice approach to handle this.

In my @Configuration class I set the BeanFactory in an enum, making it statically accessible. Thus, even our annoying UsersInterface implementation can take advantage of our managed beans without having to deal with any of the final mess.
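The trick looks roughly like this; the names are mine, not Alfresco's, and the UserAuthenticationService is just a stand-in for the real service bean. An enum singleton holds the BeanFactory, the @Configuration class populates it, and the class that JLAN instantiates reflectively can then look up whatever Spring-managed beans it needs.

```java
import org.springframework.beans.factory.BeanFactory;
import org.springframework.beans.factory.BeanFactoryAware;
import org.springframework.context.annotation.Configuration;

// Enum singleton that makes the BeanFactory statically reachable.
enum SpringContextHolder {
    INSTANCE;

    private volatile BeanFactory beanFactory;

    void setBeanFactory(BeanFactory beanFactory) {
        this.beanFactory = beanFactory;
    }

    <T> T getBean(Class<T> type) {
        return beanFactory.getBean(type);
    }
}

// The @Configuration class receives the BeanFactory from Spring and stores it in the enum.
@Configuration
class FileServerConfig implements BeanFactoryAware {
    @Override
    public void setBeanFactory(BeanFactory beanFactory) {
        SpringContextHolder.INSTANCE.setBeanFactory(beanFactory);
    }
}

// Stand-in for the real authentication service bean used elsewhere in the application.
interface UserAuthenticationService {
    boolean authenticate(String username, String password);
}

// Instantiated by JLAN from a class-name String, yet still able to reach managed beans.
class SpringAwareUsers {
    private final UserAuthenticationService authService =
            SpringContextHolder.INSTANCE.getBean(UserAuthenticationService.class);
}
```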

After I worked out the Spring bean issue, I still had to deal with the frustrations of learning the ins and outs of the SMB protocol. This approach likewise allows for a transfer that requires no writes to disk, with authentication flowing through the existing system. Look here for a gist of the general approach. If you really want to use the JLAN library, be aware that Alfresco sells an enterprise license and probably supports a great many more options with it.

Here is a snippet of what you will want to add to a POM (maven) to play around with these code samples.

More to come!

Nested Workflows with jBPM

I am working on a project where we are utilizing BPMN for authoring and controlling the processing of analyses. Any single analysis task may yield several descendants, much like a .ZIP file has many child files. Additionally, many analyzers yield additional analyses for both the inputted artifact and additional artifacts. We had been treating each and every child artifact, as well as each child workflow, as a completely separate entity.

This hurt us for a number of reasons. Artifacts with a very large number of children, like large APK files, would clog up our system and prevent other users from utilizing it until that processing was complete. Additionally, it was never evident when the total analysis of the initial artifact was completed, which made the final analysis of a top-level artifact inaccurate and misleading whenever a descendant's analysis affected it.

I was tasked with fixing this, both to help with the system utilization issue and to be able to accurately determine when a top-level workflow is indeed complete. I theorized that we were in fact under-utilizing jBPM and that it was the proper way to handle this task. Initially I used the forEach block to iterate through all the new work orders. Instead of invoking the worker directly, I recursively called a new handler I created. As each process completed it returned its child workflows, allowing jBPM to further invoke all child workflows. This worked really well, causing all child workflows to finish before their parent was considered complete.

Unfortunately, after some testing this proved to be a disappointment with respect to performance. The forEach loop is blocking and single threaded, which means each child workflow had to wait for its sibling to complete. This was a tremendous under-utilization of resources and really slowed down performance. I had attempted to optimize in other areas, but this was the bottleneck. I quickly redesigned it, got rid of the forEach loop, and instead handled this by submitting Runnable tasks directly to my thread pool. I did have to track those tasks' completion, which was an added complexity, but it was well worth it. The end result yielded performance even faster than the initial non-nested version. I strongly recommend this approach for large-scale workflows on a jBPM engine. This project was using a slightly older version of jBPM (5.5.0.Final), but I think the design would still be useful with 6.x. I hope to post some sample code soon to better illustrate how to leverage this technique.

As for the other issue of clogging the system, we can now manually adjust how many "child workflows" consume the thread pool. In fact, I configured it so that once the thread pool became full, instead of queuing up the next child workflow, it was run serially. This was necessary because the child workflows determine when their parents are deemed complete. If child workflows were simply queued up, the parent workflow could end up in a deadlock and never complete. Forcing them to run serially is slower but ensures eventual completion.
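If you are using a plain java.util.concurrent pool, one way to get exactly that "run it serially when the pool is full" behavior is a bounded ThreadPoolExecutor with the CallerRunsPolicy rejection handler; when the pool and queue are saturated, the task runs in the submitting thread instead of waiting. The sizes below are arbitrary and the Runnable is a stand-in for invoking one child workflow.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class ChildWorkflowPool {
    public static void main(String[] args) throws InterruptedException {
        // Bounded pool and queue; when both are full, CallerRunsPolicy executes the task
        // in the submitting thread, i.e. the child workflow runs serially instead of queuing.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                4, 4,                                  // arbitrary fixed pool size
                0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(16),          // arbitrary queue capacity
                new ThreadPoolExecutor.CallerRunsPolicy());

        for (int i = 0; i < 100; i++) {
            final int child = i;
            pool.execute(() -> {
                // Stand-in for invoking one child workflow and waiting for it to complete.
                System.out.println("processing child workflow " + child);
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```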

To all those who understand and appreciate this, enjoy!

Securing your digital life: A brief guide

I want to begin by saying that I am not an authority on cyber security but am trying to compile a guide of best practices to secure your digital life.

This guide is a practical approach as opposed to a list of impossibly complex things that your average Joe couldn’t or wouldn’t do. I’m not going to claim it’s foolproof but I will say that it’s easy enough that I don’t get too inconvenienced while it provides a reasonable security blanket to my digital life.

The first thing you will want to do is purchase a Yubikey (https://www.yubico.com/products/yubikey-hardware/yubikey4/). There are a number of different vendors of U2F (https://en.wikipedia.org/wiki/Universal_2nd_Factor) devices, but the Yubikey 4 supports a number of different protocols that we will take advantage of.

I’m going to tailor this around using Gmail as your email provider and LastPass (https://lastpass.com/) as your password manager and using Authy (https://www.authy.com/) as your two-factor authentication manager. If you choose to use different services they may not support all of the actions described here.

I use Windows 10, Linux, Mac OS (not by choice…company issued), and Android with this setup. I don’t own an iOS device but I do not anticipate any compatibility issues there.

Everything starts with safeguarding your email account. Most accounts you use on the Internet provide a forgot-password feature, which is a very serious vulnerability if you are not careful. The first thing you should do is create an email address only you know. Do not use it publicly and don't name it something that could even remotely be identified as your email address by a third party. The purpose of this is to avoid a single, publicly known link to all of your accounts. That email address stays out of plain sight, making it an unlikely target should your identity ever be attacked.

Both that email address and your publicly known email address will be locked down. Set up your accounts to use a two-factor authentication mechanism. There are many different types of two-factor authentication, each with pros and cons. One of the most common forms of TFA (two-factor authentication) is sending an SMS text message to your mobile device with a unique code for you to enter; variants replace the SMS with an automated phone call or a simple email. Time-synchronized codes have benefits over the simpler "send a unique code to xxxxx" approach. The difference is that when you set up your TFA you receive a special secret, and the unique codes are generated from that secret key, synchronized to the current time. The secret key is stored within an application that you use to generate the authentication code. There are also hardware fobs that provide the same functionality (http://www.emc.com/security/rsa-securid/rsa-securid-hardware-tokens.htm). We will soon see that the Yubikey also supports OTP (one-time password) and U2F, which really are the Swiss Army knife of account security.
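For the technically curious, a time-synchronized code is essentially an HMAC of the current 30-second window, keyed with that shared secret. Here is a minimal Java sketch of the idea; the secret below is a placeholder, and real authenticator apps follow RFC 6238 with a Base32-encoded key, so treat this as an illustration rather than a drop-in implementation.

```java
import java.nio.ByteBuffer;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class TotpSketch {
    // Derives a 6-digit, time-synchronized code from a shared secret (TOTP-style).
    static String code(byte[] secret, long unixSeconds) throws Exception {
        long counter = unixSeconds / 30;                        // 30-second time step
        byte[] message = ByteBuffer.allocate(8).putLong(counter).array();
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(secret, "HmacSHA1"));
        byte[] hash = mac.doFinal(message);
        int offset = hash[hash.length - 1] & 0x0F;              // dynamic truncation
        int binary = ((hash[offset] & 0x7F) << 24)
                   | ((hash[offset + 1] & 0xFF) << 16)
                   | ((hash[offset + 2] & 0xFF) << 8)
                   |  (hash[offset + 3] & 0xFF);
        return String.format("%06d", binary % 1_000_000);
    }

    public static void main(String[] args) throws Exception {
        byte[] sharedSecret = "12345678901234567890".getBytes(); // placeholder secret
        System.out.println(code(sharedSecret, System.currentTimeMillis() / 1000));
    }
}
```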

The idea is to remove as many vulnerabilities as possible. SMS, email, and phone calls can each be compromised independently. A physical key works like a secret, but it is even simpler: you stick it in a USB port and you are done.

In addition to the verification code, you will use a physical security key, which is the Yubikey you purchased. This is even easier than the security code and perhaps even more secure. Should you not have your security key with you, you can still enter your authentication code.

The next phase is securing everything else. That is where the password manager comes in. For the rest of your accounts LastPass will generate and remember all your authentication credentials. Use your private email address for accounts wherever you can and let LastPass autogenerate a very long and complex password for the site. LastPass works really well with Android for automatically entering your credentials into various applications. Of course for the few apps that aren’t supported you can always copy and paste your credentials manually.

LastPass can offer to change your passwords automatically, remember them, and even notify you if you have duplicate passwords, to mitigate security breaches. I haven't used these features much myself…but I probably should!

As for securing LastPass itself, you should set up both the verification-code two-factor authentication and the one-time password that the Yubikey supports (https://lastpass.com/yubico/). This makes your security ironclad. You should have only three secure passwords to remember: those for your public email address, your private address, and your LastPass account. Additionally, I use Authy, which keeps my authentication codes in sync between devices; it too has a password you can set. There are other two-factor authentication managers, like Google Authenticator, but I like Authy better because you can sync it across many devices, including Chrome and Android. Authy makes you verify from an existing device when you want to add a new one, which is a very nice security mechanism.

Change your passwords often and never reuse a password across accounts. Only use secure passwords: a minimum of eight characters with mixed case, numbers, and special characters. Just as you wouldn't walk down a street that looks unsafe…don't open an email that looks suspicious. Today's mugger may be more likely to steal from your digital life than your physical one. I'm not saying that with an actual statistic…though I wouldn't be surprised, depending on the location. A word to the wise: security is never going to make your life easier, and it won't happen magically. Don't wait to become a statistic, one of the many people who are taken advantage of and have aspects of their lives pried away from them. Do your due diligence and take these precautions.

Resources