Data Lake vs. Data Leak
The advent of scalable enterprise search built on (among other things) Lucene-based storage clusters has brought about a renaissance in analytics. Knowledge discovery across the enterprise has become practical, and many organizations have capitalized on it. Finding the relationship between response time and online sales, discovering that productivity drops when manufacturing backlogs grow, or determining which flight delays have the biggest economic impact on an airline: these are powerful pieces of business intelligence that are not obvious in large amounts of raw data. They can now be discovered, however, thanks to the scale and power of machine learning techniques applied to the large bodies of data stored in these “data lakes.”
Perhaps the tail has started wagging the dog. Since “some strategically chosen data” combined with “select machine learning algorithms” has yielded “value” to the enterprise, it seems only logical to dump in “more and more data” to get more and more “value.” This is not always true, however. Intuitively, additional data sources have declining value: each new data stream overlaps, at least in part, with what you already know. Yet this has not stopped many enterprises from simply bringing in “all the data” from as many sources as possible, in the hope of eventually exploiting its value through machine learning.
Moreover, since storage is cheap, what is the harm? You can always delete the data later if you end up not using it. Perhaps not surprisingly, we have recently seen a notable increase in attacks on these storage clusters. From password brute-forcing to exploits of software bugs, attackers are systematically finding ways into these enterprise data treasure troves. The more data you centralize in one location, the bigger the damage when the wrong hands gain access to it. This is when your large data lake turns into a regrettable data leak.
So although there is clear value in bringing data together for analytical purposes, we must weigh it against the risk of losing that data. Decentralized data carries a degree of implicit security: it is harder for one attacker or malicious insider to walk out with all the crown jewels. Once data is uploaded to the data lake, much of that control is lost. Similar arguments have been made about cloud storage, backup solutions, and other products, which has given rise to a rich fabric of solutions for data privacy, data centralization, and access control. For enterprise data lakes, however, this may still prove to be a challenge.
Ultimately, enterprises should judiciously decide how to provision and use their data lakes, what flows into them, and what the impact of a leak would be. Often there is a natural way to decentralize data, and proper network security practices must of course be followed. Many analytical techniques can also work through existing database APIs that allow data to be analyzed across many decentralized sources. This avoids drawing all the data into the lake, keeping it decentralized and under its native access control mechanisms. Once access is provisioned a leak may still occur, of course, but its scope will be significantly limited.
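To make the decentralized approach concrete, here is a minimal sketch using Python's built-in sqlite3 module. The regional split, table name, and schema are invented for illustration; in practice each source would be a remote database reached through its own API and access controls. The point is that each aggregation runs where the data lives, and only summary statistics travel to the analyst.

```python
import sqlite3

def make_regional_db(rows):
    """Create an in-memory stand-in for a remote regional database.
    (Hypothetical schema: one 'sales' table per region.)"""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (order_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn

def regional_total(conn):
    """Run the aggregation at the source; return only the summary."""
    (total,) = conn.execute("SELECT SUM(amount) FROM sales").fetchone()
    return total

east = make_regional_db([(1, 120.0), (2, 80.0)])
west = make_regional_db([(3, 200.0)])

# The analyst combines per-region totals; the raw order rows never
# leave their source, so a breach of the analysis environment exposes
# only aggregates, not the underlying records.
company_total = sum(regional_total(db) for db in (east, west))
print(company_total)  # 400.0
```

The same pattern generalizes: push filters and aggregations down to each source, and pull back only what the analysis actually needs.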