Categories
Data Science System

Big Data Handy References

Integrating Apache Hive with Kafka, Spark, and BI: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/integrating-hive/content/hive_hivewarehousesession_api_operations.html

Categories
Algorithms OMSCS

How To Compute Reconstruction Error for Random Projection

Random Projection is an interesting Dimensionality Reduction technique. You can project the data down to 1, 2, 3, ..., n dimensions, but how do you tell which choice is best? You need to calculate how much information is lost by the reduction in data size. # data has this shape: row, col = 4898, 11 random_projection […]
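One way to measure that loss is sketched below. This is a minimal illustration, not the code from the post: it assumes a Gaussian random matrix, maps the projected data back with the Moore-Penrose pseudo-inverse, and reports the mean squared error. The function name `reconstruction_error` and the synthetic data are hypothetical; only the 11-column width comes from the excerpt.

```python
import numpy as np


def reconstruction_error(data, n_components, seed=0):
    """Mean squared error between data and its reconstruction
    after a Gaussian random projection (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_features = data.shape[1]
    # Random Gaussian projection matrix, shape (n_components, n_features).
    components = rng.normal(
        0.0, 1.0 / np.sqrt(n_components), size=(n_components, n_features)
    )
    # Project down to n_components dimensions.
    projected = data @ components.T
    # Map back to the original space with the pseudo-inverse.
    reconstructed = projected @ np.linalg.pinv(components).T
    return float(np.mean((data - reconstructed) ** 2))


# Synthetic stand-in for the real dataset (assumption: 11 feature columns).
sample = np.random.default_rng(1).normal(size=(200, 11))
for k in range(1, 12):
    print(k, reconstruction_error(sample, k))
```

Comparing the printed errors across values of k gives a rough basis for picking the number of dimensions; the error is close to zero once k equals the number of original columns.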

Categories
Algorithms OMSCS

Artificial Neural Network: Perceptron Training

There are two types of training algorithms for the weights of a neuron. The first minimizes the error between the predicted y_hat and y, where y_hat = boolean(activation >= threshold). This perceptron-based learning works best for linearly separable data, where it is guaranteed to converge in a finite number of iterations. The second type is the Gradient Descent algorithm, which minimizes the […]
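The first type, the perceptron rule, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `train_perceptron`, the learning rate, the zero threshold, and the AND-gate data with an explicit bias column are all hypothetical choices, not taken from the post.

```python
import numpy as np


def train_perceptron(X, y, learning_rate=1.0, threshold=0.0, max_epochs=100):
    """Perceptron rule: nudge the weights whenever y_hat differs from y."""
    weights = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            # y_hat = boolean(activation >= threshold)
            y_hat = int(weights @ x_i >= threshold)
            if y_hat != y_i:
                weights += learning_rate * (y_i - y_hat) * x_i
                mistakes += 1
        if mistakes == 0:
            # A full pass without errors: the data has been separated.
            break
    return weights


# AND gate; the last column is a constant 1 acting as a bias input.
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1])
w = train_perceptron(X, y)
predictions = (X @ w >= 0.0).astype(int)
```

Because AND is linearly separable, the loop stops after a handful of passes; on data that is not separable it would simply run until `max_epochs`.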

Categories
Algorithms

Properties of Tree

Quoting from the Dasgupta, Papadimitriou, and Vazirani textbook: “Trees A tree is an undirected graph that is connected and acyclic.” p.135 Trees have these properties, quoting from DPV again: A tree on n nodes has n − 1 edges. Any connected, undirected graph G = (V, E) with |E| = |V| − 1 is a tree. […]
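The second quoted property suggests a simple tree test: check the edge count, then check connectivity. The sketch below assumes vertices are numbered 0 to n − 1 and that there is at least one vertex; the function name `is_tree` is hypothetical.

```python
def is_tree(num_nodes, edges):
    """Return True if the undirected graph is connected with |E| = |V| - 1."""
    if num_nodes < 1 or len(edges) != num_nodes - 1:
        return False
    adjacency = {v: [] for v in range(num_nodes)}
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    # Iterative DFS from vertex 0 to test connectivity.
    seen = {0}
    stack = [0]
    while stack:
        u = stack.pop()
        for v in adjacency[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == num_nodes
```

For example, `is_tree(4, [(0, 1), (1, 2), (1, 3)])` holds, while a triangle plus an isolated vertex has the right edge count but fails the connectivity check.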

Categories
Algorithms

Detect Cycle in a Directed Graph

A directed graph without any cycle is a Directed Acyclic Graph (DAG). If there is a cycle, then there will be a back edge, which points from a vertex back to one of its ancestors in the DFS tree. For such an edge (u, v), the postorder number of u will be smaller than that of v, i.e. post(u) < post(v). So after DFS, if any edge satisfies post(u) < post(v) […]
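The postorder check can be sketched as below. This is an illustrative version, assuming the graph is a dict mapping each vertex to a list of its out-neighbours; the name `has_cycle` is hypothetical. It uses `<=` rather than `<` so that a self-loop, where u and v are the same vertex, also counts as a cycle.

```python
def has_cycle(graph):
    """Detect a cycle by looking for an edge (u, v) with post(u) <= post(v)."""
    post = {}
    visited = set()
    clock = 0

    def explore(u):
        nonlocal clock
        visited.add(u)
        for v in graph.get(u, []):
            if v not in visited:
                explore(v)
        # u is finished only after all of its descendants.
        post[u] = clock
        clock += 1

    for u in graph:
        if u not in visited:
            explore(u)

    # Tree, forward, and cross edges all have post(u) > post(v).
    return any(post[u] <= post[v] for u in graph for v in graph[u])
```

The recursion is fine for small graphs; a very deep graph would need an explicit stack or a higher recursion limit.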

Categories
Algorithms

Connected Components in Graphs

There can be three types of graphs here. 1. Undirected Graph For undirected graphs, we can use Depth First Search (DFS) to find the connected component number for each vertex. The runtime is O(|V|+|E|). 2. Directed Graph Directed graphs can be of two types: Directed Acyclic Graphs (DAG) and General Directed Graphs. DAGs do not […]
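For the undirected case, the component numbering can be sketched as follows. The sketch assumes an adjacency-list dict in which every vertex appears as a key and every edge is listed in both directions; the function name `connected_components` is hypothetical.

```python
def connected_components(graph):
    """Label each vertex of an undirected graph with a component number."""
    component = {}
    current = 0
    for start in graph:
        if start in component:
            continue
        # Every unlabelled start vertex opens a new component.
        current += 1
        component[start] = current
        stack = [start]
        while stack:
            u = stack.pop()
            for v in graph[u]:
                if v not in component:
                    component[v] = current
                    stack.append(v)
    return component


labels = connected_components({1: [2], 2: [1], 3: [], 4: [5], 5: [4]})
```

Each vertex and each edge is handled a constant number of times, which matches the O(|V|+|E|) runtime stated above.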

Categories
Algorithms

Topologically Sorting a DAG

In Directed Acyclic Graphs (DAG), there are no cycles. So it is simple to topologically sort the vertices: run DFS once and order the vertices by post-order visit number in decreasing order. The run time for DFS is again O(|V|+|E|).
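Ordering by decreasing post-order number can be sketched like this. The sketch assumes the input is a DAG given as a dict of out-neighbour lists; the function name `topological_order` and the sample graph are hypothetical.

```python
def topological_order(graph):
    """Return the vertices of a DAG ordered by decreasing DFS post-order."""
    visited = set()
    finished = []  # vertices in increasing post-order

    def explore(u):
        visited.add(u)
        for v in graph.get(u, []):
            if v not in visited:
                explore(v)
        finished.append(u)

    for u in graph:
        if u not in visited:
            explore(u)

    # Reversing gives decreasing post-order, so sources come first.
    return finished[::-1]


dag = {"shirt": ["tie"], "tie": ["jacket"], "trousers": ["jacket"], "jacket": []}
order = topological_order(dag)
```

In the returned list, every edge (u, v) has u placed before v. Appending on finish and reversing avoids an explicit sort, so the whole procedure stays linear.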

Categories
Algorithms

Finding x in an Infinite Array

This is a programming problem where the given array A is of infinite length and we have to find the position of a value x in it. The first n values are sorted, and after the n-th value all remaining entries in the array are None. For example: A= [1, 3, 5, 100, 102, 1050, 1061, […]
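One common approach, not necessarily the one the post takes, is to double an upper bound until it passes x or hits None, then binary search inside that window. In the sketch below the infinite array is simulated by a hypothetical accessor `get(i)` that returns None past the end, and the sample values reuse only the visible prefix of the example above.

```python
def find_in_unbounded(get, x):
    """Find the index of x in a sorted array of unknown length, or -1."""
    low, high = 0, 1
    # Doubling phase: grow the window until get(high) is None or >= x.
    while get(high) is not None and get(high) < x:
        low = high
        high *= 2
    # Binary search in [low, high], treating None as larger than any value.
    while low <= high:
        mid = (low + high) // 2
        value = get(mid)
        if value is None or value > x:
            high = mid - 1
        elif value < x:
            low = mid + 1
        else:
            return mid
    return -1


prefix = [1, 3, 5, 100, 102, 1050, 1061]  # illustrative sample only


def get(i):
    # Simulates the infinite array: None beyond the sorted prefix.
    return prefix[i] if i < len(prefix) else None
```

Both phases take O(log p) probes, where p is the position of x, so the length n never has to be known in advance.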

Categories
Algorithms Data Science

Introduction to Reinforcement Learning: Key Terminologies

An important Machine Learning concept is Reinforcement Learning, which is different from the more common Supervised or Unsupervised Learning models. In Supervised learning, you have labels for training; in Unsupervised learning, there is no labeled data. Reinforcement Learning falls in between the two because it does not have a label but it learns from […]

Categories
Data Science System

Distributed/Big Data Geospatial Processing Tools

Work in progress; I will write about each approach in detail later. For now, this just summarizes the tools for connecting to Hadoop and running geospatial processing on a large dataset. I am working on a ~100 GB Hive table, which is just a small subset of the original dataset. http://geospark.datasyslab.org/ https://pypi.org/project/geopyspark/ https://github.com/Esri/gis-tools-for-hadoop/wiki Kinetica GPU Database – Graph solver […]