Categories
Data Science System

Tips: Reading Hive Tables from Spark

A collection of useful tips for working with Big Data tools, including Hadoop, Hive, and Spark

Categories
Data Science System

Big Data Handy References

Integrating Apache Hive with Kafka, Spark, and BI: https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.1.5/integrating-hive/content/hive_hivewarehousesession_api_operations.html

Categories
System

System Design (Draft Post)

Topics: Classical Design Patterns; Consistency; Basics of Distributed Systems; How databases work; Message queueing; Performance; Applicability; Scalability; Reliability. Questions: How do you design a messaging service? How do you design a database system? How do you design a scalable hashtagging system? Spend 30-40 minutes on each problem. Spend a certain amount of time on each aspect, for example to […]

Categories
Data Science System

Distributed/Big Data Geospatial Processing Tools

Work in progress. I will write about each approach in more detail later. For now, this just summarizes the tools for connecting to Hadoop and running geospatial processing on a large dataset. I am working on a ~100 GB Hive table, which is just a small subset of the original dataset. Tools: http://geospark.datasyslab.org/ ; https://pypi.org/project/geopyspark/ ; https://github.com/Esri/gis-tools-for-hadoop/wiki ; Kinetica GPU Database – Graph solver […]

Categories
System

Merge a repo with another as a subfolder

Sometimes we end up with one main repository and another, independently developed repository for a new feature. Later it may turn out that the independent repo needs to become part of the main repo as a subfolder. To do that, we can use the git subtree command. We need to put the […]
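
As a rough illustration of the subtree approach, the sketch below builds and optionally runs a `git subtree add` command from Python. The prefix, repository location, and branch name are hypothetical placeholders, and this is only one way to do it, not necessarily the exact steps the full post goes on to describe.

```python
import subprocess
from typing import List


def subtree_add_command(prefix: str, repository: str, ref: str) -> List[str]:
    """Build the argv for `git subtree add`; nothing is executed here."""
    return ["git", "subtree", "add", f"--prefix={prefix}", repository, ref]


def merge_as_subfolder(main_repo_dir: str, prefix: str, repository: str, ref: str) -> None:
    """Run the command inside the main repository's working tree.

    Raises subprocess.CalledProcessError if git exits with a non-zero status.
    """
    subprocess.run(
        subtree_add_command(prefix, repository, ref),
        cwd=main_repo_dir,
        check=True,
    )


# Hypothetical values, for illustration only.
command = subtree_add_command("features/new_feature", "../feature-repo", "main")
print(" ".join(command))
```

Keeping the command construction separate from the execution makes it easy to inspect the exact git invocation before touching the repository.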

Categories
Linux System

Best way to set up PYTHONPATH for crontab

There are many suggestions on this. Add PYTHONPATH at the end of the ~/.bash_profile or ~/.bash_login files. If they do not exist, add it to ~/.bash_profile as suggested by this StackOverflow post. export PYTHONPATH="${PYTHONPATH}:/home/path/to/your/python/package/" But this will add the package to the current user's PYTHONPATH. To ensure crontab picks it up right away, add this line to […]

Categories
Linux System

Best way to set up PYTHONPATH for crontab

When setting up a crontab job on a Linux machine, these essential steps are required for the job to run successfully: Update the cron file by adding the new script on a schedule. Check the frequency of the schedule. For example, to run at 7-minute intervals, use */7 * * * * python /path/to/script.py Or for running every hour […]
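
One detail worth checking about the `*/7` schedule above: a step in the minute field selects the minutes of each hour that are divisible by the step, so the spacing is not uniform across the hour boundary. The short sketch below works this out; the helper name is made up for illustration.

```python
def minutes_matching_step(step):
    """Minutes of the hour at which a `*/step` minute field fires."""
    return [minute for minute in range(60) if minute % step == 0]


fire_minutes = minutes_matching_step(7)

# Gaps between consecutive runs, including the wrap-around into the next hour.
next_runs = fire_minutes[1:] + [fire_minutes[0] + 60]
gaps = [later - earlier for earlier, later in zip(fire_minutes, next_runs)]

print(fire_minutes)
print(gaps)
```

The last run of each hour is at minute 56 and the next one is at minute 0, so that gap is 4 minutes rather than 7. Steps that divide 60 evenly (5, 10, 15, ...) do not have this irregularity.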

Categories
Linux System

How to check size of a directory in Linux

Personal note: du -shc /path/to/directory Here, -s stands for summarize (one total per argument instead of a line for every subdirectory), -h for human-readable sizes, and -c for a grand total across all arguments.

Categories
Linux System

How to unzip and read gzipped JSON files from URL in Python

The Problem: Sometimes we end up gzipping JSON files and putting them up somewhere on the interweb. Later we need to read such a file back from the HTTP server and parse it using Python. For that situation, let us assume that the gzipped JSON is located at this URL: http://example.com/python_list_turned_into.json.gz. To read this file, we need to […]
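
A minimal sketch of one way to do this with only the standard library is shown below. It assumes the file is a single gzip stream containing UTF-8 encoded JSON; the function names are made up for illustration and the full post may take a different route.

```python
import gzip
import json
from urllib.request import urlopen


def parse_gzipped_json(raw):
    """Decompress gzip bytes and parse the result as JSON."""
    return json.loads(gzip.decompress(raw).decode("utf-8"))


def fetch_gzipped_json(url, timeout=30.0):
    """Download a .json.gz file over HTTP and return the parsed object."""
    with urlopen(url, timeout=timeout) as response:
        return parse_gzipped_json(response.read())


if __name__ == "__main__":
    # The example URL from the text; it is a placeholder, not a real file.
    data = fetch_gzipped_json("http://example.com/python_list_turned_into.json.gz")
    print(type(data))
```

Splitting the parsing from the download keeps the decompression step testable without network access, for example by feeding it the output of gzip.compress.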