Spark Quick Start

Apache Spark™ is a unified analytics engine for large-scale data processing.

  • Speed
    • Computation optimization (DAG, thread-based)
    • In memory computing
  • Easy to Use & Generality
    • Interactive Shell with the Scala, Python, R, and SQL shells.
    • 80 high-level operators that make it easy to build parallel apps.
    • Combine SQL, streaming, and complex analytics.(多个框架的学习成本)
  • Runs Everywhere
    • Spark runs on Hadoop, Apache Mesos, Kubernetes, standalone, or in the cloud.
    • 技术选型!Access data in HDFS, Apache Cassandra, Apache HBase, Apache Hive, and hundreds of other data sources.

Overview

History

Quick Start Reference : Quick Start Spark

Install Spark without Hadoop

  • Download a packaged release of Spark from the Spark website.
  • Since we won’t be using HDFS, you can download a package for any version of Hadoop.
  • 下载即可用

Interact with the Spark Shell

Spark’s shell provides a simple way to learn the API, as well as a powerful tool to analyze data interactively. It is available in either Scala or Python.

Start

  • Start it by running pyspark in the Spark directory
1
./bin/pyspark

DataSet

Spark’s primary abstraction is a distributed collection of items called a Dataset.(Before Spark2.x, it called Resilient Distributed Dataset (RDD)) Datasets can be created from Hadoop InputFormats (such as HDFS files) or by transforming other Datasets.

Due to Python’s dynamic nature, we don’t need the Dataset to be strongly-typed in Python. As a result, all Datasets in Python are Dataset[Row], and we call it DataFrame to be consistent with the data frame concept in Pandas and R.

  • Make a new DataFrame from the text of the README file
1
2
3
4
5
6
>>> textFile = spark.read.text("README.md")
>>> textFile.count() # Number of rows in this DataFrame
103

>>> textFile.first() # First row in this DataFrame
Row(value=u'# Apache Spark')
  • Transform this DataFrame to a new one
    1
    2
    3
    4
    5
    # We call filter to return a new DataFrame with a subset of the lines in the file.
    >>> linesWithSpark = textFile.filter(textFile.value.contains("Spark"))
    # How many lines contain "Spark"?
    >>> textFile.filter(textFile.value.contains("Spark")).count()
    20

Self-Contained Applications

Scala (with sbt), Java (with Maven), and Python (pip).

  • This program just counts the number of lines containing ‘a’ and the number containing ‘b’ in a text file.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
"""spark-test.py"""
from pyspark.sql import SparkSession

# TODO : YOUR_SPARK_HOME
logFile = "/Users/kaihaoli/Downloads/spark-2.3.0-bin-hadoop2.7/README.md"
spark = SparkSession.builder.appName("SimpleApp").getOrCreate()
logData = spark.read.text(logFile).cache()

numAs = logData.filter(logData.value.contains('a')).count()
numBs = logData.filter(logData.value.contains('b')).count()

print("Lines with a: %i, lines with b: %i" % (numAs, numBs))

spark.stop()
  • Install pySpark

    1
    > pip install pyspark
  • Run the simple application

    1
    2
    3
    4
    > python spark-test.py
    ........
    Lines with a: 61, lines with b: 30
    .......

Spark Streaming With Kafka

SparkContext

  • Main entry point for Spark functionality.
  • A SparkContext represents the connection to a Spark cluster, and can be used to create RDD and broadcast variables on that cluster.
  • master = “local[2]”, appName = “StockAveragePrice“
    1
    2
    3
    from pyspark import SparkContext
    sc = SparkContext("local[2]", "StockAveragePrice")
    sc.setLogLevel('ERROR')

StreamingContext

  • Main entry point for Spark Streaming functionality.
  • A StreamingContext represents the connection to a Spark cluster, and can be used to create DStream various input sources.
    1
    2
    3
    from pyspark.streaming import StreamingContext
    ssc = StreamingContext(sc, 5)
    # sparkContext = sc, batchDuration = 5

KafkaUtils

  • KafkaUtils.createDirectStream(ssc, topics, kafkaParams,…)
  • Create an input stream that directly pulls messages from a Kafka Broker and specific offset.
    1
    2
    3
    4
    5
    6
    7
    8
    from pyspark.streaming.kafka import KafkaUtils
    topic = "test"
    brokers = "192.168.99.100:9092"
    # Kafka Consumer
    directKafkaStream = KafkaUtils.createDirectStream(
    ssc,
    [topic],
    {'metadata.broker.list': brokers})
Share

Bit Manipulation 位运算

Basic

知识点1 原码 反码 补码

最高位存符号位

正数:原码=反码=补码 ;

负数:符号位不变,原码 –(取反)–> 反码 –(+1)–> 补码

浮点数的位表示

1
2
3
4
5
6
>>> 0.2+0.4==0.6
False
>>> 0.2+0.4
0.6000000000000001
# Python, Java(Double), Why not 0.6?
# tips:做浮点数的时候,知道有精度丢失这么个点

知识点2 位运算符

  • Java 位运算符
位运算符号
& 注意和逻辑运算符区别 &&
| 逻辑运算符 ||
逻辑运算符 !
^ 异或 相异为1 相同为0
<< left shift 补0, 会溢出的”*2”
>> right shift 补0或1,保持和最高位相同Why?”/2”
>>> 无符号right shift 补0
  • Python 位运算符
位运算符号
& and
| or
not
^ 异或 相异为1 相同为0
<< left shift 补0 无溢出的 “*2”
>> right shift 补0或1,保持和最高位相同 “/2”

知识点3 空间优化

Contain Unique Letter Word

Determine whether a word contains all letters that are unique (no duplicate letters in the word).

  • 1 HashSet
  • 2 Array,256个ASCII码 长度256的Array
  • 3 Bit Vector 充分利用空间!!
    • xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx 0~31th bit
    • xxxx xxxx xxxx xxxx xxxx xxxx xxxx xxxx 32~63th bit
    • … …

318. Maximum Product of Word Lengths

给定一个字符串数组words,寻找length(words[i]) * length(words[j])的最大值,其中两个单词不包含相同字母。你可以假设每一个单词只包含小写字母。如果不存在这样的两个单词,返回0

  • 1 Two words do not share common letters :Bit!
  • 2 包含相同类型letter的word只用取长度最长的
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
class Solution(object):
def maxProduct(self, words):
n = len(words)
if n <= 1: return 0

d = {}
for word in words:
mask = 0
for ch in word:
mask |= (1 << (ord(ch) - ord('a')))
d[mask] = max(d.get(mask, 0), len(word))

# pythonic:
# return max([d[x]*d[y] for x in d for y in d if not x & y])
res = 0
for x in d:
for y in d:
if not x & y:
res = max(res, d[x] * d[y])
return res

异或^

  • 两个相等的数经过异或运算会相互抵消

通常用来删除偶数个相同的数字,保留基数个个数的数字

Swap Two Integers

不建议在实际应用中采用

  • 1 不一定比朴素方法快
  • 2 a 和 b 引用同一变量
1
2
3
4
5
6
7
8
9
void Swap(int &a, int &b)  
{
if (a != b)
{
a ^= b;
b ^= a;
a ^= b;
}
}

389. Find the Difference

给定两个字符串s和t,都只包含小写字母。字符串t由字符串s打乱顺序并且额外在随机位置添加一个字母组成。
寻找t中新增的那个字母。(和single number一样)

1
2
3
4
5
6
7
class Solution(object):
def findTheDifference(self, s, t):
res = 0
n = len(s)
for i in range(n):
res ^= ord(s[i]) ^ ord(t[i])
return chr(res ^ ord(t[n]))

268. Missing Number

给定一个包含从0, 1, 2, …, n, 选出的n个不同数字的数组,从中找出数组中缺失的那一个数。这里是少了一个数,single number 是多了一个数。

1
2
3
4
5
6
class Solution(object):
def missingNumber(self, nums):
ret = 0
for n in nums+range(len(nums)+1):
ret ^= n
return ret

136. Single Number

所有数字都出现偶数次,只有一个数字出现奇数次,求出现奇数次的那个数

  • 解法1. HashSet

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    # Time : O(n), Space : O(n)
    class Solution(object):
    def singleNumber(self, nums):
    numSet = set()
    for n in nums:
    if n in numSet:
    numSet.remove(n)
    else:
    numSet.add(n)
    return list(numSet)[0]
  • 解法2.异或

    1
    2
    3
    4
    5
    6
    7
    8
    # Time : O(n), Space : O(1)
    # 偶数次异或可以抵消
    class Solution(object):
    def singleNumber(self, nums):
    ret = 0
    for n in nums:
    ret ^= n
    return ret

137. Single Number II

所有数字出现三次,只有一个数字出现一次,求出现一次那个数。

  • 解法1. HashTable

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    # Time : O(n), Space : O(n)
    class Solution(object):
    def singleNumber(self, nums):
    dic = {}
    for n in nums:
    if n in dic:
    if dic[n] == 2:
    del dic[n]
    else:
    dic[n] += 1
    else:
    dic[n] = 1
    return dic.keys()[0]
  • 解法2. 用二进制模拟三进制运算
    从一个bit出发考虑到32个bits

1
2
3
4
cnt  : 0    1    2     3          4
one : 0 -> 1 -> 0 -> (1 -> 0) -> 1
two : 0 -> 0 -> 1 -> (1 -> 0) -> 0
mask : 1 1 1 1 0
1
2
3
4
5
6
7
8
9
10
11
# Time : O(n), Space : O(1)
class Solution(object):
def singleNumber(self, nums):
one = two = 0
for n in nums:
two |= n & one
one ^= n
mask = ~(one & two)
one &= mask
two &= mask
return one

260. Single Number III

所有数字都出现两次,但是有两个数字只出现了一次。

1
2
3
4
a ^ a ^ b ^ b ^ c ^ d = c ^ d
0101 0100 c
1001 0010 d
1100 0110 c^d

  • 把只出现一次的c和d分到两个组里
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
class Solution(object):
def singleNumber(self, nums):
mask = 0
for n in nums:
mask ^= n

mask &= (-mask)

a, b = 0, 0
for n in nums:
if n & mask:
a ^= n
else:
b ^= n
return [a, b]

371. Sum of Two Integers

  • 用^和&,构成加法运算

^表示相加后的结果,&后的<< 1表示相加后的进位
(理解加减法的位操作)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
// Iterative
public int getSum(int a, int b) {
if (a == 0) return b;
if (b == 0) return a;

while (b != 0) {
int carry = a & b;
a = a ^ b;
b = carry << 1;
}
return a;
}

// Recursive
public int getSum(int a, int b) {
return (b == 0) ? a : getSum(a ^ b, (a & b) << 1);
}

n & (n-1)

  • 算是一个小trick!
  • n&(n-1)可消除n中最右边(低位)的一个1
1
2
3
0101 1000 n
0101 0111 n - 1
0101 0000 n & (n-1)

191. Number of 1 Bits

统计一个整数用二进制表示时1的个数

1
2
3
4
5
6
7
8
9
// Java
public int hammingWeight(int n) {
int cnt = 0;
while (n != 0) {
cnt++;
n &= (n - 1);
}
return cnt;
}

231. Power of Two

判断一个数是不是2的n次幂

1
2
3
4
0000 0001 2^0
0000 0010 2^1
0000 0100 2^2
0000 1000 2^3

1
2
3
4
5
6
// Java
class Solution {
public boolean isPowerOfTwo(int n) {
return (n > 0) && (((n - 1) & n) == 0);
}
}

326. Power of Three

  • 也有一行代码更数学上tricky的方法:return n > 0 && 1162261467 % n == 0;
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
// Java1 - Math
public class Solution {
public boolean isPowerOfThree(int n) {
return (Math.log10(n) / Math.log10(3)) % 1 == 0;
}
}

// Java2 - Forloop
class Solution {
public boolean isPowerOfThree(int n) {
while (n > 0 && (n%3 == 0)) {
n /= 3;
}
return n == 1;
}
}

342. Power of Four

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
// Java1 for loop
class Solution:
def isPowerOfFour(self, n):
if n < 1:
return False
while n % 4 == 0:
n = n / 4
return n == 1

// Java2 Math
// 0000 0001 1
// 0000 0100 4
// 0001 0000 16
// 0100 0000 64
class Solution {
public boolean isPowerOfFour(int n) {
//一个数学的方法:(4^n - 1) is multiple of 3
return (n > 0) && (n&(n-1)) == 0 && (n-1) % 3 == 0;
}
}

461. Hamming Distance

两个整数的汉明距离是指其二进制不相等的位的个数

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
class Solution(object):
def hammingDistance(self, x, y):
n = x ^ y # => Count bits of a number
cnt = 0
while n:
n &= (n-1)
cnt += 1
return cnt

def hammingDistance1(self, x, y):
n = x ^ y
cnt = 0
while n:
cnt += n & 1
n >>= 1
return cnt

477. Total Hamming Distance

两个整数的汉明距离是指其二进制不相等的位的个数, 计算给定的整数数组两两之间的汉明距离之和。

  • 解法1. n^2 paris (TLE)
  • 解法2. 拆解integer
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    # Time : O(32 * n), Space : O(1)
    class Solution(object):
    def totalHammingDistance(self, nums):
    n = len(nums)
    if n <= 1: return 0

    res = 0
    mask = 1
    for _ in xrange(32):
    cnt1 = 0
    for num in nums:
    if num & mask:
    cnt1 += 1
    mask <<= 1
    res += (n - cnt1) * cnt1
    return res

201. Bitwise AND of Numbers Range

1
2
3
4
5
6
# [26, 30]
11010
11011
11100
11101
11110

仔细观察这个过程我们可以得出^_^,最后的结果是该范围内所有数的左边的共同部分,即公共左边首部(left header)

1
2
3
4
5
class Solution(object):
def rangeBitwiseAnd(self, m, n):
while m < n:
n &= n-1
return n

Advanced Topic

n & (-n)

n &-n returns the rightmost 1 bit in n., 属于第一次见,很难理解的小trick,这个会在Binary Indexed Tree中用到!

1
2
3
4
5
6
7
8
9
10
11
12
n & (n ^ (n-1)) = n & (-n)

0000 1011 11
1111 0100
1111 0101 -11
0000 0001 11 & (-11)

1010 1000 168
0101 0111
0101 1000 -168
0000 1000 168 & (-168)
# 充分利用了“取反加一”!

正整数n的二进制表示中,只保留最低位的1的值!

Share

Kafka Quick Start using Docker

Reference : Kafka Quick Start Guide

Pre-Requisites

Docker

1 Create Virtual Machine

Get the env for docker-machine, 2CPU, 2G

1
> docker-machine create --driver virtualbox --virtualbox-cpu-count 2 --virtualbox-memory 2048 bigdata

This command downloads a lightweight Linux distribution (boot2docker) with the Docker daemon installed, and creates and starts a VirtualBox VM with Docker running.

  • docker-machine create 创建一个Docker主机
  • –driver virtualbox flag to indicate which provider (VirtualBox, DigitalOcean, AWS, etc.) the machine should be created on.
  • ‘bigdata’ is an argument to indicate the name of the created machine.

list available machines.

1
> docker-machine ls

Get the IP address of one or more machines.

1
> docker-machine ip bigdata

2 Connect your shell to the machine

  • Show the virtual machine environment variable

    1
    2
    3
    4
    5
    > docker-machine env bigdata
    export DOCKER_TLS_VERIFY="1"
    export DOCKER_HOST="tcp://192.168.99.100:2376"
    export DOCKER_CERT_PATH="/Users/yourhostname/.docker/machine/machines/bigdata"
    export DOCKER_MACHINE_NAME="bigdata"
  • Run this command to configure your shell

    1
    2
    # Each time you want to use this virtual machine, you should run this command first!
    > eval $(docker-machine env bigdata)

3 Pull Images

1
2
> docker pull wurstmeister/zookeeper    
> docker pull wurstmeister/kafka

4 Run Images

  • Start zookeeper container

    1
    2
    3
    4
    5
    6
    7
    > docker run \
    -d \
    -p 2181:2181 \
    -p 2888:2888 \
    -p 3888:3888 \
    --name zookeeper \
    wurstmeister/zookeeper
  • Start kafka container

    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    > docker run \
    -d \
    -p 9092:9092 \
    # IP=$(docker-machine ip ${MACHINE_NAME})
    -e KAFKA_ADVERTISED_HOST_NAME="${IP}" \
    -e KAFKA_ADVERTISED_PORT=9092 \
    --name kafka \
    --link zookeeper:zookeeper \
    -e KAFKA_ZOOKEEPER_CONNECT=zookeeper:2181 \
    wurstmeister/kafka

5 Kafka Test

  • Enter the Kafka container

    1
    2
    # here CONTAINER ID = kafka
    > docker exec -it ${CONTAINER ID} /bin/bash
  • Change to kafka default directory

    1
    > cd opt/kafka_2.xx-x.xx.x.x/
  • Create a Topic

    1
    > bin/kafka-topics.sh --create --zookeeper zookeeper:2181 --replication-factor 1 --partitions 1 --topic mykafka
  • Run a Producer

    1
    > bin/kafka-console-producer.sh --broker-list localhost:9092 --topic mykafka
  • Run a Consumer(in another terminal!)

    1
    > bin/kafka-console-consumer.sh --zookeeper zookeeper:2181 --topic mykafka --from-beginning
  • TEST
    When you type in some words in producer, it will appear in the consumer terminal. Kafka will replay the messages you have sent as a producer.

All Done! :)

Share

Docker

参考资源:

History

  • Docker 最初是 dotCloud 公司创始人 Solomon Hykes 在法国期间发起的一个公司内部项目,它是基于 dotCloud 公司多年云服务技术的一次革新,并于 2013 年 3 月以 Apache 2.0 授权协议开源,主要项目代码在 GitHub 上进行维护。Docker 项目后来还加入了 Linux 基金会,并成立推动 开放容器联盟(OCI)。

  • Docker 自开源后受到广泛的关注和讨论,至今其 GitHub 项目已经超过 4 万 6 千个星标和一万多个 fork。甚至由于 Docker 项目的火爆,在 2013 年底,dotCloud 公司决定改名为 Docker。Docker 最初是在 Ubuntu 12.04 上开发实现的;Red Hat 则从 RHEL 6.5 开始对 Docker 进行支持;Google 也在其 PaaS 产品中广泛应用 Docker。

  • Docker 使用 Google 公司推出的 Go 语言 进行开发实现

  • Docker 在容器的基础上,进行了进一步的封装,从文件系统、网络互联到进程隔离等等,极大的简化了容器的创建和维护。使得 Docker 技术比虚拟机技术更为轻便、快捷。

Why use Docker?

  • Docker - Build, Ship, and Run Any App, Anywhere
  • 向服务器部署代码非常困难
  • 复杂的技术栈, 不同技术栈的版本依赖与冲突
  • How to solve it?

    Docker vs VM

What is Docker?

Docker的三个概念

  • 1 镜像(Image)
    • 类似于虚拟机中的镜像,是一个包含有文件系统的面向Docker引擎的只读模板。
    • 任何应用程序运行都需要环境,而镜像就是用来提供这种运行环境的。例如一个Ubuntu镜像就是一个包含Ubuntu操作系统环境的模板,同理在该镜像上装上Apache软件,就可以称为Apache镜像。
  • 2 容器(Container)
    • 类似于一个轻量级的沙盒,可以将其看作一个极简的Linux系统环境(包括root权限、进程空间、用户空间和网络空间等),以及运行在其中的应用程序。
    • Docker引擎利用容器来运行、隔离各个应用。容器是镜像创建的应用实例,可以创建、启动、停止、删除容器,各个容器之间是是相互隔离的,互不影响
    • 注意:镜像本身是只读的,容器从镜像启动时,Docker在镜像的上层创建一个可写层,镜像本身不变。
  • 3 仓库(Repository)
    • 类似于代码仓库,这里是镜像仓库,是Docker用来集中存放镜像文件的地方。
    • 注意与注册服务器(Registry)的区别:注册服务器是存放仓库的地方,一般会有多个仓库;而仓库是存放镜像的地方,一般每个仓库存放一类镜像,每个镜像利用tag进行区分,比如Ubuntu仓库存放有多个版本(12.04、14.04等)的Ubuntu镜像。

Union File System

Docker Components

  • Docker daemon
    • Runs on a host machine. It does the heavy lifting of building, running, and distributing Docker containers
  • Docker client
    • The primary user interface to Docker. It accepts commands from the user and communicates back and forth with a Docker daemon
  • Docker images
    • Docker image is a read-only template, which is used to create containers
  • Docker registries
    • Docker registries hold images, which are public or private stores from which you upload or download images.
  • Docker containers
    • Docker containers are similar to a directory.

性质

  • A Docker container holds everything that is needed for an application to run.
  • Each container is created from a Docker image.
  • Docker containers can be run, started, stopped, moved, and deleted.
  • Each container is an isolated and secure application platform.

Application

Docker 的主要用途,目前有三大类。

  • 1 提供一次性的环境。比如,本地测试他人的软件、持续集成的时候提供单元测试和构建的环境。
  • 2 提供弹性的云服务。因为 Docker 容器可以随开随关,很适合动态扩容和缩容。
  • 3 组建微服务架构。通过多个容器,一台机器可以跑多个服务,因此在本机就可以模拟出微服务架构。

Docker on Mac

Install

1
$ brew cask install docker
Share

Message Queue 消息队列

参考资料:

What is Message Queue

  • A message queue is a queue of messages sent between applications. 中间人
  • Message queues provide an asynchronous communications protocol, meaning that the sender and receiver of the message do not need to interact with the message queue at the same time.
  • An application framework for sending and receiving messages
  • A way to communicate between applications / systems
  • A way to decouple components
  • A way to offload work (handle to another worker)

Application

  • Allow web servers to respond to requests quickly instead of being forced to perform resource-heavy procedures.
  • Able to distribute a message to multiple recipients for consumption or for balancing loads between workers.
  • 消息队列中间件是分布式系统中重要的组件,主要解决应用耦合,异步消息,流量削锋等问题。实现高性能,高可用,可伸缩和最终一致性架构。是大型分布式系统不可缺少的中间件。
  • 目前在生产环境,使用较多的消息队列有ActiveMQ,RabbitMQ,ZeroMQ,Kafka,MetaMQ,RocketMQ等。

Message Queue in Real Life

对实时性要求不高的任务!

  • Image Resizing
  • Video Processing
  • Sending out Emails (Ad Campaigns)
  • Log Analysis

AMPQ协议

Message Queue Protocol

  • AMQP - Advanced Message Queue Protocol (RabbitMQ)
  • STOMP - Streaming Text Oriented Messaging Protocol (ActiveMQ)
  • XMPP - Extensible Messaging and Presence Protocol

Network wire-level protocol

  • Defines how clients and brokers talk
  • Data serialization, heartbeat

AMQP Model

  • Defines routing and storing of messages
  • Defines rules how these are wired together
  • Exported API

RabbitMQ

  • RabbitMQ是流行的开源消息队列系统,用erlang语言开发。RabbitMQ是AMQP(高级消息队列协议)的标准实现。Implements AMQP
  • Easy to use, 支持多种客户端,如:Python、Ruby、.NET、Java、JMS、C、PHP、ActionScript、XMPP、STOMP等,支持AJAX,持久化。用于在分布式系统中存储转发消息,在易用性、扩展性、高可用性等方面表现不俗。
  • Open source and commercially supported

RabiitMQ Solution

Share

Redis

(参考资料:Redis学习教程Gitbook, RUNOOB Redis 教程)

What is Redis

  • Redis is an opensource(BSDlicensed), in-memory datas tructure store, used as database, cache and message broker.
  • database只是Redis的一个应用, In-memory data structure store
Data Structure Operations
Strings - 字符串 SET, GET, APPEND
Hashes - 哈希表 HSET, HGET
List - 列表 LPUSH, LRANGE, LSET, LTRIM
Set - 集合 SADD, SMEMBERS
Sorted Set ZADD, ZRANGE, ZRANGEBYSCORE

Why Redis

  • Redis 有三个主要的特点, 有别于其它很多竞争对手 :
    • Redis支持数据的持久化,可以将内存中的数据保持在磁盘中,重启的时候可以 再次加载进行使用
    • Redis不仅仅支持简单的key-value类型的数据,同时还提供list,set,zset,hash等数据结构的存储
    • Redis支持数据的备份,即master-slave模式的数据备份。
  • Redis 优势
    • 性能极高 – Redis能读的速度是110000次/s,写的速度是81000次/s 。
    • 丰富的数据类型 – Redis支持二进制案例的 Strings, Lists, Hashes, Sets 及 Ordered Sets 数据类型操作。
    • 原子 – Redis的所有操作都是原子性的,意思就是要么成功执行要么失败完全不执行。单个操作是原子性的。多个操作也支持事务,即原子性,通过MULTI和EXEC指令包起来。
    • 丰富的特性 – Redis还支持 publish/subscribe, 通知, key 过期等等特性。

Application

  • (1) Cache
    • Replacement - LRU LFU
    • Pre-load
  • (2) Message Broker
    • MQ
  • (3) Database

Redis Install & Config

  • Install Redis with MAC
1
2
3
4
5
6
7
8
9
10
# Install
brew install redis

# Start
brew services start redis
# or redis-server /usr/local/etc/redis.conf

# CMD
~ % redis-cli
127.0.0.1:6379>

Try CMD here : Redis Interactive Online

Redis in Python

The Python interface to the Redis key-value store.

Install

1
sudo pip install redis

Connection

Opening a Connection to Redis Using redis-py

1
2
3
4
5
6
#The following code creates a connection to Redis using redis-py:
import redis
redis_client = redis.Redis(
host = 'hostname',
port = port # redis default port 6379
)

  • redis提供两个类Redis和StrictRedis用于实现Redis的命令,StrictRedis用于实现大部分官方的命令,并使用官方的语法和命令,Redis是StrictRedis的子类,用于向后兼容旧版本的redis-py。

  • redis连接实例是线程安全的,可以直接将redis连接实例设置为一个全局变量,直接使用。如果需要另一个Redis实例(or Redis数据库)时,就需要重新创建redis连接实例来获取一个新的连接。同理,python的redis没有实现select命令。

1
2
3
4
5
# set, get
r.set('name', 'junxi') # key是"foo" value是"bar" 将键值对存入redis缓存
print(r['name'])
print(r.get('name')) # 取出键name对应的值
print(type(r.get('name')))
Share

Binary Search

Given a sorted array of n elements, write a function to search a given element x in array.

  • Search算法里的入门算法,其实很多高级算法的思路都是以二分查找为基本思想引出的,特别是那些和logn相关的算法,比如说自平衡二叉树

经典二分查找


  • 有序, 数组
  • Why 链表不可以?Why 无序不可以?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
class Solution:
def findPosition(self, nums, target):
if not nums:
return -1

# 1 Two Pointers 指向可行解的范围
start, end = 0, len(nums)-1

while start <= end:
# 2 Median Index
mid = start + (end - start) // 2

# 3 Check Median Index Val
if nums[mid] == target:
return mid
elif nums[mid] > target:
end = mid - 1 # Adjust end
else:
start = mid + 1 # Adjust start
return -1

74. Search a 2D Matrix

  • 一维Array和二维Matrix的转换
    • index => (x, y)
    • x = index / col
    • y = index % col
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
class Solution(object):
def searchMatrix(self, matrix, target):
if not matrix:
return False
m, n = len(matrix), len(matrix[0])
start, end = 0, m*n-1
while start <= end:
mid = start + (end - start) / 2
row, col = mid / n, mid % n
if matrix[row][col] < target:
start = mid + 1
elif matrix[row][col] > target:
end = mid - 1
else:
return True
return False

九章算法二分查找模板

  • while 循环 ?
    • while start + 1 < end, 留两个值做post processing
    • while start < end, 留一个值做post processing
    • while start < = end, 不留值
  • 缩小查找范围 ?
    • start = mid 还是 start = mid + 1
    • end = mid 还是 end = mid + 1
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
def findPosition(self, A, target):
if(len(A) == 0):
return -1

start = 0
end = len(A) - 1

while(start + 1 < end):
mid = start + (end - start) / 2
if(A[mid] == target):
return mid
if(A[mid] < target):
start = mid
else:
end = mid

#post processing
if (A[start] == target):
return start
elif (A[end] == target):
return end
else:
return -1
  • Double Check 小技巧:
    • 当前这个mid对应的值是不是 可能是要找的结果,mid + 1 还是 mid ?
    • start = mid 一定要用 while start + 1 < end !!!
    • 要找的结果是不是一定在数组当中!start < end 还是 start < = end

边界系列


  • Input是标准的 Sorted Array(单调增或者单调减)
    • First or Last of Occrance!
    • 大于等于,小于等于,大于,小于,等于
  • 每次淘汰的空间要确定是不属于结果范围的!

278. First Bad Version

  • 000000011111111
  • Find The First Occrance of an Element
    • 找的元素在不在数组里?while start < end 还是 while start < = end
    • How about Last Good Version?
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
# 写法1. 这种写法条件是target存在,且start = mid + 1
class Solution(object):
def firstBadVersion(self, n):
start, end = 1, n
while start < end:
mid = start + (end - start) / 2
ret = isBadVersion(mid)
if ret:
end = mid
else:
start = mid + 1
return start

# 写法2. 九章算法模板 - 预留两位做post processing
class Solution(object):
def firstBadVersion(self, n):
start, end = 1, n

# 预留两位最后处理
while start + 1 < end:
mid = start + (end - start) / 2
ret = isBadVersion(mid)
if ret:
end = mid
else:
start = mid + 1

# Post Processing
if isBadVersion(start):
return start
else:
return end

35. Search Insert Position

  • 找 > = target 的 first occrence
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
class Solution(object):
def searchInsert(self, nums, target):
if not nums:
return 0
if target > nums[-1]:
return len(nums)

start, end = 0, len(nums)-1
while start + 1 < end:
mid = start + (end - start) / 2
if nums[mid] >= target:
end = mid
else:
start = mid + 1

# Post Precessing
if nums[start] >= target:
return start
else:
return end

34. Search for a Range

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
class Solution(object):
def searchRange(self, nums, target):
if not nums:
return [-1, -1]

# First Occurance
start, end = 0 , len(nums) - 1
while start + 1 < end:
mid = start + (end - start) / 2
if nums[mid] >= target:
end = mid
else:
start = mid + 1

if nums[start] == target:
first = start
elif nums[end] == target:
first = end
else:
return [-1, -1]

# Last Occurance
start, end = 0, len(nums) - 1
while start + 1 < end:
mid = start + (end - start) / 2
if nums[mid] <= target:
start = mid
else:
end = mid - 1

if nums[end] == target:
return [first, end]
else:
return [first, start]

旋转数组系列


  • 只有Sorted Array才能做Binary Search
    • Input数组不再是单调递增或者单调递减
  • 每次淘汰的空间要确定是不属于结果范围的!

162. Find Peak Element

  • 根据要找的Target条件缩小Search的范围!
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# 模版解法
class Solution(object):
def findPeakElement(self, nums):
if len(nums) <= 1: return 0

start, end = 0 , len(nums)-1
while start + 1 < end:
mid = start + (end - start) / 2
if nums[mid] < nums[mid+1]:
start = mid + 1
else:
end = mid

if nums[start] > nums[end]:
return start
else:
return end
1
2
3
4
5
6
7
8
9
10
11
12
13
class Solution(object):
def findPeakElement(self, nums):
if len(nums) <= 1: return 0

start, end = 0 , len(nums)-1
while start < end:
mid = start + (end - start) / 2
if nums[mid] < nums[mid+1]:
start = mid + 1
else:
end = mid

return start

153. Find Minimum in Rotated Sorted Array

  • Why nums[mid] 和 nums[end]比较?
1
2
3
4
5
6
7
8
9
10
11
# 模版解法
class Solution(object):
def findMin(self, nums):
start, end = 0, len(nums)-1
while start + 1 < end:
mid = start + (end - start) / 2
if nums[mid] > nums[end]:
start = mid + 1
else:
end = mid
return min(nums[start], nums[end])
1
2
3
4
5
6
7
8
9
10
class Solution(object):
def findMin(self, nums):
start, end = 0, len(nums)-1
while start < end:
mid = start + (end - start) / 2
if nums[mid] > nums[end]:
start = mid + 1
else:
end = mid
return nums[start]

154. Find Minimum in Rotated Sorted Array II

  • What if duplicates are allowed?
  • Worst Case O(n)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
# nums[mid] == nums[end] ?
class Solution(object):
def findMin(self, nums):
start, end = 0, len(nums)-1
while start < end:
mid = start + (end - start) / 2
if nums[mid] < nums[end]:
end = mid
elif nums[mid] > nums[end]:
start = mid + 1
else:
end -= 1
return nums[start]

# 套模板
class Solution(object):
def findMin(self, nums):
start, end = 0, len(nums)-1
while start + 1 < end:
mid = start + (end - start) / 2
if nums[mid] < nums[end]:
end = mid
elif nums[mid] > nums[end]:
start = mid + 1
else:
end -= 1
return min(nums[start], nums[end])

33. Search in Rotated Sorted Array

  • 1 把Array先根据nums[end]分成在前段还是后段
  • 2 再根据nums[mid]的大小
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
class Solution(object):
def search(self, nums, target):
start, end = 0, len(nums)-1
while start <= end:
mid = start + (end - start) / 2
if nums[mid] == target:
return mid
# 前还是后,根据 num[end] 根据nums[start]也行
if nums[mid] <= nums[end]:
if nums[mid] < target <= nums[end]:
start = mid + 1
else:
end = mid - 1
else:
if nums[start] <= target < nums[mid]:
end = mid - 1
else:
start = mid + 1
return -1

81. Search in Rotated Sorted Array II

  • nums may contain duplicates
  • nums[mid] == nums[start] 或者 nums[mid] == nums[end]的时候无法判断nums[mid]是在前段还是后段
    • start += 1 缩减范围, Worst Case : O(n)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
class Solution(object):
def search(self, nums, target):
start, end = 0, len(nums)-1
while start <= end:
mid = (start + end) / 2
if nums[mid] == target:
return True

if nums[mid] < nums[start]:
if nums[mid] < target <= nums[end]:
start = mid + 1
else:
end = mid - 1
elif nums[mid] > nums[start]:
if nums[start] <= target < nums[mid]:
end = mid - 1
else:
start = mid + 1
else:
start += 1
return False

数值二分型


  • 典型的二分法: 输入是数组,双指针分别是数组的index
  • Input不再是Array,不再以index,而是以value来做二分

374. Guess Number Higher or Lower

1
2
3
4
5
6
7
8
9
10
11
12
13
class Solution(object):
def guessNumber(self, n):
start, end = 1, n
while start < end:
mid = start + (end - start) // 2
ret = guess(mid)
if ret == 0:
return mid
elif ret < 0:
end = mid - 1
else:
start = mid + 1
return start

367. Valid Perfect Square

  • ret = mid * mid 来判断范围
1
2
3
4
5
6
7
8
9
10
11
12
13
class Solution(object):
def isPerfectSquare(self, num):
start, end = 1, num
while start <= end:
mid = start + (end - start) / 2
ret = mid * mid
if ret > num:
end = mid - 1
elif ret < num:
start = mid + 1
else:
return True
return False

69. Sqrt(x)

  • 在0~x中找最大的数r,满足r * r < = x
  • Why start + 1 < end ?
  • 剩余的搜索范围都是满足搜索条件的!
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
class Solution(object):
def mySqrt(self, x):
start, end = 0, x
while start + 1 < end:
mid = start + (end - start) / 2
ret = mid * mid
if ret > x:
end = mid - 1
elif ret < x:
start = mid
else:
return mid

if end * end <= x:
return end
else:
return start
Share

Segment Tree

Segment Tree 线段树
线段树参考整理自:九章算法线段树教程

1 线段树是什么?

Google常考数据结构,国内也经常问这个。

  • 线段树是一种高级数据结构,也是一种树结构,准确的说是二叉树,它能够高效的处理区间修改查询等问题。
  • 线段树是一种二叉搜索树,与区间树相似,它将一个区间划分成一些单元区间,每个单元区间对应线段树中的一个叶结点
  • 线段树中的每一个非叶子节点[a,b],它的左儿子表示的区间为[a,(a+b)/2]右儿子表示的区间为[(a+b)/2+1,b]。因此线段树是平衡二叉树,最后的子节点数目为N,即整个线段区间的长度。

2 线段树的创建

因为每次将区间的长度一分为二,所有创造的节点个数,即底层有n个节点,那么倒数第二次约n/2个节点,倒数第三次约n/4个节点,依次类推:

1
2
3
n + 1/2 * n + 1/4 * n + 1/8 * n + ...
= (1 + 1/2 + 1/4 + 1/8 + ...) * n
= 2n

所以构造线段树的时间复杂度和空间复杂度都为O(n)

LintCode 201. Segment Tree Build

线段树的创建,其实就是按照区间的index进行二分,然后Recursion的定义!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
"""
Definition of SegmentTreeNode:
class SegmentTreeNode:
def __init__(self, start, end):
self.start, self.end = start, end
self.left, self.right = None, None
"""
class Solution:
def build(self, start, end):
if start > end: return
root = SegmentTreeNode(start, end)
if start == end: return root
mid = (start + end ) / 2
root.left = self.build(start, mid)
root.right = self.build(mid + 1, end)
return root

LintCode 439. Segment Tree Build II

每个节点node除了有区间index的信息外,还包括其他信息,比如区间内的最大值。Node(start, end, val)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
"""
Definition of SegmentTreeNode:
class SegmentTreeNode:
def __init__(self, start, end, max):
self.start, self.end, self.max = start, end, max
self.left, self.right = None, None
"""
class Solution:
def build(self, A):
if not A: return
return self.buildTree(0, len(A)-1, A)

def buildTree(self, start, end, A):
if start > end: return
root = SegmentTreeNode(start, end, A[start])
if start == end: return root
mid = (start + end) / 2
root.left = self.buildTree(start, mid, A)
root.right = self.buildTree(mid+1, end, A)
# Post Order update Max
if root.left and root.left.max > root.max:
root.max = root.left.max
if root.right and root.right.max > root.max:
root.max = root.right.max
return root

如果需要区间的最小值:

1
root.min = Math.min(root.left.min, root.right.min);

如果需要区间的和:

1
root.sum = root.left.sum + root.right.sum;

3 线段树的更新

更新是从叶子节点一路走到根节点, 去更新线段树上的值。因为线段树的高度为log(n),所以更新序列中一个节点的复杂度为log(n)。

LintCode 203. Segment Tree Modify

给一个Maximum Segment Tree, 更新某个index的value。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
class Solution:
def modify(self, root, index, value):
if not root: return
if root.start == root.end:
root.val = value
root.max = value
return
mid = (root.start + root.end) / 2
if index <= mid:
self.modify(root.left, index, value)
else:
self.modify(root.right, index, value)
if root.right and root.left:
root.max = max(root.right.max, root.left.max)

4 线段树的查询

构造线段树的目的就是为了更快的查询

  • 给定一个区间,要求区间中最大的值。线段树的区间查询操作就是将当前区间分解为较小的子区间,然后由子区间的最大值就可以快速得到需要查询区间的最大值。
  • 任意长度的线段,最多被拆分成logn条线段树上存在的线段,所以查询的时间复杂度为O(log(n))

LintCode 202. Segment Tree Query

1
2
3
4
5
6
7
8
9
10
11
12
13
class Solution:
def query(self, root, start, end):
if not root: return
if root.start == start and root.end == end:
return root.max
mid = (root.start + root.end) / 2
if start > mid:
return self.query(root.right, start, end)
elif end <= mid:
return self.query(root.left, start, end)
else:
return max(self.query(root.left, start, mid),
self.query(root.right, mid+1, end))

5 线段树的应用

线段树的基本应用

  • 支持动态更改数组一个元素的值 O(logn)
  • 区间的和、最大值、最小值 O(logn)
  • 创建,更新,求和或者求最大最小,只有这三个function!

307. Range Sum Query - Mutable

区间求和,如果原序列不变的话直接用preSum前缀和就可以了,但是如果序列可变的话update前缀和的复杂度就变成O(n)

LintCode 249. Count of Smaller Number before itsel

统计数组中每个元素后面比自己小的数。

  • 初始化[0, max(nums)]的数组全部都是0
  • 把数组中的数值当作index,统计数量就转化成了求区间和

其他解法:Binary Indexed Tree,Binary Search Tree

315. Count of Smaller Numbers After Self

  • 值为负数或者最大值特别大的时候用Segment Tree,val无法作为index,导致 非常不合适!
  • 所以这个题得考虑别的解决方案
Share