2016-10-26

A good-enough crawler for Japanese N2 listening questions

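I had assumed that once the login from part one was working, there would be no hard problems left. Well, ha...
There were two main pitfalls: simulating clicks on elements with selenium, and processing images with PIL.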

###Pitfall 1
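This question bank serves questions in groups of five: you finish a group, submit it, the site scores you, and you move on to the next group. My original plan was to have the program answer A to every question, submit, move on to the next group, and scrape the questions along the way. But on the second question, simulating a click on option A raised an error saying that the element "may not be unique or may not be visible". I had already added time.sleep(5) and checked the element's uniqueness with Chrome's developer tools, and even an expert senior classmate could not work out why.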
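A workaround often used for this kind of click failure, though not tried in this project, is to fall back to a JavaScript click when the normal click raises. The sketch below assumes `driver` is a selenium WebDriver and `element` a WebElement; `click_with_js_fallback` is a hypothetical helper name, and whether it would have cured the error above is untested.

```python
def click_with_js_fallback(driver, element):
    """Try a normal click; if it raises, click through JavaScript instead.

    Hypothetical helper: `driver` is assumed to be a selenium WebDriver and
    `element` a WebElement. Returns which path was taken, for logging.
    """
    try:
        element.click()
        return "native"
    except Exception:
        # Broad on purpose for this sketch; a selenium-specific exception
        # class would be more precise in real code.
        # execute_script hands `element` to the page as arguments[0].
        driver.execute_script("arguments[0].click();", element)
        return "javascript"
```

A JavaScript click skips selenium's own visibility check, so it can also hide a real layout problem; it is a way around the error, not a diagnosis.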
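If you can't fill in the pit, go around it. Luckily the site is designed rather dumbly: refreshing the page generates another 5 random questions, so refreshing over and over is enough to scrape a pile of questions (although this does produce duplicates).
So the relevant code looks like this: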

counter1 = 0
while counter1 < 20:
    driver.refresh()
    time.sleep(0.5)
    counter = 0
    while counter < 5:
        # the actual scraping goes here
        counter += 1
    counter1 += 1
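Since every refresh draws five random questions, repeats are expected. Below is a minimal sketch of dropping repeated audio URLs while keeping the order they were first seen in. It assumes that a repeated question always has the same audio URL, which has not been checked against the site; `unique_in_order` is a name made up for the example.

```python
def unique_in_order(urls):
    """Return the URLs with repeats removed, keeping first-seen order."""
    seen = set()
    result = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result


# Made-up URLs standing in for the scraped audio addresses.
collected = ["a.mp3", "b.mp3", "a.mp3", "c.mp3", "b.mp3"]
print(unique_in_order(collected))  # ['a.mp3', 'b.mp3', 'c.mp3']
```

The screenshots would need the same treatment, for example by skipping the screenshot whenever the audio URL has already been seen.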

###Pitfall 2
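The questions come in several forms: some are text, some are images. Parsing each kind properly is too much trouble, so I simply take a screenshot of every question and save it.
Here is the relevant code: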

fname = "C:/Users/何立/PycharmProjects/crawler/question/pic/"+str(counter1)+"question"+str(counter)+".png"
driver.save_screenshot(fname)
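The file name above is assembled by string concatenation. As a small sketch, the same naming scheme can be built with `pathlib`; `screenshot_path` is a hypothetical helper and the relative directory `question/pic` is only a stand-in for the real output folder.

```python
from pathlib import Path


def screenshot_path(base_dir, round_index, question_index):
    """Build <base_dir>/<round>question<index>.png, the naming used above."""
    name = "{}question{}.png".format(round_index, question_index)
    return Path(base_dir) / name


# "question/pic" is an assumed folder, not the author's actual path.
path = screenshot_path("question/pic", 3, 1)
print(path.name)  # 3question1.png
```

`driver.save_screenshot` expects a string file name, so pass `str(path)` when calling it.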

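However, this captures the whole screen; it would be much nicer to capture only a specific element. I asked Baidu, and there really was an answer.
Here is the link to the original article, which is very well written.
Below is the code I used this time: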

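from PIL import Image
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://stackoverflow.com/')
# 'hlogo' is only a placeholder id: locate whichever element you want to capture
element = driver.find_element_by_id('hlogo')
driver.save_screenshot('screenshot.png')
# bounding box of the element, in page coordinates
left = element.location['x']
top = element.location['y']
right = element.location['x'] + element.size['width']
bottom = element.location['y'] + element.size['height']
# crop the full-page screenshot down to the element and overwrite the file
im = Image.open('screenshot.png')
im = im.crop((left, top, right, bottom))
im.save('screenshot.png')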
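The four coordinates in the snippet above are just the element's top-left corner plus its width and height, and PIL's `Image.crop` takes them as a `(left, upper, right, lower)` tuple. A small sketch that isolates this arithmetic, with made-up numbers in place of a real element (`crop_box` is a name invented for the example):

```python
def crop_box(location, size):
    """Turn selenium-style location/size dicts into (left, top, right, bottom)."""
    left = int(location['x'])
    top = int(location['y'])
    right = left + int(size['width'])
    bottom = top + int(size['height'])
    return (left, top, right, bottom)


# Made-up values standing in for element.location and element.size.
box = crop_box({'x': 10, 'y': 20}, {'width': 300, 'height': 150})
print(box)  # (10, 20, 310, 170)
```

With a real element this would be used as `im.crop(crop_box(element.location, element.size))`.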

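The problem was locating the element (the question to be captured). The element's id is randomly generated, so after a refresh the old id no longer matches, and neither does the XPath. Fortunately, I was lucky enough to find an element whose id does not change; the captured area is a bit larger than I wanted, but that is acceptable.
Finally, here is the full code: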
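The stable element here was found by luck. One way to search for such an element more systematically, sketched below under assumptions, is to record the ids present on the page before and after a refresh and keep only those that appear every time. The ids in the example are made up, apart from `quiz_q_con` and `btnNext`, which are the ones the full script relies on; `stable_ids` is a hypothetical helper.

```python
def stable_ids(snapshots):
    """Return the ids present in every snapshot, sorted for stable output."""
    if not snapshots:
        return []
    common = set(snapshots[0])
    for ids in snapshots[1:]:
        common &= set(ids)
    return sorted(common)


# Made-up snapshots of element ids taken before and after a refresh.
before = ["q_8f3a", "opt_11c2", "quiz_q_con", "btnNext"]
after = ["q_b720", "opt_9d41", "quiz_q_con", "btnNext"]
print(stable_ids([before, after]))  # ['btnNext', 'quiz_q_con']
```

With the selenium version used in this post, each snapshot could probably be collected with something like `[e.get_attribute("id") for e in driver.find_elements_by_xpath("//*[@id]")]`, though that has not been tried on this site.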

from urllib.request import urlretrieve
from PIL import Image

from selenium import webdriver
import time

driver = webdriver.PhantomJS(executable_path='C:\\WEB\\phantomjs\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')

# Log in
driver.get("http://tiku.hujiang.com/n2/")
driver.set_window_size(1200, 900)

# First click "Log in" at the top right, which runs the JS that builds the login box
driver.find_element_by_id("btn_denglu").click()
login = driver.find_element_by_id("hp-login-normal")

NameField = login.find_element_by_name("username")
PasswordField = login.find_element_by_name("password")
submitButton = login.find_element_by_class_name("hp-btn")

NameField.send_keys("your_username")
PasswordField.send_keys("your_password")
submitButton.click()

time.sleep(1)
# After a successful login a pop-up appears; click its "Skip" button
skipButton = driver.find_element_by_class_name("hp-skip")
skipButton.click()

time.sleep(1)

# Go to the page where the questions are answered
driver.get("http://tiku.hujiang.com/s/93/")
downloadList = []
fname = ''
counter = 0
counter1 = 0

# Refresh in a loop and scrape the questions
while( counter1 < 20 ):
    driver.refresh()
    time.sleep(0.5)
    counter = 0
    while ( counter < 5 ):
        next_button = driver.find_element_by_id("btnNext")
        # downloadList stores the download URLs of the audio files
        downloadList.append(driver.find_element_by_class_name("pnl_flashPlayer").get_attribute("src"))
        # Click Next to switch to the next question
        next_button.click()
        time.sleep(0.2)

        # Output path and file name
        fname = "C:/Users/何立/PycharmProjects/crawler/question/pic/"+str(counter1)+"question"+str(counter)+".png"
        driver.save_screenshot(fname)

        element = driver.find_element_by_id("quiz_q_con")

        left = element.location['x']
        top = element.location['y']
        right = element.location['x'] + element.size['width']
        bottom = element.location['y'] + element.size['height']

        # Crop the screenshot with PIL
        im = Image.open(fname)
        im = im.crop((left, top, right, bottom))
        im.save(fname)
        print(counter1,counter)
        counter += 1
    counter1 += 1

# Download all the audio files together at the end
counter = 0
for download in downloadList:
    fname =  "C:/Users/何立/PycharmProjects/crawler/question/"+"question"+str(counter)+".mp3"
    counter += 1
    print(counter)
    urlretrieve(download, fname)

# quit() ends the whole session, including the PhantomJS process
driver.quit()
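The script above waits with fixed `time.sleep` calls, which are either too short on a slow page or longer than needed on a fast one. A minimal sketch of polling for a condition instead, using only the standard library; `wait_until` is a hypothetical helper, and selenium ships its own counterpart in `WebDriverWait`.

```python
import time


def wait_until(predicate, timeout=5.0, interval=0.1):
    """Poll `predicate` until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or None if the timeout is reached first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = predicate()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

In the script this might look like `wait_until(lambda: driver.find_elements_by_id("btnNext"))` in place of a fixed sleep after `driver.refresh()`, assuming the button only appears once the questions have loaded.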