Differences

This shows you the differences between two versions of the page.

public:codein:activity_extractor_technical_docs [2016/12/18 23:38]
manveer_b created
public:codein:activity_extractor_technical_docs [2016/12/24 03:47]
manveer_b
Line 3: Line 3:
 This page describes how the service modules were coded and how to add support for a new service.
  
-==== Module Documentation ==== +===== Module Information =====
-=== Main Module === +==== ActivityExtractor.py ====
 This module is responsible for processing the parameters passed through the command line and calling the appropriate streaming service.
  
Line 14: Line 14:
     'email': self.email,
     'password': self.password,
-    'chrome_args': self.chrome_args,
     'user': self.user
   }
   ​   ​
-**url:** The url the driver initially navigates to.\\ +url: The url the driver initially navigates to.\\ 
-**email:** The email required to log into the service.\\ +email: The email required to log into the service.\\ 
-**password:** The password associated with the email.\\ +password: The password associated with the email.\\ 
-**chrome_args:​** Potential arguments that can be used when initializing the chromedriver.\\ +user: //(Only required for Netflix)// The profile name the user wishes to retrieve viewing activity from.\\
-**user:** //Only required for Netflix// The profile name the user wishes to retrieve viewing activity from.\\+
  
-=== Hulu === +==== common.py ​==== 
-This module gets viewing activity from Hulu\\ +Contains modules common to all services. 
-== getActivity() == +== > output_activity(SERVICE,​ activity_list) == 
-This function is called ​from the Main Module. It's main purpose is to initialize the process and call loginHulu() +Module to output activity into a .txt file.\\ 
-== loginHulu() ==+\\ 
 +Accepts 2 parameters '​SERVICE'​ and '​activity_list':​\\ 
 +SERVICE: Name of the service calling the function.\\ 
 +activity_list:​ List of viewing activity ​extracted ​from the streaming service.\\ 
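
A minimal sketch of what a function like output_activity() could look like, assuming one activity entry is written per line. The file name pattern and the extra `directory` argument are assumptions made for illustration and may not match the real common.py.

```python
import os
import tempfile


def output_activity(service, activity_list, directory="."):
    """Write one activity entry per line to '<service>_activity.txt'.

    The file name pattern and the 'directory' argument are assumptions
    made for this sketch; the real common.py may differ.
    """
    path = os.path.join(directory, "{}_activity.txt".format(service.lower()))
    with open(path, "w", encoding="utf-8") as handle:
        for entry in activity_list:
            handle.write(entry + "\n")
    return path


# Usage: write two sample entries into a temporary directory.
sample_dir = tempfile.mkdtemp()
print(output_activity("Hulu", ["Show A - Episode 1", "Show B - Episode 2"], sample_dir))
```

Returning the path is a convenience for the caller and for testing; the documented function only needs to write the file.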
 + 
 +==== hulu.py ==== 
 +Gets viewing activity from Hulu. 
 +== > get_activity() == 
 +Called from the Main Module. Its main purpose is to initialize the process and call login_hulu().
 +== > login_hulu() ==
 First this function creates an instance of Chrome and passes potential arguments to the driver.\\
-It then navigates to www.hulu.com and logs in with the user credentials. +It then navigates to www.hulu.com and logs in with the user credentials. It then calls navigate_site().
-=== YouTube ​===+== > navigate_site() ​== 
 +The main purpose of this function is to navigate to the '​History'​ page on Hulu. 
 +== > navigate_pages() == 
 +Depending on the length of the user's viewing history there may be multiple pages of viewing history.\\ 
 +This function calls get_page_activity() for every page of viewing history. Then calls common.output_activity().\\ 
 +== > get_page_activity() == 
 +Gets all the viewing activity on the current viewing history page. Also displays a progress bar to the user.\\
  
-We first require the page source of the video\\  +==== amazon.py ====
-The function createSoupObject() is responsible for this. For this purpose we use the requests module. We parse the HTML with the help of BeautifulSoup library. +Gets viewing activity from Amazon.
-The getTitle function returns the title of the video. This is also used for naming the file. +== > get_activity() ==
-  <title>VIDEO NAME - YouTube</title> +Called from the Main Module. Its main purpose is to initialize the process and call login_amazon().
-The function getRawSubtitleLink returns the Raw Link which is in encoded format. This is still an incomplete URL. The variable UglyString contains the complete URL. The link is present in the BeautifulSoup. +== > login_amazon() ==
-We now prompt the user to choose the desired language from the available choices. The available subtitle language choices are extracted from the UglyString.\\ +First this function creates an instance of Chrome and passes potential arguments to the driver.\\
-Based on the chosen language, the corresponding language code is indexed from the language dictionary. This language code is appended to the decoded Link. +It then navigates to https://www.amazon.com/gp/sign-in.html and logs in with the user credentials. It then navigates to the viewing history page by passing a url to the driver. Calls navigate_pages().
-\\ +== > navigate_pages() ==
-This final URL contains the subtitles as an XML file. +Depending on the length of the user's viewing history there may be multiple pages of viewing history.\\
-Now, the XML file is converted to .srt file using BeautifulSoup function calls. +This function calls get_page_activity() for every page of viewing history. Then calls common.output_activity().\\
-\\ +== > get_page_activity() ==
 +Gets all the viewing activity on the current viewing history page.\\
  
-=== Amazon ​===+==== netflix.py ​==== 
 +Gets viewing activity from Netflix. 
 +== > get_activity() == 
 +Called from the Main Module. Its main purpose is to initialize the process and call login_netflix().
 +== > login_netflix() ==
 +First this function creates an instance of Chrome and passes potential arguments to the driver.\\ 
 +It then navigates to https://​www.netflix.com/​Login and logs in with the user credentials. It then calls get_active_profile() 
 +== > get_active_profile() == 
 +Selects user profile based on profile name present in parameters['​user'​]. Calls navigate_site() 
 +== > navigate_site() == 
 +Calls hover_click() then clicks the '​Viewing Activity'​ link once hover_click() has navigated to the user's account page. Then calls scroll_to_bottom() 
 +== > hover_click() == 
 +Hovers on the profile icon in the top right corner of the Netflix homepage. Then clicks on 'Your Account'​ on the dropdown menu that appears. Returns True or False depending on whether the process was successful. 
 +== > scroll_to_bottom() == 
 +When the user's viewing activity is long, Netflix initially displays only a portion of it. This function is called to make Netflix display the full list.\\
 +Scrolls to the bottom of the page and waits for Netflix to load the next dynamic page of activity. This may be repeated multiple times until all of the activity is displayed. Calls get_page_activity().
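
The scroll-and-wait loop described for scroll_to_bottom() can be sketched as follows. This is an illustration under stated assumptions, not the module's actual code: it assumes the page is fully loaded once its height stops growing, and `pause`, `max_rounds` and the `FakeDriver` stand-in are hypothetical names. The only driver call used is `execute_script()`, which Selenium WebDriver objects provide.

```python
import time


def scroll_to_bottom(driver, pause=2.0, max_rounds=50):
    """Scroll until the page height stops growing; return the rounds used."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the site time to load the next batch
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return rounds  # nothing new was loaded, so the list is complete
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Hypothetical stand-in for a Selenium driver: the page grows on the
    first three scrolls and then stops."""

    def __init__(self):
        self.height = 1000
        self.loads_left = 3

    def execute_script(self, script):
        if script.startswith("window.scrollTo"):
            if self.loads_left > 0:
                self.height += 1000
                self.loads_left -= 1
            return None
        return self.height


# Usage with the stand-in driver; a real run would pass a Chrome driver.
print(scroll_to_bottom(FakeDriver(), pause=0))
```

The `max_rounds` cap is a design choice for the sketch so that a page which never stops growing cannot loop forever.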
 +== > get_page_activity() == 
 +Gets all viewing activity from the page. Displays a progress bar to the user. Calls common.output_activity()\\
  
-The subtitle URL for Amazon is present in this URL -+===== New Service Instructions ===== 
 +In order to add a new service to the platform, follow these steps. 
 +==== Instructions ==== 
 +**1. Add your service and its parameters to the file 'userconfig.ini'**\\
 +Follow ​this format when adding your service:
  
-"​PreURL":"​https://​atv-ps.amazon.com/​cdp/​catalog/​GetPlaybackResources?",​ +  [SERVICE_NAME] 
- "​asin" ​                             : ""​ , +  ​url      = service_login_page_url 
- "​consumptionType" ​                  : "​Streaming"​ , +  ​email    = [email protected].com 
- "desiredResources"                  : "SubtitleUrls" , +  password = test
- "​deviceID" ​                         : "​b63345bc3fccf7275dcad0cf7f683a8f"​ , +
- "​deviceTypeID" ​                     : "​AOAGZA014O5RE"​ , +
- "​firmware" ​                         : "​1"​ , +
- "​marketplaceID" ​                    : "​ATVPDKIKX0DER"​ , +
- "​resourceUsage" ​                    : "​ImmediateConsumption"​ , +
- "​videoMaterialType" ​                : "​Feature"​ , +
- "​operatingSystemName" ​              : "​Linux"​ , +
- "​customerID" ​                       : ""​ , +
- "​token" ​                            : ""​ , +
- "​deviceDrmOverride" ​                : "​CENC"​ , +
- "​deviceStreamingTechnologyOverride"​ : "​DASH"​ , +
- "​deviceProtocolOverride" ​           : "​Https"​ , +
- "​deviceBitrateAdaptationsOverride" ​ : "​CVBR,​CBR"​ , +
- "​titleDecorationScheme" ​            : "​primary-content"​ +
-\\ +
-The primary parameters we need to get are ASIN ID, customerID and TOKEN. These are obtained from the config file. \\ +
-The config file is generated from the setup.py file. The setup.py file takes the users login and password and generates the config file. +
-The ASINID is taken from the URL directly. ​ +
-  ​https://​www.amazon.com/​dp/​B019DSWVYC/?​autoplay=1 +
- +
-Now, add the parameters to the dictionary and generate the final URL. The final URL will look something like this - +
-  ​https://​atv-ps.amazon.com/​cdp/​catalog/​GetPlaybackResources?&​consumptionType=Streaming&​titleDecorationScheme=primary-content&​firmware=1&​marketplaceID=ATVPDKIKX0DER&​resourceUsage=ImmediateConsumption&​deviceTypeID=AOAGZA014O5RE&​videoMaterialType=Feature&​token=6463643hhhdfhdhf7374747&​deviceBitrateAdaptationsOverride=CVBR,​CBR&​operatingSystemName=Linux&​deviceProtocolOverride=Https&​deviceID=b63345bc3fccf7275dcad0cf7f683a8f&​deviceStreamingTechnologyOverride=DASH&​asin=B0141BACGU&​desiredResources=SubtitleUrls&​customerID=A1234GH2343&​deviceDrmOverride=CENC+
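
As an illustration of how a section in the userconfig.ini format could be turned into the parameters dictionary, here is a minimal sketch using Python's standard configparser module. The helper name `load_parameters`, the sample values and the choice to read from a string are assumptions; ActivityExtractor.py may read the file differently.

```python
import configparser

# Sample text in the userconfig.ini format; the values are placeholders.
SAMPLE = """
[netflix]
url      = https://www.netflix.com/Login
email    = user@example.com
password = test
user     = Alice
"""


def load_parameters(config_text, service):
    """Return the parameters dictionary for one service section."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    section = parser[service]
    return {
        "url": section["url"],
        "email": section["email"],
        "password": section["password"],
        # 'user' is only needed for Netflix, so it may be absent.
        "user": section.get("user"),
    }


# Usage: build the dictionary for the sample Netflix section.
print(load_parameters(SAMPLE, "netflix"))
```

Section names are case-sensitive in configparser, so the name passed in must match the section header exactly.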
   ​   ​
-This is where the Subtitle URL is present +**2. Create a .py file for your service.**\\
-We get JSON response from this URL and it contains a subtitle URL with .dfxp format. +Take a look at hulu.py, amazon.py or netflix.py as a reference.\\
-We request that subtitle URL and download the subtitles.+
 \\ \\
-With BeautifulSoup and Python regex we convert this dfxp to .srt format (File - Amazon_XmlToSrt.py) +Your file must have a class containing all of the functions required to log in and get viewing activity. The class should be named like 'SERVICE_NAMEActivityExtractor'.\\
 +Example: NetflixActivityExtractor\\
 \\ \\
 +Your class'​s init() function needs to accept an argument that will contain the parameters which ActivityExtractor.py will pass.\\
 +Here is the general format:
  
-=== BBC === +  def __init__(self, parameters):
- +    self.parameters = parameters
-We first need to extract the episode ID from the URL. Sample URL - +    self.driver = None
-  ​http://​www.bbc.co.uk/​iplayer/​episode/​p03rkqcv/​shakespeare-lives-the-works +    ... 
-   +     
-The episode ID is p03rkqcv.\\ +The main things your function should accomplish:\\ 
-The episode PID and episode Title (for naming the file) are present in the URL - + - Log into streaming service\\
-  ​http://​www.bbc.co.uk/​programmes/<​episode_id>​.xml+ - Navigate to viewing activity page\\ 
 + - Retrieve viewing activity\\ 
 + - Display progress bar (if possible)\\ 
 + - Call common.output_activity() to output viewing activity into a .txt file\\
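
The checklist above can be sketched as a class skeleton. This is a hedged outline for a hypothetical service called 'Example': the method names follow the pattern of the existing modules, but the bodies are placeholders and do not use a real browser. A real module would drive Chrome in each step and finish by calling common.output_activity().

```python
class ExampleActivityExtractor:
    """Skeleton for a hypothetical service named 'Example'."""

    def __init__(self, parameters):
        self.parameters = parameters
        self.driver = None
        self.activity_list = []

    def get_activity(self):
        """Entry point: run each step in order."""
        self.login_example()
        self.navigate_site()
        self.get_page_activity()
        # A real module would now call:
        # common.output_activity('example', self.activity_list)
        return self.activity_list

    def login_example(self):
        # A real module would start Chrome here and log in with
        # self.parameters['email'] and self.parameters['password'].
        self.driver = "driver placeholder"

    def navigate_site(self):
        # A real module would open the viewing history page here.
        pass

    def get_page_activity(self):
        # A real module would read titles from the page; this sketch
        # appends a fixed entry so the flow can be run end to end.
        self.activity_list.append("Example Show - Episode 1")


# Usage with placeholder parameters.
extractor = ExampleActivityExtractor({"url": "u", "email": "e", "password": "p"})
print(extractor.get_activity())
```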
 \\ \\
-The subtitle URL is present in the following link - +common.output_activity() accepts 2 parameters. The first is the name of the viewing service, the second is a list containing all of the viewing activity. Make sure to 'import common' to use the function.\\
-  http://open.live.bbc.co.uk/mediaselector/5/select/version/2.0/mediaset/pc/vpid/<pid> +
-The PID is nothing but the episode PID obtained above. There are multiple PID's present. So, we try all the URL's until the page request is successful.
 \\ \\
-If the request is successful we get the subtitle link by parsing the XML page using Beautiful Soup. +**3. Add your service into ActivityExtractor.py.**\\
-The subtitles obtained are in XML format. They are converted to .srt by using BeautifulSoup function calls and regex. The conversion takes place in the file Bbc_XmlToSrt.py +In 'ActivityExtractor.py', create an import statement to import your service class from your service file.\\
-\\ +It should be something like this: 
-=== CrunchyRoll ===+ 
 +  from SERVICE_NAME import SERVICE_NAMEActivityExtractor
  
 +Next you have to add your service into supported_services.\\
 +In the ActivityExtractor,​ the init() function has a dictionary named '​self.supported_services'​. Add your service into the dictionary following the format of the other services.\\
 +It should look something like this:\\
  
-This is one of the methodologies to get the subtitles ID.  +  self.supported_services = {
-In the Beautiful soup text it can be found that every video has this parameter. +    'amazon': AmazonActivityExtractor,
-<code> +    'hulu': HuluActivityExtractor,
-                  <div>Subtitles:  +    'netflix': NetflixActivityExtractor,
-   <span class="showmedia-subtitle-text"> +    'SERVICE_NAME': SERVICE_NAMEActivityExtractor
-     <img src="http://static.ak.crunchyroll.com/i/country_flags/us.gif"/>  +  }
-     <a href="/​naruto-shippuden/​episode-464-ninshu-the-ninja-creed-696237?​ssid=206027"​ title="​English (US)">​English (US)</​a>​+
-     <img src="​http://​static.ak.crunchyroll.com/​i/​country_flags/​sa.gif"/>​  +
-     <a href="/​naruto-shippuden/​episode-464-ninshu-the-ninja-creed-696237?​ssid=206015"​ title="​العربية">​العربية</​a>​+
-     <img src="​http://​static.ak.crunchyroll.com/​i/​country_flags/​it.gif"/>​  +
-     <a href="/​naruto-shippuden/​episode-464-ninshu-the-ninja-creed-696237?​ssid=206733"​ title="​Italiano">​Italiano</​a>​,  +
-     <img src="​http://​static.ak.crunchyroll.com/​i/​country_flags/​de.gif"/>​ +
-     <a href="/​naruto-shippuden/​episode-464-ninshu-the-ninja-creed-696237?​ssid=206033"​ title="​Deutsch">​Deutsch</​a>​ +
-   </​span>​ +
- </​div>​ +
-</​code>​ +
-We need to obtain all the SSID's. We return all the id's as a list along with the respective Language title attached. \\ +
-For the above HTML we should have this - +
-<​nowiki>​ +
-[['​206027',​ '​English (US)'​],​ ['​206015',​ '​العربية'​],​ ['​206733',​ '​Italiano'​],​ ['​206033',​ '​Deutsch'​]] +
-</​nowiki>​ +
-\\ +
-We prompt the user to choose the language and based on the choice, we append the ID from the list obtained above.  +
-A sample subtitle URL, where a script_id(206027) has been appended to the base URL :  +
-  ​http://​www.crunchyroll.com/​xml/?​req=RpcApiSubtitle_GetXml&​subtitle_script_id=206027+
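
To show why the dictionary entry matters, here is a minimal sketch of how a main module might use supported_services to pick and run a service class. The `run_service` helper and the placeholder class are assumptions for illustration; the real ActivityExtractor.py may dispatch differently.

```python
class HuluActivityExtractor:
    """Placeholder standing in for the class imported from hulu.py."""

    def __init__(self, parameters):
        self.parameters = parameters

    def get_activity(self):
        return "hulu activity for " + self.parameters["email"]


# Maps the service name given on the command line to its class.
supported_services = {"hulu": HuluActivityExtractor}


def run_service(name, parameters):
    """Look up the service class by name, build it and run it."""
    key = name.lower()
    if key not in supported_services:
        raise ValueError("Unsupported service: " + name)
    extractor = supported_services[key](parameters)
    return extractor.get_activity()


# Usage with a placeholder email address.
print(run_service("Hulu", {"email": "user@example.com"}))
```

A service that is missing from the dictionary is rejected here, which is why step 3 is required before a new module can be called.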
   ​   ​
-The encrypted subtitles are extracted from the above URL. +**4. Test the program with your service and report any errors.**\\
-The decryption of these subtitles has been taken from another Open Source software : youtube-dl. +If your service works, create a pull request to the repository and it will be added. If any errors are thrown that you can't solve yourself, create an issue in the repository and we'll try to help you out.\\
- +
-=== Netflix === +
- +
-The user needs to input his username and password of Netflix in the userconfig.ini file. Netflix requires login to download the subtitles. \\ +
- +
-We use python-selenium browser to automate the process. +
-The first step is to login to Netflix ​with the config file information. Chrome WebDriver is used as the driver for selenium. \\ +
-After a successful login from selenium browser, we request for the video URL. \\ +
-The chrome Network tab gives a list of resources fetched from the server. We use the command : +
-  return window.performance.getEntries();​ +
-This command returns all the fetched URL's. It was observed that all the Netflix videos had this sub-string in common and it was unique: **/?o** \\ +
-So we query for **/?o** and let the browser fetch the resources until we find such a URL. If we do not find the URL before the time out, we exit the application. If such URL is found we save the URL and follow the standard procedure. \\ +
-We request ​the URL using requests module and save the file. \\ +
-The module //​Netflix_XmlToSrt.py//​ is used to convert XML to .srt format. +
-  +
- +
-=== FOX === +
- +
- +
-We first require ​the page source of the video. \\  +
-The function createSoupObject() is responsible for this. For this purpose we use the requests module. We parse the HTML with the help of BeautifulSoup library. +
- +
-The video URL follows a specific standard throughout.  +
-<​code>​ http://​www.fox.com/​watch/​684171331973/​7684520448 </​code>​ +
-We need to split and return "684171331973". This is the required contentID. \\ +
- +
-This is the alternative method to obtain the contentID.  +
-In the soup text there is a meta tag which also contains the video URL. This is helpful in case the user inputs a shortened URL. +
- +
-<​code>​ <meta content="​http://​www.fox.com/​watch/​684171331973/​7684520448"​ property="​og:​url"/>​ </​code>​ +
-As stated above we split the URL and return the required contentID: //684171331973// +
-The other parameters required for obtaining the subtitle URL are also present ​in the HTML page source. +
- +
-The required script content ​ looks like this- +
-<​code>​ +
- jQuery.extend(Drupal.settings,​ {"":​...............});​  +
-</​code>​  +
-  *We add everything to a new string after encountering the first "​{"​. +
-  *Remove the last parentheses ​and the semi-colon to create a valid JSON---- '​);'​+
 \\ \\
-The JSON has the standard format and the required parameters follow ​this naming.  +**5. Add your service to this documentation page.**\\ 
-The json content : +Contact Carlos ​to get login credentials for this page and add your service following ​the format of the other streaming services.\\
-<​code>​  +
-{"​foxProfileContinueWatching":​{"​showid":"​empire","​showname":"​Empire"​},​.............. +
-"​foxAdobePassProvider":​ {......,"​videoGUID":"​2AYB18"​}} +
-</​code>​ +
- +
-We use the json module ​to parse the json and extract ​the parameters namely //showid// , //​showname//​ , //​videoGUID//​+
 \\ \\
- +\\ 
-Sample Subtitle Links - +**For bug fixes create an issue on the repository**\\ 
-  http://​static-media.fox.com/​cc/​sleepy-hollow/​SleepyHollow_3AWL18_660599363942.srt +**For any other inquiries contact me at [email protected].com**
-  http://​static-media.fox.com/​cc/​sleepy-hollow/​SleepyHollow_3AWL18_660599363942.dfxp +
- +
-The standard followed is - +
-  http://​static-media.fox.com/​cc/​[showid]/​showname_videoGUID_contentID.srt +
-  http://​static-media.fox.com/​cc/​[showid]/​showname_videoGUID_contentID.dfxp +
- +
-Some Subtitle URL's follow this standard - +
-  http://​static-media.fox.com/​cc/​[showid]/​showname_videoGUID.dfxp +
-  http://​static-media.fox.com/​cc/​[showid]/​showname_videoGUID.srt +
- +
-So we store both URL's and check for both the varieties. +
-We request both the varieties of URL and save the subtitles file when a successful request is returned. +
- +
- +
-=== General rules === +
- +
-Each service has a unique way of fetching the subtitles from the server. We can get to know the methodology by following some steps - +
- +
-  ​*The easiest way is to first open the Developer tools in Chrome/​Firefox and check for XHR requests. Generally we find the subtitle URL's here. +
-  ​*The next step is to find out a general pattern in the subtitle URL's of that particular service. +
-  ​*If a pattern is found, it is most likely that we can request the subtitle page by forming the URL's from the required parameters.  +
-  ​*Generally, the parameters can be found in the HTML page source. We need to search for them and query the URL. +
-  *Sometimes the required parameters for the URL are found in some other links in JSON format. A quick check of the fetched JSON resources will reveal the availability of them. +
-  *For services such as Netflix, the parameters have some kind of hashing in them which is difficult to decrypt. In such cases we can use selenium browser and search for keywords like **.srt**, **.dfxp**, **cc**, **sub** +
-  *By checking for multiple videos we can find out common sub-strings in the subtitle URLs. These common sub-strings(have to be unique) can be used for querying the resources from selenium browser. +
-  *In most cases, the subtitle URL is fetched only if the user is logged in. So we first need to setup login and then go to the video URL in the WebDriver. +
-  *The subtitles can then be downloaded from the URLs.     +
- +
-== If you are a developer and want to add support for new services or fix bugs please feel free to send a pull request or contact me for further assistance. ==+
  • public/codein/activity_extractor_technical_docs.txt
  • Last modified: 2016/12/24 03:47
  • by manveer_b