GSoC 2016 Project ideas

Improving GitHub CI

Estimated task time: 1 month+ - Difficulty: moderate-high

Last year, as part of GSoC 2015, a first step towards CI on GitHub was made for the CCExtractor repository. CCExtractor has a huge repository of samples that are broken or were broken at some point. When new features are added, the validity of those samples must be checked. At this moment we have a GitHub bot, written in Python (50%), PHP (38%) and Bash (12%). This bot [1] checks GitHub every minute for notifications, and then runs the code changes either directly on the server (for trusted contributors) or in a VM (using VirtualBox) for all other cases.

The obviously better way would be to fully integrate the bot into GitHub using webhooks, as Jenkins [2] does, for example. However, running unknown, potentially dangerous code on the server cannot be allowed from a security point of view, hence the solution of running untrusted code in a virtual machine and running code from known contributors directly on the server. The drawback is speed: a test run directly on the server takes about 20 minutes, but a test in VirtualBox takes over 14 hours!
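The trusted/untrusted split described above can be sketched as a small dispatch function. This is only an illustration: the names `RunRequest`, `TRUSTED_AUTHORS` and `choose_runner` are hypothetical and are not taken from the existing bot.

```python
from dataclasses import dataclass

# Hypothetical allow-list; the real bot's list of trusted contributors is not shown here.
TRUSTED_AUTHORS = {"alice", "bob"}


@dataclass
class RunRequest:
    author: str      # GitHub login of the person who pushed the change
    commit_sha: str  # commit that should be tested


def choose_runner(request, trusted=TRUSTED_AUTHORS):
    """Return "server" for trusted authors and "vm" for everyone else."""
    # GitHub logins are case-insensitive, so compare in lower case.
    return "server" if request.author.lower() in trusted else "vm"
```

Whatever virtualization platform is chosen, only the "vm" branch would need to change.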

This is caused by VirtualBox, which virtualizes access to the files (shared folder/Samba network share), so this task involves researching a better alternative to VirtualBox. Some interesting keywords: KVM [3], OpenVZ [4], Linux-VServer [5], ...

After the research, the chosen solution should be implemented, either by modifying the current code to integrate the new virtualization platform or by starting from scratch. It must be integrated both with the sample platform [6] (built last year) and with the GitHub repository using webhooks [7].
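As a rough idea of what the webhook side could look like, the sketch below checks the signature GitHub attaches to a delivery and extracts the commit to test from a `pull_request` event. It assumes the webhook is configured with a shared secret and that the `X-Hub-Signature-256` header (HMAC-SHA256 of the request body) is used; the function names are made up for this example, and the HTTP server that receives the request is left out.

```python
import hashlib
import hmac
import json


def signature_is_valid(secret: bytes, body: bytes, header: str) -> bool:
    """Check a GitHub 'X-Hub-Signature-256' header of the form 'sha256=<hexdigest>'."""
    prefix = "sha256="
    if not header.startswith(prefix):
        return False
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison, done on bytes so any header content is accepted.
    return hmac.compare_digest(expected.encode(), header[len(prefix):].encode())


def extract_pull_request(body: bytes):
    """Return (number, head_sha) for opened/synchronize pull_request events, else None."""
    payload = json.loads(body)
    if payload.get("action") not in ("opened", "synchronize"):
        return None
    pr = payload["pull_request"]
    return pr["number"], pr["head"]["sha"]
```

A delivery that fails the signature check should be rejected before any code is fetched or run.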

[1] https://github.com/canihavesomecoffee/ccx_gitbot
[2] https://jenkins-ci.org/
[3] https://en.wikipedia.org/wiki/Kernel-based_Virtual_Machine
[4] https://en.wikipedia.org/wiki/OpenVZ
[5] https://en.wikipedia.org/wiki/Linux-VServer
[6] https://github.com/canihavesomecoffee/ccx_submissionplatform
[7] https://developer.github.com/webhooks/

Move our website over to GitHub & update it

Estimated task time: 1 week - Difficulty: easy

For years the CCExtractor [1] website has been hosted on the SourceForge platform. With our source code on GitHub, and the (recent) news about the problems with SourceForge [2][3], it would be easier and safer to fully move away from SourceForge.

Your task would be to create a new GitHub repository for this, add an updated website to it (a new design might be nice, using Bootstrap for example), and update the CCExtractor documentation on it.

[1] http://ccextractor.org/
[2] http://www.howtogeek.com/218764/warning-don%E2%80%99t-download-software-from-sourceforge-if-you-can-help-it/
[3] https://en.wikipedia.org/wiki/SourceForge#Controversies

Adding metadata for caption stream for real-time repository

Difficulty: low, but a lot of work

There is a server application that receives caption streams from TV tuners using CCExtractor. These captions can then be viewed on a web site in real time.

Video streams from the tuners sometimes have embedded metadata (EPG or XDS) about the current TV program (title, description, language, category, etc.). This metadata is transmitted to the repository and, in turn, shown to the end user. The problem is that not all TV channels have this metadata, so usually all we have is the channel's name (which is set manually by the admin) and the bare caption stream.

What you have to do is find services (online TV schedules, or anything similar) that provide APIs for accessing this metadata. Then you should create a program that fetches this metadata for a specified channel name and location (country and city) and stores it in the database. To speed up access to the metadata from internal repository services, it should be stored in advance (let's say for a day ahead). It would also be awesome to somehow get icons with the channels' logos as well.

In the end, you should have a daemon program that fetches metadata automatically and inserts it into our database. You will also need to modify the existing database architecture without breaking anything.
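One pass of such a daemon might look roughly like the sketch below. Everything specific here is an assumption made for illustration: the `programs` table and its columns, the use of SQLite, and the `fetch_schedule` callable, which stands in for whichever schedule provider is eventually chosen and is assumed to return dictionaries with `start` (a `datetime`), `title` and optionally `description`.

```python
import sqlite3
from datetime import datetime, timedelta


def init_db(conn):
    """Create the (hypothetical) table that holds pre-fetched program metadata."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS programs (
               channel TEXT NOT NULL,
               start_utc TEXT NOT NULL,
               title TEXT NOT NULL,
               description TEXT,
               PRIMARY KEY (channel, start_utc))"""
    )


def store_day_ahead(conn, channel, country, city, fetch_schedule, now):
    """Store entries for `channel` that start within 24 hours of `now`; return how many."""
    horizon = now + timedelta(days=1)
    rows = [
        (channel, p["start"].isoformat(), p["title"], p.get("description"))
        for p in fetch_schedule(channel, country, city)
        if now <= p["start"] < horizon
    ]
    with conn:  # commits on success, rolls back on error
        # The primary key makes repeated runs overwrite instead of duplicating rows.
        conn.executemany("INSERT OR REPLACE INTO programs VALUES (?, ?, ?, ?)", rows)
    return len(rows)
```

The daemon would call `store_day_ahead` for every configured channel on a timer; because of the `INSERT OR REPLACE`, running it more often than once a day is harmless.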

Notification service for real-time repository

Difficulty: high

Sometimes it's interesting to know when certain words were mentioned on TV. So, the idea is to create a service where users can specify these words and receive e-mail notifications when these words are mentioned. These notifications should contain the context of the words and link to a web site where the user can view all the captions from the TV program and some statistics.

So, this idea includes creating a notification program that reads the caption stream (from the real-time repository) and creating a web site where users can specify words and view when they were mentioned.
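The core of the notification program, finding a user's words in the captions and collecting some context for the e-mail, could be sketched as follows. This assumes the caption stream is already available as a list of text lines; the function name `find_mentions` and the choice of whole-word, case-insensitive matching are illustrative only.

```python
import re


def find_mentions(caption_lines, keywords, context=1):
    """Return (keyword, line_index, context_text) for each whole-word match.

    `context` is the number of caption lines to include before and after the hit.
    """
    patterns = {
        k: re.compile(r"\b" + re.escape(k) + r"\b", re.IGNORECASE) for k in keywords
    }
    mentions = []
    for i, line in enumerate(caption_lines):
        for keyword, pattern in patterns.items():
            if pattern.search(line):
                lo = max(0, i - context)
                hi = min(len(caption_lines), i + context + 1)
                mentions.append((keyword, i, " ".join(caption_lines[lo:hi])))
    return mentions
```

Each returned tuple holds what a notification needs: the word that matched, where it occurred, and the surrounding caption text. Sending the e-mail and linking to the web site are left out.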