Monday 26 May 2014

Detecting the language of a text in C#

A short post about how to add language detection to your application. I will be demonstrating NTextCat, a .NET port of the Perl script text_cat, which is itself an implementation of a paper published in 1994 called N-Gram-Based Text Categorization.

There are four steps to language detection using NTextCat:
  • Reference NTextCat - the library is now available as a NuGet package as well
  • Instantiate an identifier factory (usually RankedLanguageIdentifierFactory)
  • Get an instance of a RankedLanguageIdentifier from the factory by loading a language XML file
  • Call the Identify method on your text and get a list of languages, ordered by how likely it is that the text is written in each of them

Here is a piece of code that does that using the core XML file published with the library. Remember to add the XML file to your project and set its Copy to Output Directory property, so that it ends up next to the executable.

public class LanguageProcessor
{
    private RankedLanguageIdentifier _identifier;

    public string IdentifyLanguage(string text)
    {
        if (_identifier == null)
        {
            // the language model is loaded once, on first use
            var file = new FileInfo(Path.Combine(AppDomain.CurrentDomain.BaseDirectory, "LanguageModels/Core14.profile.xml"));
            if (!file.Exists)
            {
                throw new FileNotFoundException("Could not find LanguageModels/Core14.profile.xml to detect the language");
            }
            using (var readStream = File.OpenRead(file.FullName))
            {
                var factory = new RankedLanguageIdentifierFactory();
                _identifier = factory.Load(readStream);
            }
        }
        var languages = _identifier.Identify(text);
        var mostCertainLanguage = languages.FirstOrDefault(); // the best match comes first
        if (mostCertainLanguage != null)
        {
            return mostCertainLanguage.Item1.Iso639_3;
        }
        return null;
    }
}

There are a lot of XML files available, some built from Wikipedia, for example, and handling 280+ languages, but for the casual purpose of finding non-English text in a list, the core one will suffice.
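
The "finding non-English text in a list" scenario could look like the sketch below. It is only an illustration: NonEnglishFilter and its identifyLanguage parameter are hypothetical names, the delegate stands in for LanguageProcessor.IdentifyLanguage, and the code assumes that English is reported with the ISO 639-3 code "eng".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper, not part of NTextCat.
public static class NonEnglishFilter
{
    // identifyLanguage should return an ISO 639-3 code or null,
    // for example: new LanguageProcessor().IdentifyLanguage
    public static List<string> Find(IEnumerable<string> texts, Func<string, string> identifyLanguage)
    {
        return texts
            .Where(t => !string.IsNullOrWhiteSpace(t)) // nothing to detect in empty entries
            .Where(t => !string.Equals(identifyLanguage(t), "eng", StringComparison.OrdinalIgnoreCase))
            .ToList();
    }
}
```

Note that an entry for which no language could be determined (a null result) is reported as non-English here; whether that is the right call depends on what you do with the list afterwards.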

Tuesday 20 May 2014

A log4net custom appender that creates JIRA issues and notifies users

It has been a while since I've blogged something technical. It's just that nothing I did seemed to be worthy of this majestic blog... Well, jokes aside, here is a post detailing my log4net JIRA appender.

log4net is a popular (if not the most popular) logging framework for .NET. Its strength lies in its configurability: the possibility to create custom loggers, custom appenders, custom filters, etc. I will be talking about a custom appender, a class that can be loaded by log4net to consume the logged lines and put them somewhere. For example, to make an application that uses log4net write the log to the console, all you do is configure it to use the console appender. The JIRA appender takes the log output and creates issues in JIRA, notifying users afterwards. JIRA is a tracker for team planning. It is also very popular.

In order to create an appender, one references the log4net assembly (or NuGet package) and then creates a class that inherits from AppenderSkeleton. We could implement IAppender, but the skeleton class has most of what people want from an appender. The next step is to override the Append method and we are done. We don't want to create an issue with each logged line, though, so we will make it so that it creates the issue after a period of inactivity or when the logger closes. For that we use the CancellationTokenSource class to create delayed actions that we can cancel and recreate. We also need to override OnClose().
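
The "delayed action that we can cancel and recreate" part can be looked at in isolation. Below is a minimal sketch of the idea, independent of log4net; Debouncer is a hypothetical name, and the only framework pieces it relies on are CancellationTokenSource, Task.Delay and ContinueWith.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: runs an action once, after a period with no Trigger calls.
public sealed class Debouncer : IDisposable
{
    private readonly TimeSpan _delay;
    private readonly Action _action;
    private readonly object _lock = new object();
    private CancellationTokenSource _cts = new CancellationTokenSource();

    public Debouncer(TimeSpan delay, Action action)
    {
        _delay = delay;
        _action = action;
    }

    // Every call cancels the pending delay and starts a new one.
    public void Trigger()
    {
        lock (_lock)
        {
            _cts.Cancel();
            _cts = new CancellationTokenSource();
            var token = _cts.Token;
            Task.Delay(_delay, token).ContinueWith(
                t => _action(),
                token,
                TaskContinuationOptions.OnlyOnRanToCompletion, // skip the action if the delay was cancelled
                TaskScheduler.Default);
        }
    }

    // Cancels whatever is still pending; the instance should not be used afterwards.
    public void Dispose()
    {
        lock (_lock)
        {
            _cts.Cancel();
            _cts.Dispose();
        }
    }
}
```

The appender follows the same pattern, with Append playing the role of Trigger and OnClose flushing whatever is left instead of just cancelling.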

For the JIRA API I used a project called AnotherJiraRestClient, but I guess one can use anything out there. You will see that the notify functionality is not implemented, so we have to add it.

Here is the appender source code:
using AnotherJiraRestClient;
using log4net.Appender;
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Text;
using System.Threading;
using System.Threading.Tasks;

namespace JiraLog4netAppender
{
    public class JiraAppender : AppenderSkeleton // most appender functionality is found in this abstract class
    {
        private List<string> _notifyUsers;

        // properties of the appender, configurable in the log4net.config file as children elements of the appender
        public string url { get; set; } // the url of the JIRA site
        public string user { get; set; } // the JIRA user
        public string password { get; set; } // the JIRA password
        public string project { get; set; } // the JIRA project
        public string notify // a comma separated list of JIRA users who will be notified of the issue's creation
        {
            get
            {
                return string.Join(", ", _notifyUsers);
            }
            set
            {
                _notifyUsers.Clear();
                if (!string.IsNullOrWhiteSpace(value))
                {
                    _notifyUsers.AddRange(value.Split(',').Select(s => s.Trim()).Where(s => !string.IsNullOrWhiteSpace(s)));
                }
            }
        }

        CancellationTokenSource _ts;
        StringWriter _sw;
        Task _task;
        private JiraClient _jc;
        private string _priorityId;
        private string _issueTypeId;

        private object _writerLock = new object();

        public JiraAppender()
        {
            _notifyUsers = new List<string>();
            _ts = new CancellationTokenSource(); // use this to cancel the Delay task
            _sw = new StringWriter();
        }

        protected override void Append(log4net.Core.LoggingEvent loggingEvent)
        {
            lock (_writerLock) // writeToJira reads and clears the StringWriter on another thread, so writes take the same lock
            {
                this.Layout.Format(_sw, loggingEvent); // use the appender layout to format the log lines (warning: for this you need to override RequiresLayout and return true)
            }
            _ts.Cancel(); // cancel the task and create a new one. This means as long as the logger writes, the 5 second delay will be reset
            _ts = new CancellationTokenSource();
            _task = Task.Delay(5000, _ts.Token).ContinueWith(writeToJira, _ts.Token); // after 5 seconds (I guess you could make this configurable as well) create a JIRA issue
        }

        protected override bool RequiresLayout
        {
            get
            {
                return true;
            }
        }

        /// <summary>
        /// write to jira, either when 5 seconds of inactivity passed or the appender is closed.
        /// </summary>
        /// <param name="task"></param>
        private void writeToJira(Task task)
        {
            string s;
            lock (_writerLock) // maybe the method was already in progress when another one was called. We need to clear the StringWriter before we allow access to it again
            {
                s = _sw.ToString();
                var sb = _sw.GetStringBuilder();
                sb.Clear();
            }
            if (!string.IsNullOrWhiteSpace(s))
            {
                writeTextToJira(s);
            }
        }

        private void writeTextToJira(string text)
        {
            ensureClientAndValues();
            var summary = "Log: " + this.Name; // the issue summary
            var labels = new List<string> // some labels
            {
                this.Name, this.GetType().Name
            };
            var issue = new AnotherJiraRestClient.JiraModel.CreateIssue(project, summary, text, _issueTypeId, _priorityId, labels); // create the issue with type Issue and priority Trivial
            var basicIssue = _jc.CreateIssue(issue);
            _jc.Notify(basicIssue, _notifyUsers, "JiraAppender created an issue", null, null); // notify users of the issue's creation
        }

        /// <summary>
        /// Make sure we have a JiraClient and that we know the ids of the Issue type and the Trivial priority
        /// </summary>
        private void ensureClientAndValues()
        {
            if (_jc == null)
            {
                _jc = new JiraClient(new JiraAccount
                {
                    ServerUrl = url,
                    User = user,
                    Password = password
                });
            }
            if (_priorityId == null)
            {
                var priority = _jc.GetPriorities().FirstOrDefault(p => p.name == "Trivial");
                if (priority == null)
                {
                    throw new Exception("A priority with the name 'Trivial' was not found");
                }
                _priorityId = priority.id;
            }
            if (_issueTypeId == null)
            {
                var meta = _jc.GetProjectMeta(project);
                var issue = meta.issuetypes.FirstOrDefault(i => i.name == "Issue");
                if (issue == null)
                {
                    throw new Exception("An issue type with the name 'Issue' was not found");
                }
                _issueTypeId = issue.id;
            }
        }

        protected override void OnClose() // clean what you can and write to jira if there is anything to write
        {
            _ts.Cancel();
            writeToJira(null);
            _sw.Dispose();
            _ts.Dispose();
            _task = null;
            base.OnClose();
        }
    }
}

As I said, AnotherJiraRestClient does not implement the notify API call needed to inform the users of the creation of an issue, so we need to change the project a little bit. Perhaps when you are implementing this, you will find notify already there, with a different format, but just in case you don't:
  • add to JiraClient the following method:
    public void Notify(BasicIssue issue, IEnumerable<string> userList, string subject, string textBody, string htmlBody)
    {
        var request = new RestRequest()
        {
            Method = Method.POST,
            Resource = ResourceUrls.Notify(issue.id),
            RequestFormat = DataFormat.Json
        };
        request.AddBody(new NotifyData
        {
            subject = subject,
            textBody = textBody,
            htmlBody = htmlBody,
            to = new NotifyTo
            {
                users = userList.Select(u => new User
                {
                    active = false,
                    name = u
                }).ToList()
            }
        });
        var response = client.Execute(request);
        if (response.StatusCode != HttpStatusCode.NoContent)
        {
            throw new JiraApiException("Failed to notify users from issue with id=" + issue.id + " (" + response.StatusCode + ")");
        }
    }
  • add to ResourceUrls the following method:
    public static string Notify(string issueId)
    {
        return Url(string.Format("issue/{0}/notify", issueId));
    }
  • create the following classes in the JiraModel folder and namespace:
    public class NotifyData
    {
        public string subject { get; set; }
        public string textBody { get; set; }
        public string htmlBody { get; set; }
        public NotifyTo to { get; set; }
    }

    public class NotifyTo
    {
        public List<User> users { get; set; }
    }

    public class User
    {
        public string name { get; set; }
        public bool active { get; set; }
    }

Here is an example configuration. Have fun!
<log4net>
    <appender name="JiraAppender" type="JiraLog4netAppender.JiraAppender, JiraLog4netAppender">
        <url value="https://my.server.url/jira"/>
        <user value="jirauser"/>
        <password value="jirapassword"/>
        <project value="jiraproject"/>
        <notify value="siderite,someotheruser" />
        <layout type="log4net.Layout.PatternLayout">
            <conversionPattern value="%date [%thread] %-5level %logger [%property{NDC}] - %message%newline" />
        </layout>
        <filter type="log4net.Filter.LevelRangeFilter">
            <levelMin value="WARN" />
            <levelMax value="FATAL" />
        </filter>
    </appender>
    <root>
        <level value="DEBUG" />
        <appender-ref ref="JiraAppender" />
    </root>
</log4net>

Sunday 18 May 2014

Algorithms as cash, now algorithms in company board of directors

The world is getting into a trend of using software and hardware in places traditionally reserved for humans or "real life" applications. An extraordinary example of this, something that I consider way ahead of its time, is the creation of Bitcoins. You see, money is supposed to be an abstraction of wealth, but wealth, for all intents and purposes, is determined by some material such as gold or gems. It was the previous layer of abstraction, right after people would do commerce using sheep and goats and women: they used something they mined from under the earth to equate value of real things. Enter Bitcoin, something that you "mine" using processing power, using an algorithm, having weight of value determined only by the computation and strength of encryption used. After all, gold in itself had no value either when it was used as currency: it only had weight of value by the effort of getting out of the ground and its rarity.

I didn't write about Bitcoins until now because frankly I don't get how it works. I get the gist of it, but I would like to understand the algorithm behind it, which relies heavily on encryption. I am terrible at any type of cryptography, though. However, the story that prompted me to write a blog entry today is, in a way, even weirder than Bitcoins. A BBC article entitled "Algorithm appointed board director" describes how venture capital firm Vital assigned an automated algorithm a vote in the decision to invest or not in a company. Of course, this has been done before, but indirectly. Some guy would use a computer in some basement of the office building, trying to determine if a company is viable or not using computer programs. He would then pass the information to some manager, who would pass it on to his manager until it got to some member of the board who voted in the council. This story, though, describes a complete removal of middlemen. Of course, there is still some techie that takes care of the IT department. He probably could sway the algorithm to vote for a company he personally likes, if he wanted to, but the techies are always subsumed in the concept of "machine", perhaps until they learn to service each other. The article says the same thing, really, but fails to recognize the importance of the removal of that chain of middle managers whose jobs are only to filter information from downstairs and decisions from upstairs. It is akin to Blake's 7, where the seventh was a machine.

I started this blog entry talking about a trend and I've just described two applications, but there are really a lot more. A trend is also powered by public perception, which is powered by media outlets. So I am not only talking about the self driving cars, the killing drones, the self landing probes, the algorithmic news aggregators and even writers, but also about the TV series and movies that are spreading the seed of this idea of cyber-revolution: Intelligence, Transcendence, The Machine, Robocop, even Silicon Valley, if you've seen the episode where the self driving car abducts someone. All of these are released this year, as well.

Of course, there has always been some sort of golem like creature in human fantasy, but it was always about creatures/machines trying to gain unrestricted power, power that belongs to humans only or maybe only to gods (in essence, something that should not be had). Some of them, like Star Trek's Data, were benign, always trying to achieve the "humanity" evident in his colleagues, but even in that story there was always the underlying idea that if he managed to reach his goal, then Data would become an immortal creature of human ability, but also a lot more than that: a superhuman. In this decade we see the rise of transhumanism, something I wholly support, the self-evolution of the human being, but also of the singularity, the self-evolution of machines to the point where we are left behind. This very familiar notion of competition between memes makes it accessible, in that "us or them" kind of way that every human, no matter how idiotic, resonates with, but also interesting because of the commonality found in the two concepts: it's either the machines overtaking the evolution rate of the human animal or the humans accelerating the evolution of the human animal. It's all about overcoming that beast that exists in us and that we think of as "not me", but that guides most of our existence. I hope no amount of adolescent fantasies and "emotion over matter" garbage will be able to undo this. And I am terribly excited because I believe that by the end of my life I will see (if not become) one or the other happening.

Friday 2 May 2014

Adventures in the .Net file system library System.IO

Intro
Sometimes, after making some software or another and feeling all proud that everything works, I get an annoying exception that the path of the files my program uses is too long, the horrible PathTooLongException. At first I thought it was a filesystem thing or an operating system thing, or even a .Net version thing, but it is not. The exception would appear in .Net 4.5 apps running on Windows 7 systems that used NTFS on modern hardware.

Lazy as any software dev, I googled for it and found a quick and dirty replacement for System.IO called Delimon that mitigated the issue. This library tries to do most of what System.IO does, but using the so-called extended-length paths, stuff that looks like \\?\C:\Windows, with external functions accessed directly from kernel32.dll. All good and nice, but the library is hardly perfect. It has bugs, it doesn't implement all of the System.IO functionality and it feels wrong. Why would Microsoft, the great and mighty, allow such a ridiculous problem to persist in their core filesystem library?
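
To make the extended-length form concrete, here is a small sketch that turns an absolute path into its \\?\ equivalent. ExtendedPath is a hypothetical helper written for this post, not code taken from Delimon, and it assumes the input is already a full path, because Windows does no normalization (relative segments, forward slashes) on paths carrying this prefix.

```csharp
using System;

// Hypothetical helper, not part of System.IO or Delimon.
public static class ExtendedPath
{
    private const string Prefix = @"\\?\";
    private const string UncPrefix = @"\\?\UNC\";
    private const string DevicePrefix = @"\\.\";

    // Expects an absolute path such as C:\Windows or \\server\share\folder.
    public static string ToExtended(string fullPath)
    {
        if (fullPath == null) throw new ArgumentNullException("fullPath");
        // already extended-length or a device path: leave untouched
        if (fullPath.StartsWith(Prefix, StringComparison.Ordinal) ||
            fullPath.StartsWith(DevicePrefix, StringComparison.Ordinal))
        {
            return fullPath;
        }
        // UNC paths drop their leading \\ and get the UNC flavour of the prefix
        if (fullPath.StartsWith(@"\\", StringComparison.Ordinal))
        {
            return UncPrefix + fullPath.Substring(2);
        }
        return Prefix + fullPath;
    }
}
```

The resulting string is what gets handed to the wide-character kernel32 functions; System.IO itself, as described in the rest of this post, will not accept it.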

And the answer is: because it is a core library. Probably they would have to make an enormous testing effort to change anything there. It is something from a managed code developer's nightmare: unsafe access to system libraries, code that spans decades of work and a maze of internal fields, attributes, methods and properties that can be used only from inside the library itself. Not to mention all those people who decided to solve problems in core classes using reflection and stuff. My guess is that they probably want to replace file system usage with another API, like Windows.Storage, which, alas, is only available to Windows Store and Windows Phone apps.


In this blog post I will discuss the System.IO problems that relate to the total length of a path, what causes them and possible solutions (if any).


Let's start with the exception itself: PathTooLongException. Looking for usages of the exception in the mscorlib.dll assembly of .Net 4.0 we see some interesting things. First of all, there is a direct translation from Windows IO error code 206 to this exception, so that means that, in principle, there should be no managed code throwing this exception at all. The operating system should complain if there is a path length issue. But that is not the case in System.IO.
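
For illustration, a translation from a Windows error code to an exception has roughly the shape below. This is a simplified sketch and not the mscorlib source: Win32ErrorSketch and ToException are made-up names, and only three well-known codes are handled (2 is ERROR_FILE_NOT_FOUND, 3 is ERROR_PATH_NOT_FOUND, 206 is ERROR_FILENAME_EXCED_RANGE).

```csharp
using System;
using System.IO;

// Hypothetical sketch of an error code to exception translation.
public static class Win32ErrorSketch
{
    private const int ErrorFileNotFound = 2;
    private const int ErrorPathNotFound = 3;
    private const int ErrorFilenameExceedsRange = 206;

    public static Exception ToException(int errorCode, string path)
    {
        switch (errorCode)
        {
            case ErrorFileNotFound:
                return new FileNotFoundException("File not found.", path);
            case ErrorPathNotFound:
                return new DirectoryNotFoundException("Path not found: " + path);
            case ErrorFilenameExceedsRange:
                // the only place where, in principle, PathTooLongException should come from
                return new PathTooLongException("Path too long: " + path);
            default:
                return new IOException("Win32 error " + errorCode + " for " + path);
        }
    }
}
```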

Most of the other usages of the exception come from the class PathHelper, a helper class used by the System.IO.Path class in a single method: NormalizePath. Wonderful method, that: internal static unsafe. PathHelper is like a multiple personality class, the active one being determined by the field useStackAlloc. If set to true, then it uses memory and speed optimized code, but assumes that the longest path will always be 260. That's a constant, it is not something read from the operating system. If set to false, the max path length is also provided as a parameter. Obviously, useStackAlloc is set to true in most situations. We will talk about NormalizePath a bit later.

The other usages of the PathTooLongException class come from two Directory classes: Directory and LongPathDirectory. If you instantly thought "Oh, God, I didn't know there was a LongPathDirectory class! We can use that and all problems disappear!", I have bad news for you. LongPathDirectory is an internal class. Other than that it seems to be a copy paste clone of Directory that uses Path.LongMaxPath instead of hardcoded constants (248 maximum directory name length, for example) or... Path.MaxPath. If you thought MaxPath and LongMaxPath are properties that can be set to fix long path problems, I have bad news for you: they are internal constants set to 260 and 32000, respectively. Who uses this LongPathDirectory class, though? The System.IO.IsolatedStorage namespace. We'll get back to this in a moment.

Back to Path.NormalizePath. It is a nightmare method that uses a lot of internal constants, system calls, convoluted code; it seems like someone deliberately tried to obfuscate its code. It's an internal method, of course, which makes no sense, as the functionality of path normalization would be useful in a lot of scenarios. Its first parameter is path, then fullCheck, which when true leads to extra character validation. The fourth parameter is expandShortPaths which calls the GetLongPathName function of kernel32.dll. The third parameter is more interesting, it specifies the maximum path length which is sent to PathHelper or makes local checks on the path length. But who uses this parameter?

And now we find a familiar pattern: there is a class (internal of course) called LongPath, which seems to be a clone of Path, only designed to work with long paths. Who uses LongPath? LongPathDirectory, LongPathFile and classes in the System.IO.IsolatedStorage namespace!


So, another idea becomes apparent. Can we use System.IO.IsolatedStorage to have nice access to the file system? No, we can't, for at least two reasons. First of all, the isolated storage paradigm is different from what we want to achieve: it doesn't access the raw file system, instead it isolates files in containers that are accessible to that machine, domain, application or user only. Second, even if we could get an "isolated" store that would represent the file system (which we can't), we would still have to contend with the string based way in which IsolatedStorage works. It is interesting to note that IsolatedStorage is pretty much deprecated by the Windows 8 Windows.Storage API, so forget about it. Yeah, so we have LongPathDirectory, LongPathFile and LongPath classes, but we can't really use them. Besides, what we want is something more akin to DirectoryInfo and FileInfo, which have no LongPath versions.

What can we do about it, then? One solution is to use Delimon. It has some bugs, but they can be avoided or fixed, either by the developers or by getting the source/decompiling the library and fixing the bugs yourself. A limited, but functional solution.
An interesting alternative is to use one of the libraries published for long path access: LongPath, from the BCL team, which seems to contain classes similar to the ones we find in mscorlib, but whose latest release is from 2010, or Zeta Long Paths, which has a more recent release (2013), but is completely different, using the FileInfo and DirectoryInfo paradigm, too.

Of course, you can always make your own API.

Another solution is to be aware of the places where the length limitation appears and avoid them via other type of development, in other words, a file system best practices compilation that eventually turns into a new file system API.

Both solutions coalesce into using some other library instead of System.IO. That's horrible and I think a stain on .Net's honor!


So let's see where the exception is actually thrown.

I've made some tests. First of all, I used FAR Manager, a file manager, to create folders with names of various lengths. The longest name I could create was 255 characters; anything longer got me an exception. To my surprise, Windows Explorer could see the folder, but it could not open or copy/rename/delete it. I reduced the size of its name until the total size of the path was 260, then I could manipulate it in Windows Explorer. So there are external reasons for not creating paths that long, but we see that there are tools that can access files like that. Let's attempt to create some programmatically.

System.IO.Directory.CreateDirectory immediately fires the exception. DirectoryInfo has no problem instantiating with the long path as the parameter, but the Create method throws the same exception. Any attempt to create a folder of more than 248 characters, even if the total path was less than 260 characters, failed as well.
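
The limits seen in these tests can be checked up front, before System.IO gets a chance to throw. The sketch below does only that; LegacyPathLimits is a hypothetical helper, the numbers are the ones quoted in this post (260 for a full path, 248 for a directory), and treating them as exclusive upper bounds is my assumption about how the comparison is done.

```csharp
using System;

// Hypothetical helper: predicts whether System.IO is likely to reject a path as too long.
public static class LegacyPathLimits
{
    public const int MaxPath = 260;          // total path limit quoted in this post
    public const int MaxDirectoryPath = 248; // directory limit quoted in this post

    // fullPath is expected to be an absolute path, without the \\?\ prefix.
    public static bool IsLikelyTooLong(string fullPath, bool isDirectory)
    {
        if (fullPath == null) throw new ArgumentNullException("fullPath");
        int limit = isDirectory ? MaxDirectoryPath : MaxPath;
        // assumption: a path whose length reaches the limit is already rejected
        return fullPath.Length >= limit;
    }
}
```

A check like this does not make long paths usable, it only lets the program fail early with a message of its own choosing or switch to one of the alternatives.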

However, with reflection to the rescue, I could create paths as long as 32000 characters and folders with names as long as 255 characters using our friend LongPathDirectory:
var longPathDirectoryType = typeof(System.IO.Directory).Assembly.GetTypes().First(t=>t.Name=="LongPathDirectory");
var createDirectoryMethodInfo = longPathDirectoryType.GetMethod("CreateDirectory", System.Reflection.BindingFlags.Static | System.Reflection.BindingFlags.NonPublic);
createDirectoryMethodInfo.Invoke(null, new object[] { path });

What about files? FAR Manager threw the same errors if I tried to create a filename larger than 255 characters. Let's try to create the same programmatically.

File.Create threw the exception, as well as FileInfo.Create and the FileStream constructors.

So can we use the same method and use LongPathFile? No! Because LongPathFile doesn't have the creating and opening functionalities of File. Instead, FileStream has a constructor that specifies useLongPath. It is internal, of course, and used only by IsolatedStorageFileStream!

Code to create a file:
var fileStreamConstructorInfo = typeof(System.IO.FileStream).GetConstructor(
    System.Reflection.BindingFlags.NonPublic | System.Reflection.BindingFlags.Instance, null,
    new Type[] {
        typeof(string) /*path*/, typeof(System.IO.FileMode) /*mode*/, typeof(System.IO.FileAccess) /*access*/,
        typeof(System.IO.FileShare) /*share*/, typeof(int) /*bufferSize*/, typeof(System.IO.FileOptions) /*options*/,
        typeof(string) /*msgPath*/, typeof(bool) /*bFromProxy*/, typeof(bool) /*useLongPath*/, typeof(bool) /*checkHost*/
    }, null);
var stream = (System.IO.Stream)fileStreamConstructorInfo.Invoke(new object[] {
    path, System.IO.FileMode.Create, System.IO.FileAccess.Write,
    System.IO.FileShare.None, 4096, System.IO.FileOptions.None,
    System.IO.Path.GetFileName(path), false, true, false
});
Horrible, but it works. Again, no filenames bigger than 255 and the exception coming from the file system, as it should. Some info about the parameters: msgPath is the name of the file opened by the stream, if bFromProxy is true the stream doesn't try to demand some security permissions, checkHost... does nothing :) Probably someone wanted to add a piece of code there, but forgot about it.

Why did I use 4096 as the buffer size? Because that is the default value used by .Net when not specifying the value. Kind of low, right?

Now, this is some sort of midway alternative to using a third party file system library: you invoke through reflection code that is done by Microsoft and hidden for no good reason. I don't condone using it, unless you really need it. What I expect from the .Net framework is that it takes care of all this for me. It seems (as detailed in this blog post), that efforts are indeed made. A little late, I'd say, but still.